2026-05-23T14:47:06Z https://keep.lib.asu.edu/oai/request

oai:keep.lib.asu.edu:node-202455 2025-08-18T22:22:09Z oai_pmh:all oai_pmh:repo_items

202455
https://hdl.handle.net/2286/R.2.N.202455
http://rightsstatements.org/vocab/InC/1.0/
All Rights Reserved
2025
89 pages
Masters Thesis
Academic theses
en
Sahoo, Karam Kumar; Seifi, Hasti; Fazli, Pooyan; Bryan, Chris
Arizona State University
Partial requirement for: M.S., Arizona State University, 2025
Field of study: Computer Science

Blind and Low Vision (BLV) users face significant challenges in perceiving and interacting with spatial information in Virtual Reality (VR) environments, which are primarily designed around visual cues. Context-aware descriptions can help bridge this gap by providing real-time, relevant, and personalized information. This study introduces Third-AI, a Visual Question Answering (VQA) framework that leverages state-of-the-art Vision-Language Models (VLMs) to enhance accessibility and usability for eyes-free navigation in VR. The research focuses on two primary goals: (i) generating real-time contextual scene descriptions based on user intent using VLMs and prompt engineering; and (ii) developing an audio-based interface for seamless interaction using head-mounted displays (HMDs), minimizing cognitive effort. By integrating GPT-4o and computer vision models, Third-AI delivers contextually rich and real-time scene descriptions in response to user queries, improving spatial awareness in VR. To evaluate its effectiveness, the study conducts user studies across two conditions: with 3 BLV participants and 10 sighted participants in simulated low vision virtual scenes. Quantitative results showed high usability scores, with average SUS scores of μBLV = 81.33 (σ = 6.29) and μSim = 80.56 (σ = 13.85). Overall NASA Task Load Index (NASA-TLX) workload scores were μBLV = 33.78 (σ = 7.34) and μSim = 28.36 (σ = 8.98), where lower scores indicate reduced perceived cognitive workload during interaction. MEC Spatial Presence Questionnaire (MEC-SPQ) scores indicated a high sense of spatial presence with means of μBLV = 4.25 and μSim = 4.48.
Qualitative results from post-study interviews revealed that users found the audio-first interface intuitive, even for those with no prior VR experience. Users appreciated the ability to receive detailed yet efficient descriptions tailored to their visual queries, though some emphasized the need for more conversational and humanized VQA responses. These findings suggest that real-time, audio-based VQA systems hold strong promise for enhancing accessibility in immersive VR settings.

Computer Science; Accessibility; BLV; VLM; VQA; VR
Third-AI: Audio-First Visual Question Answering System for Eyes-Free Exploration in Virtual Reality