Theses and Dissertations
Description
Simultaneous localization and mapping (SLAM) has traditionally relied on low-level geometric or optical features. However, these feature-based SLAM methods often struggle with featureless or repetitive scenes. Additionally, low-level features may not provide sufficient information for robot navigation and manipulation, leaving robots without a complete understanding of the 3D spatial world. Higher-level information is necessary to address these limitations. Fortunately, recent developments in learning-based 3D reconstruction allow robots not only to detect semantic meaning, but also to recognize the 3D structure of objects from a few images. By incorporating this 3D structural information, SLAM can be improved from a low-level approach to a structure-aware approach. This work proposes a novel approach for multi-view 3D reconstruction using a recurrent transformer. This approach allows robots to accumulate information from multiple views and encode it into a compact latent space. The resulting latent representations are then decoded to produce 3D structural landmarks, which can be used to improve robot localization and mapping.
Contributors: Huang, Chi-Yao (Author) / Yang, Yezhou (Thesis advisor) / Turaga, Pavan (Committee member) / Jayasuriya, Suren (Committee member) / Arizona State University (Publisher)
Created: 2023
Description
I present my work on a scalable and programmable I/O controller for region-based computing, which will be used in a rhythmic pixel-based camera pipeline. I provide a breakdown of the development and design of the I/O controller and how it fits into rhythmic pixel regions, along with a study on the memory traffic of rhythmic pixel regions and how this translates to energy efficiency. This rhythmic pixel region-based camera pipeline has been jointly developed through Dr. Robert LiKamWa’s research lab. High spatiotemporal resolutions allow high precision for vision applications, such as feature detection for augmented reality or face detection. High spatiotemporal resolution also comes with high memory throughput, leading to higher energy usage. This creates a tradeoff between high precision and energy efficiency, which becomes more important in mobile systems. In addition, not all pixels in a frame are necessary for the vision application, such as pixels that make up the background. Rhythmic pixel regions aim to reduce the tradeoff by creating a pipeline that allows an application developer to specify regions to capture at a non-uniform spatiotemporal resolution. This is accomplished by encoding the incoming image and sending only the pixels within these specified regions. Later, these encoded representations are decoded to a standard frame representation usable by traditional vision applications. My contribution to this effort has been the design, testing, and evaluation of the I/O controller.
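The encode-then-decode idea described above can be illustrated with a minimal sketch. This is not the thesis's encoding format or hardware design: the names `Region`, `encode`, and `decode`, the rectangular region shape, and the per-region `stride` parameter are all assumptions made for illustration, and the temporal (per-frame skipping) part of the resolution control is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical rectangular region with a spatial sampling stride."""
    x: int          # left column
    y: int          # top row
    w: int          # width in pixels
    h: int          # height in pixels
    stride: int = 1  # keep every stride-th pixel in each direction


def encode(frame, regions):
    """Keep only the pixels sampled by at least one region.

    Returns a sorted list of (row, col, value); each pixel appears once
    even if regions overlap.
    """
    height, width = len(frame), len(frame[0])
    kept = {}
    for r in regions:
        for row in range(r.y, min(r.y + r.h, height), r.stride):
            for col in range(r.x, min(r.x + r.w, width), r.stride):
                kept[(row, col)] = frame[row][col]
    return [(row, col, value) for (row, col), value in sorted(kept.items())]


def decode(encoded, height, width, fill=0):
    """Rebuild a full-size frame; pixels that were not sent get `fill`."""
    frame = [[fill] * width for _ in range(height)]
    for row, col, value in encoded:
        frame[row][col] = value
    return frame


# A 4x4 test frame whose pixel value is its raster index.
frame = [[row * 4 + col for col in range(4)] for row in range(4)]
regions = [
    Region(x=0, y=0, w=2, h=2),            # full resolution
    Region(x=2, y=2, w=2, h=2, stride=2),  # subsampled
]
encoded = encode(frame, regions)
decoded = decode(encoded, height=4, width=4)
```

In this toy case only 5 of the 16 pixels are stored, which is the kind of memory traffic reduction the abstract refers to; the decoded frame has the standard shape a conventional vision application would expect.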
Contributors: Nguyen, Van (Author) / LiKamWa, Robert (Thesis advisor) / Jayasuriya, Suren (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
A video clip carries not merely an aggregation of static entities, but also a variety of interactions and relations among these entities. Challenges still remain for a video captioning system to generate natural language descriptions that focus on the most prominent content and align with latent aspects beyond what is directly observed. This work presents a Commonsense knowledge Anchored Video cAptioNing (dubbed CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model with a novel paradigm for sentence-level semantic alignment. Specifically, the generic knowledge atlas ATOMIC is queried for commonsense knowledge to complement each training caption, forming a commonsense-caption entailment corpus. A BERT-based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model, penalizing the model for generating semantically misaligned captions. In extensive empirical evaluations on the MSR-VTT, V2C, and VATEX datasets, CAVAN consistently improves the quality of generated captions and shows a higher keyword hit rate. Experimental results with ablations validate the effectiveness of CAVAN and reveal that the use of commonsense knowledge contributes to video caption generation.
Contributors: Shao, Huiliang (Author) / Yang, Yezhou (Thesis advisor) / Jayasuriya, Suren (Committee member) / Xiao, Chaowei (Committee member) / Arizona State University (Publisher)
Created: 2022