Search Content

Topology Processing of Retinotopic Maps

Description

Retinotopic map, the map between visual inputs on the retina and neuronal activation in brain visual areas, is one of the central topics in visual neuroscience. For human observers, the map is typically obtained by analyzing functional magnetic resonance imaging (fMRI) signals of cortical responses to slowly moving visual stimuli…

Retinotopic map, the map between visual inputs on the retina and neuronal activation in brain visual areas, is one of the central topics in visual neuroscience. For human observers, the map is typically obtained by analyzing functional magnetic resonance imaging (fMRI) signals of cortical responses to slowly moving visual stimuli on the retina. Biological evidences show the retinotopic mapping is topology-preserving/topological (i.e. keep the neighboring relationship after human brain process) within each visual region. Unfortunately, due to limited spatial resolution and the signal-noise ratio of fMRI, state of art retinotopic map is not topological. The topic was to model the topology-preserving condition mathematically, fix non-topological retinotopic map with numerical methods, and improve the quality of retinotopic maps. The impose of topological condition, benefits several applications. With the topological retinotopic maps, one may have a better insight on human retinotopic maps, including better cortical magnification factor quantification, more precise description of retinotopic maps, and potentially better exam ways of in Ophthalmology clinic.

ContributorsTu, Yanshuai (Author) / Wang, Yalin (Thesis advisor) / Lu, Zhong-Lin (Committee member) / Crook, Sharon (Committee member) / Yang, Yezhou (Committee member) / Zhang, Yu (Committee member) / Arizona State University (Publisher)

Created2022

ARGOS: Adaptive Recognition and Gated Operation System for Real-time Vision Applications

Description

Deep neural network-based methods have been proved to achieve outstanding performance on object detection and classification tasks. Deep neural networks follow the ``deeper model with deeper confidence'' belief to gain a higher recognition accuracy. However, reducing these networks' computational costs remains a challenge, which impedes their deployment on embedded devices.…

Deep neural network-based methods have been proved to achieve outstanding performance on object detection and classification tasks. Deep neural networks follow the ``deeper model with deeper confidence'' belief to gain a higher recognition accuracy. However, reducing these networks' computational costs remains a challenge, which impedes their deployment on embedded devices. For instance, the intersection management of Connected Autonomous Vehicles (CAVs) requires running computationally intensive object recognition algorithms on low-power traffic cameras. This dissertation aims to study the effect of a dynamic hardware and software approach to address this issue. Characteristics of real-world applications can facilitate this dynamic adjustment and reduce the computation. Specifically, this dissertation starts with a dynamic hardware approach that adjusts itself based on the toughness of input and extracts deeper features if needed. Next, an adaptive learning mechanism has been studied that use extracted feature from previous inputs to improve system performance. Finally, a system (ARGOS) was proposed and evaluated that can be run on embedded systems while maintaining the desired accuracy. This system adopts shallow features at inference time, but it can switch to deep features if the system desires a higher accuracy. To improve the performance, ARGOS distills the temporal knowledge from deep features to the shallow system. Moreover, ARGOS reduces the computation furthermore by focusing on regions of interest. The response time and mean average precision are adopted for the performance evaluation to evaluate the proposed ARGOS system.

ContributorsFarhadi, Mohammad (Author) / Yang, Yezhou (Thesis advisor) / Vrudhula, Sarma (Committee member) / Wu, Carole-Jean (Committee member) / Ren, Yi (Committee member) / Arizona State University (Publisher)

Created2022

From SLAM to Spatial AI: Using Implicit 3D Latent Space Landmark Reconstruction for Robot Localization and Mapping

Description

Simultaneous localization and mapping (SLAM) has traditionally relied on low-level geometric or optical features. However, these features-based SLAM methods often struggle with feature-less or repetitive scenes. Additionally, low-level features may not provide sufficient information for robot navigation and manipulation, leaving robots without a complete understanding of the 3D spatial world.…

Simultaneous localization and mapping (SLAM) has traditionally relied on low-level geometric or optical features. However, these features-based SLAM methods often struggle with feature-less or repetitive scenes. Additionally, low-level features may not provide sufficient information for robot navigation and manipulation, leaving robots without a complete understanding of the 3D spatial world. Advanced information is necessary to address these limitations. Fortunately, recent developments in learning-based 3D reconstruction allow robots to not only detect semantic meanings, but also recognize the 3D structure of objects from a few images. By combining this 3D structural information, SLAM can be improved from a low-level approach to a structure-aware approach. This work propose a novel approach for multi-view 3D reconstruction using recurrent transformer. This approach allows robots to accumulate information from multiple views and encode them into a compact latent space. The resulting latent representations are then decoded to produce 3D structural landmarks, which can be used to improve robot localization and mapping.

ContributorsHuang, Chi-Yao (Author) / Yang, Yezhou (Thesis advisor) / Turaga, Pavan (Committee member) / Jayasuriya, Suren (Committee member) / Arizona State University (Publisher)

Created2023

Neural Retriever-Reader for Information Retrieval and Question Answering

Description

In the era of information explosion and multi-modal data, information retrieval (IR) and question answering (QA) systems have become essential in daily human activities. IR systems aim to find relevant information in response to user queries, while QA systems provide concise and accurate answers to user questions. IR and…

In the era of information explosion and multi-modal data, information retrieval (IR) and question answering (QA) systems have become essential in daily human activities. IR systems aim to find relevant information in response to user queries, while QA systems provide concise and accurate answers to user questions. IR and QA are two of the most crucial challenges in the realm of Artificial Intelligence (AI), with wide-ranging real-world applications such as search engines and dialogue systems. This dissertation investigates and develops novel models and training objectives to enhance current retrieval systems in textual and multi-modal contexts. Moreover, it examines QA systems, emphasizing generalization and robustness, and creates new benchmarks to promote their progress. Neural retrievers have surfaced as a viable solution, capable of surpassing the constraints of traditional term-matching search algorithms. This dissertation presents Poly-DPR, an innovative multi-vector model architecture that manages test-query, and ReViz, a comprehensive multimodal model to tackle multi-modality queries. By utilizing IR-focused pretraining tasks and producing large-scale training data, the proposed methodology substantially improves the abilities of existing neural retrievers.Concurrently, this dissertation investigates the realm of QA systems, referred to as ``readers'', by performing an exhaustive analysis of current extractive and generative readers, which results in a reliable guidance for selecting readers for downstream applications. Additionally, an original reader (Two-in-One) is designed to effectively choose the pertinent passages and sentences from a pool of candidates for multi-hop reasoning. This dissertation also acknowledges the significance of logical reasoning in real-world applications and has developed a comprehensive testbed, LogiGLUE, to further the advancement of reasoning capabilities in QA systems.

ContributorsLuo, Man (Author) / Baral, Chitta (Thesis advisor) / Yang, Yezhou (Committee member) / Blanco, Eduardo (Committee member) / Chen, Danqi (Committee member) / Arizona State University (Publisher)

Created2023

Multimodal Robot Learning for Grasping and Manipulation

Description

Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled engineers. Changing the robots’ behavior to cover new duties or…

Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled engineers. Changing the robots’ behavior to cover new duties or handle variability is an expensive, complex, and time-consuming process. However, with the advent of more complex sensors and algorithms, overcoming these limitations becomes within reach. This work proposes innovations in artificial intelligence, language understanding, and multimodal integration to enable next-generation grasping and manipulation capabilities in autonomous robots. The underlying thesis is that multimodal observations and instructions can drastically expand the responsiveness and dexterity of robot manipulators. Natural language, in particular, can be used to enable intuitive, bidirectional communication between a human user and the machine. To this end, this work presents a system that learns context-aware robot control policies from multimodal human demonstrations. Among the main contributions presented are techniques for (a) collecting demonstrations in an efficient and intuitive fashion, (b) methods for leveraging physical contact with the environment and objects, (c) the incorporation of natural language to understand context, and (d) the generation of robust robot control policies. The presented approach and systems are evaluated in multiple grasping and manipulation settings ranging from dexterous manipulation to pick-and-place, as well as contact-rich bimanual insertion tasks. Moreover, the usability of these innovations, especially when utilizing human task demonstrations and communication interfaces, is evaluated in several human-subject studies.

ContributorsStepputtis, Simon (Author) / Ben Amor, Heni (Thesis advisor) / Baral, Chitta (Committee member) / Yang, Yezhou (Committee member) / Lee, Stefan (Committee member) / Arizona State University (Publisher)

Created2021

QU-Net: A Lightweight U-Net based Region Proposal System

Description

In recent years, there has been significant progress in deep learning and computer vision, with many models proposed that have achieved state-of-art results on various image recognition tasks. However, to explore the full potential of the advances in this field, there is an urgent need to push the processing of…

In recent years, there has been significant progress in deep learning and computer vision, with many models proposed that have achieved state-of-art results on various image recognition tasks. However, to explore the full potential of the advances in this field, there is an urgent need to push the processing of deep networks from the cloud to edge devices. Unfortunately, many deep learning models cannot be efficiently implemented on edge devices as these devices are severely resource-constrained. In this thesis, I present QU-Net, a lightweight binary segmentation model based on the U-Net architecture. Traditionally, neural networks consider the entire image to be significant. However, in real-world scenarios, many regions in an image do not contain any objects of significance. These regions can be removed from the original input allowing a network to focus on the relevant regions and thus reduce computational costs. QU-Net proposes the salient regions (binary mask) that the deeper models can use as the input. Experiments show that QU-Net helped achieve a computational reduction of 25% on the Microsoft Common Objects in Context (MS COCO) dataset and 57% on the Cityscapes dataset. Moreover, QU-Net is a generalizable model that outperforms other similar works, such as Dynamic Convolutions.

ContributorsSanthosh Kumar Varma, Rahul (Author) / Yang, Yezhou (Thesis advisor) / Fan, Deliang (Committee member) / Yang, Yingzhen (Committee member) / Arizona State University (Publisher)

Created2021

Attributable Watermarking of Speech Generative Models

Description

Generative models in various domain such as images, speeches, and videos are beingdeveloped actively over the last decades and recent deep generative models are now capable of synthesizing multimedia contents are difficult to be distinguishable from authentic contents. Such capabilities cause concerns such as malicious impersonation, Intellectual property theft(IP theft) and copyright infringement. One…

Generative models in various domain such as images, speeches, and videos are beingdeveloped actively over the last decades and recent deep generative models are now capable of synthesizing multimedia contents are difficult to be distinguishable from authentic contents. Such capabilities cause concerns such as malicious impersonation, Intellectual property theft(IP theft) and copyright infringement. One method to solve these threats is to embedded attributable watermarking in synthesized contents so that user can identify the user-end models where the contents are generated from. This paper investigates a solution for model attribution, i.e., the classification of synthetic contents by their source models via watermarks embedded in the contents. Existing studies showed the feasibility of model attribution in the image domain and tradeoff between attribution accuracy and generation quality under the various adversarial attacks but not in speech domain. This work discuss the feasibility of model attribution in different domain and algorithmic improvements for generating user-end speech models that empirically achieve high accuracy of attribution while maintaining high generation quality. Lastly, several experiments are conducted show the tradeoff between attributability and generation quality under a variety of attacks on generated speech signals attempting to remove the watermarks.

ContributorsCho, Yongbaek (Author) / Yang, Yezhou (Thesis advisor) / Ren, Yi (Committee member) / Trieu, Ni (Committee member) / Arizona State University (Publisher)

Created2021

Implicit Hypothetical Reasoning about Intrinsic Physical Properties

Description

Multimodal reasoning is one of the most interesting research fields because of the ability to interact with systems and the explainability of the models' behavior. Traditional multimodal research problems do not focus on complex commonsense reasoning (such as physical interactions). Although real-world objects have physical properties associated with them,…

Multimodal reasoning is one of the most interesting research fields because of the ability to interact with systems and the explainability of the models' behavior. Traditional multimodal research problems do not focus on complex commonsense reasoning (such as physical interactions). Although real-world objects have physical properties associated with them, many of these properties (such as mass and coefficient of friction) are not captured directly by the imaging pipeline. Videos often capture objects, their motion, and the interactions between different objects. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. This thesis introduces a new video question-answering task for reasoning about the implicit physical properties of objects in a scene, from videos. For this task, I introduce a dataset -- CRIPP-VQA (Counterfactual Reasoning about Implicit Physical Properties - Video Question Answering), which contains videos of objects in motion, annotated with hypothetical/counterfactual questions about the effect of actions (such as removing, adding, or replacing objects), questions about planning (choosing actions to perform to reach a particular goal), as well as descriptive questions about the visible properties of objects. Further, I benchmark the performance of existing video question-answering models on two test settings of CRIPP-VQA: i.i.d. and an out-of-distribution setting which contains objects with values of mass, coefficient of friction, and initial velocities that are not seen in the training distribution. Experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this thesis) and explicit properties (the focus of prior work) of objects.

ContributorsPatel, Maitreya Jitendra (Author) / Yang, Yezhou (Thesis advisor) / Baral, Chitta (Committee member) / Lee, Kookjin (Committee member) / Arizona State University (Publisher)

Created2022

AvaCAR

Description

For a system of autonomous vehicles functioning together in a traffic scene, 3Dunderstanding of participants in the field of view or surrounding is very essential for assessing the safety operation of the involved. This problem can be decomposed into online pose and shape estimation, which has been a core research area of…

For a system of autonomous vehicles functioning together in a traffic scene, 3Dunderstanding of participants in the field of view or surrounding is very essential for assessing the safety operation of the involved. This problem can be decomposed into online pose and shape estimation, which has been a core research area of computer vision for over a decade now. This work is an add-on to support and improve the joint estimate of the pose and shape of vehicles from monocular cameras. The objective of jointly estimating the vehicle pose and shape online is enabled by what is called an offline reconstruction pipeline. In the offline reconstruction step, an approach to obtain the vehicle 3D shape with keypoints labeled is formulated. This work proposes a multi-view reconstruction pipeline using images and masks which can create an approximate shape of vehicles and can be used as a shape prior. Then a 3D model-fitting optimization approach to refine the shape prior using high quality computer-aided design (CAD) models of vehicles is developed. A dataset of such 3D vehicles with 20 keypoints annotated is prepared and call it the AvaCAR dataset. The AvaCAR dataset can be used to estimate the vehicle shape and pose, without having the need to collect significant amounts of data needed for adequate training of a neural network. The online reconstruction can use this synthesis dataset to generate novel viewpoints and simultaneously train a neural network for pose and shape estimation. Most methods in the current literature using deep neural networks, that are trained to estimate pose of the object from a single image, are inherently biased to the viewpoint of the images used. This approach aims at addressing these existing limitations in the current method by delivering the online estimation a shape prior which can generate novel views to account for the bias due to viewpoint. The dataset is provided with ground truth extrinsic parameters and the compact vector based shape representations which along with the multi-view dataset can be used to efficiently trained neural networks for vehicle pose and shape estimation. The vehicles in this library are evaluated with some standard metrics to assure they are capable of aiding online estimation and model based tracking.

ContributorsDUTTA, PRABAL BIJOY (Author) / Yang, Yezhou (Thesis advisor) / Berman, Spring (Committee member) / Lu, Duo (Committee member) / Arizona State University (Publisher)

Created2022

Mining IoT Network Traffic in Smart Homes: Traffic Measurement, Pattern Recognition, and Security Applications

Description

Recent advances in cyber-physical systems, artificial intelligence, and cloud computing have driven the widespread deployment of Internet-of-Things (IoT) devices in smart homes. However, the spate of cyber attacks exploiting the vulnerabilities and weak security management of smart home IoT devices have highlighted the urgency and challenges of designing efficient mechanisms…

Recent advances in cyber-physical systems, artificial intelligence, and cloud computing have driven the widespread deployment of Internet-of-Things (IoT) devices in smart homes. However, the spate of cyber attacks exploiting the vulnerabilities and weak security management of smart home IoT devices have highlighted the urgency and challenges of designing efficient mechanisms for detecting, analyzing, and mitigating security threats towards them. In this dissertation, I seek to address the security and privacy issues of smart home IoT devices from the perspectives of traffic measurement, pattern recognition, and security applications. I first propose an efficient multidimensional smart home network traffic measurement framework, which enables me to deeply understand the smart home IoT ecosystem and detect various vulnerabilities and flaws. I further design intelligent schemes to efficiently extract security-related IoT device event and user activity patterns from the encrypted smart home network traffic. Based on the knowledge of how smart home operates, different systems for securing smart home networks are proposed and implemented, including abnormal network traffic detection across multiple IoT networking protocol layers, smart home safety monitoring with extracted spatial information about IoT device events, and system-level IoT vulnerability analysis and network hardening.

ContributorsWan, Yinxin (Author) / Xue, Guoliang (Thesis advisor) / Xu, Kuai (Thesis advisor) / Yang, Yezhou (Committee member) / Zhang, Yanchao (Committee member) / Arizona State University (Publisher)

Created2023

Filtering by