Matching Items (109)
Filtering by

Clear all filters

187694-Thumbnail Image.png
Description
In the era of information explosion and multi-modal data, information retrieval (IR) and question answering (QA) systems have become essential in daily human activities. IR systems aim to find relevant information in response to user queries, while QA systems provide concise and accurate answers to user questions. IR and

In the era of information explosion and multi-modal data, information retrieval (IR) and question answering (QA) systems have become essential in daily human activities. IR systems aim to find relevant information in response to user queries, while QA systems provide concise and accurate answers to user questions. IR and QA are two of the most crucial challenges in the realm of Artificial Intelligence (AI), with wide-ranging real-world applications such as search engines and dialogue systems. This dissertation investigates and develops novel models and training objectives to enhance current retrieval systems in textual and multi-modal contexts. Moreover, it examines QA systems, emphasizing generalization and robustness, and creates new benchmarks to promote their progress. Neural retrievers have surfaced as a viable solution, capable of surpassing the constraints of traditional term-matching search algorithms. This dissertation presents Poly-DPR, an innovative multi-vector model architecture that manages test-query, and ReViz, a comprehensive multimodal model to tackle multi-modality queries. By utilizing IR-focused pretraining tasks and producing large-scale training data, the proposed methodology substantially improves the abilities of existing neural retrievers.Concurrently, this dissertation investigates the realm of QA systems, referred to as ``readers'', by performing an exhaustive analysis of current extractive and generative readers, which results in a reliable guidance for selecting readers for downstream applications. Additionally, an original reader (Two-in-One) is designed to effectively choose the pertinent passages and sentences from a pool of candidates for multi-hop reasoning. This dissertation also acknowledges the significance of logical reasoning in real-world applications and has developed a comprehensive testbed, LogiGLUE, to further the advancement of reasoning capabilities in QA systems.
ContributorsLuo, Man (Author) / Baral, Chitta (Thesis advisor) / Yang, Yezhou (Committee member) / Blanco, Eduardo (Committee member) / Chen, Danqi (Committee member) / Arizona State University (Publisher)
Created2023
168406-Thumbnail Image.png
Description
Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled engineers. Changing the robots’ behavior to cover new duties or

Enabling robots to physically engage with their environment in a safe and efficient manner is an essential step towards human-robot interaction. To date, robots usually operate as pre-programmed workers that blindly execute tasks in highly structured environments crafted by skilled engineers. Changing the robots’ behavior to cover new duties or handle variability is an expensive, complex, and time-consuming process. However, with the advent of more complex sensors and algorithms, overcoming these limitations becomes within reach. This work proposes innovations in artificial intelligence, language understanding, and multimodal integration to enable next-generation grasping and manipulation capabilities in autonomous robots. The underlying thesis is that multimodal observations and instructions can drastically expand the responsiveness and dexterity of robot manipulators. Natural language, in particular, can be used to enable intuitive, bidirectional communication between a human user and the machine. To this end, this work presents a system that learns context-aware robot control policies from multimodal human demonstrations. Among the main contributions presented are techniques for (a) collecting demonstrations in an efficient and intuitive fashion, (b) methods for leveraging physical contact with the environment and objects, (c) the incorporation of natural language to understand context, and (d) the generation of robust robot control policies. The presented approach and systems are evaluated in multiple grasping and manipulation settings ranging from dexterous manipulation to pick-and-place, as well as contact-rich bimanual insertion tasks. Moreover, the usability of these innovations, especially when utilizing human task demonstrations and communication interfaces, is evaluated in several human-subject studies.
ContributorsStepputtis, Simon (Author) / Ben Amor, Heni (Thesis advisor) / Baral, Chitta (Committee member) / Yang, Yezhou (Committee member) / Lee, Stefan (Committee member) / Arizona State University (Publisher)
Created2021
168367-Thumbnail Image.png
Description
In recent years, there has been significant progress in deep learning and computer vision, with many models proposed that have achieved state-of-art results on various image recognition tasks. However, to explore the full potential of the advances in this field, there is an urgent need to push the processing of

In recent years, there has been significant progress in deep learning and computer vision, with many models proposed that have achieved state-of-art results on various image recognition tasks. However, to explore the full potential of the advances in this field, there is an urgent need to push the processing of deep networks from the cloud to edge devices. Unfortunately, many deep learning models cannot be efficiently implemented on edge devices as these devices are severely resource-constrained. In this thesis, I present QU-Net, a lightweight binary segmentation model based on the U-Net architecture. Traditionally, neural networks consider the entire image to be significant. However, in real-world scenarios, many regions in an image do not contain any objects of significance. These regions can be removed from the original input allowing a network to focus on the relevant regions and thus reduce computational costs. QU-Net proposes the salient regions (binary mask) that the deeper models can use as the input. Experiments show that QU-Net helped achieve a computational reduction of 25% on the Microsoft Common Objects in Context (MS COCO) dataset and 57% on the Cityscapes dataset. Moreover, QU-Net is a generalizable model that outperforms other similar works, such as Dynamic Convolutions.
ContributorsSanthosh Kumar Varma, Rahul (Author) / Yang, Yezhou (Thesis advisor) / Fan, Deliang (Committee member) / Yang, Yingzhen (Committee member) / Arizona State University (Publisher)
Created2021
168441-Thumbnail Image.png
Description
Generative models in various domain such as images, speeches, and videos are beingdeveloped actively over the last decades and recent deep generative models are now capable of synthesizing multimedia contents are difficult to be distinguishable from authentic contents. Such capabilities cause concerns such as malicious impersonation, Intellectual property theft(IP theft) and copyright infringement. One

Generative models in various domain such as images, speeches, and videos are beingdeveloped actively over the last decades and recent deep generative models are now capable of synthesizing multimedia contents are difficult to be distinguishable from authentic contents. Such capabilities cause concerns such as malicious impersonation, Intellectual property theft(IP theft) and copyright infringement. One method to solve these threats is to embedded attributable watermarking in synthesized contents so that user can identify the user-end models where the contents are generated from. This paper investigates a solution for model attribution, i.e., the classification of synthetic contents by their source models via watermarks embedded in the contents. Existing studies showed the feasibility of model attribution in the image domain and tradeoff between attribution accuracy and generation quality under the various adversarial attacks but not in speech domain. This work discuss the feasibility of model attribution in different domain and algorithmic improvements for generating user-end speech models that empirically achieve high accuracy of attribution while maintaining high generation quality. Lastly, several experiments are conducted show the tradeoff between attributability and generation quality under a variety of attacks on generated speech signals attempting to remove the watermarks.
ContributorsCho, Yongbaek (Author) / Yang, Yezhou (Thesis advisor) / Ren, Yi (Committee member) / Trieu, Ni (Committee member) / Arizona State University (Publisher)
Created2021
171495-Thumbnail Image.png
Description
Multimodal reasoning is one of the most interesting research fields because of the ability to interact with systems and the explainability of the models' behavior. Traditional multimodal research problems do not focus on complex commonsense reasoning (such as physical interactions). Although real-world objects have physical properties associated with them,

Multimodal reasoning is one of the most interesting research fields because of the ability to interact with systems and the explainability of the models' behavior. Traditional multimodal research problems do not focus on complex commonsense reasoning (such as physical interactions). Although real-world objects have physical properties associated with them, many of these properties (such as mass and coefficient of friction) are not captured directly by the imaging pipeline. Videos often capture objects, their motion, and the interactions between different objects. However, these properties can be estimated by utilizing cues from relative object motion and the dynamics introduced by collisions. This thesis introduces a new video question-answering task for reasoning about the implicit physical properties of objects in a scene, from videos. For this task, I introduce a dataset -- CRIPP-VQA (Counterfactual Reasoning about Implicit Physical Properties - Video Question Answering), which contains videos of objects in motion, annotated with hypothetical/counterfactual questions about the effect of actions (such as removing, adding, or replacing objects), questions about planning (choosing actions to perform to reach a particular goal), as well as descriptive questions about the visible properties of objects. Further, I benchmark the performance of existing video question-answering models on two test settings of CRIPP-VQA: i.i.d. and an out-of-distribution setting which contains objects with values of mass, coefficient of friction, and initial velocities that are not seen in the training distribution. Experiments reveal a surprising and significant performance gap in terms of answering questions about implicit properties (the focus of this thesis) and explicit properties (the focus of prior work) of objects.
ContributorsPatel, Maitreya Jitendra (Author) / Yang, Yezhou (Thesis advisor) / Baral, Chitta (Committee member) / Lee, Kookjin (Committee member) / Arizona State University (Publisher)
Created2022
171810-Thumbnail Image.png
Description
For a system of autonomous vehicles functioning together in a traffic scene, 3Dunderstanding of participants in the field of view or surrounding is very essential for assessing the safety operation of the involved. This problem can be decomposed into online pose and shape estimation, which has been a core research area of

For a system of autonomous vehicles functioning together in a traffic scene, 3Dunderstanding of participants in the field of view or surrounding is very essential for assessing the safety operation of the involved. This problem can be decomposed into online pose and shape estimation, which has been a core research area of computer vision for over a decade now. This work is an add-on to support and improve the joint estimate of the pose and shape of vehicles from monocular cameras. The objective of jointly estimating the vehicle pose and shape online is enabled by what is called an offline reconstruction pipeline. In the offline reconstruction step, an approach to obtain the vehicle 3D shape with keypoints labeled is formulated. This work proposes a multi-view reconstruction pipeline using images and masks which can create an approximate shape of vehicles and can be used as a shape prior. Then a 3D model-fitting optimization approach to refine the shape prior using high quality computer-aided design (CAD) models of vehicles is developed. A dataset of such 3D vehicles with 20 keypoints annotated is prepared and call it the AvaCAR dataset. The AvaCAR dataset can be used to estimate the vehicle shape and pose, without having the need to collect significant amounts of data needed for adequate training of a neural network. The online reconstruction can use this synthesis dataset to generate novel viewpoints and simultaneously train a neural network for pose and shape estimation. Most methods in the current literature using deep neural networks, that are trained to estimate pose of the object from a single image, are inherently biased to the viewpoint of the images used. This approach aims at addressing these existing limitations in the current method by delivering the online estimation a shape prior which can generate novel views to account for the bias due to viewpoint. The dataset is provided with ground truth extrinsic parameters and the compact vector based shape representations which along with the multi-view dataset can be used to efficiently trained neural networks for vehicle pose and shape estimation. The vehicles in this library are evaluated with some standard metrics to assure they are capable of aiding online estimation and model based tracking.
ContributorsDUTTA, PRABAL BIJOY (Author) / Yang, Yezhou (Thesis advisor) / Berman, Spring (Committee member) / Lu, Duo (Committee member) / Arizona State University (Publisher)
Created2022
171818-Thumbnail Image.png
Description
Recent advances in autonomous vehicle (AV) technologies have ensured that autonomous driving will soon be present in real-world traffic. Despite the potential of AVs, many studies have shown that traffic accidents in hybrid traffic environments (where both AVs and human-driven vehicles (HVs) are present) are inevitable because of the unpredictability

Recent advances in autonomous vehicle (AV) technologies have ensured that autonomous driving will soon be present in real-world traffic. Despite the potential of AVs, many studies have shown that traffic accidents in hybrid traffic environments (where both AVs and human-driven vehicles (HVs) are present) are inevitable because of the unpredictability of human-driven vehicles. Given that eliminating accidents is impossible, an achievable goal of designing AVs is to design them in a way so that they will not be blamed for any accident in which they are involved in. This work proposes BlaFT – a Blame-Free motion planning algorithm in hybrid Traffic. BlaFT is designed to be compatible with HVs and other AVs, and will not be blamed for accidents in a structured road environment. Also, it proves that no accidents will happen if all AVs are using the BlaFT motion planner and that when in hybrid traffic, the AV using BlaFT will be blame-free even if it is involved in a collision. The work instantiated scores of BlaFT and HV vehicles in an urban road scape loop in the 'Simulation of Urban MObility', ran the simulation for several hours, and observe that as the percentage of BlaFT vehicles increases, the traffic becomes safer. Adding BlaFT vehicles to HVs also increases the efficiency of traffic as a whole by up to 34%.
ContributorsPark, Sanggu (Author) / Shrivastava, Aviral (Thesis advisor) / Wang, Ruoyu (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created2022
189245-Thumbnail Image.png
Description
Recent advances in cyber-physical systems, artificial intelligence, and cloud computing have driven the widespread deployment of Internet-of-Things (IoT) devices in smart homes. However, the spate of cyber attacks exploiting the vulnerabilities and weak security management of smart home IoT devices have highlighted the urgency and challenges of designing efficient mechanisms

Recent advances in cyber-physical systems, artificial intelligence, and cloud computing have driven the widespread deployment of Internet-of-Things (IoT) devices in smart homes. However, the spate of cyber attacks exploiting the vulnerabilities and weak security management of smart home IoT devices have highlighted the urgency and challenges of designing efficient mechanisms for detecting, analyzing, and mitigating security threats towards them. In this dissertation, I seek to address the security and privacy issues of smart home IoT devices from the perspectives of traffic measurement, pattern recognition, and security applications. I first propose an efficient multidimensional smart home network traffic measurement framework, which enables me to deeply understand the smart home IoT ecosystem and detect various vulnerabilities and flaws. I further design intelligent schemes to efficiently extract security-related IoT device event and user activity patterns from the encrypted smart home network traffic. Based on the knowledge of how smart home operates, different systems for securing smart home networks are proposed and implemented, including abnormal network traffic detection across multiple IoT networking protocol layers, smart home safety monitoring with extracted spatial information about IoT device events, and system-level IoT vulnerability analysis and network hardening.
ContributorsWan, Yinxin (Author) / Xue, Guoliang (Thesis advisor) / Xu, Kuai (Thesis advisor) / Yang, Yezhou (Committee member) / Zhang, Yanchao (Committee member) / Arizona State University (Publisher)
Created2023
189209-Thumbnail Image.png
Description
In natural language processing, language models have achieved remarkable success over the last few years. The Transformers are at the core of most of these models. Their success can be mainly attributed to an enormous amount of curated data they are trained on. Even though such language models are trained

In natural language processing, language models have achieved remarkable success over the last few years. The Transformers are at the core of most of these models. Their success can be mainly attributed to an enormous amount of curated data they are trained on. Even though such language models are trained on massive curated data, they often need specific extracted knowledge to understand better and reason. This is because often relevant knowledge may be implicit or missing, which hampers machine reasoning. Apart from that, manual knowledge curation is time-consuming and erroneous. Hence, finding fast and effective methods to extract such knowledge from data is important for improving language models. This leads to finding ideal ways to utilize such knowledge by incorporating them into language models. Successful knowledge extraction and integration lead to an important question of knowledge evaluation of such models by developing tools or introducing challenging test suites to learn about their limitations and improve them further. So to improve the transformer-based models, understanding the role of knowledge becomes important. In the pursuit to improve language models with knowledge, in this dissertation I study three broad research directions spanning across the natural language, biomedical and cybersecurity domains: (1) Knowledge Extraction (KX) - How can transformer-based language models be leveraged to extract knowledge from data? (2) Knowledge Integration (KI) - How can such specific knowledge be used to improve such models? (3) Knowledge Evaluation (KE) - How can language models be evaluated for specific skills and understand their limitations? I propose methods to extract explicit textual, implicit structural, missing textual, and missing structural knowledge from natural language and binary programs using transformer-based language models. I develop ways to improve the language model’s multi-step and commonsense reasoning abilities using external knowledge. Finally, I develop challenging datasets which assess their numerical reasoning skills in both in-domain and out-of-domain settings.
ContributorsPal, Kuntal Kumar (Author) / Baral, Chitta (Thesis advisor) / Wang, Ruoyu (Committee member) / Blanco, Eduardo (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created2023
168624-Thumbnail Image.png
Description
How to teach a machine to understand natural language? This question is a long-standing challenge in Artificial Intelligence. Several tasks are designed to measure the progress of this challenge. Question Answering is one such task that evaluates a machine's ability to understand natural language, where it reads a passage of

How to teach a machine to understand natural language? This question is a long-standing challenge in Artificial Intelligence. Several tasks are designed to measure the progress of this challenge. Question Answering is one such task that evaluates a machine's ability to understand natural language, where it reads a passage of text or an image and answers comprehension questions. In recent years, the development of transformer-based language models and large-scale human-annotated datasets has led to remarkable progress in the field of question answering. However, several disadvantages of fully supervised question answering systems have been observed. Such as generalizing to unseen out-of-distribution domains, linguistic style differences in questions, and adversarial samples. This thesis proposes implicitly supervised question answering systems trained using knowledge acquisition from external knowledge sources and new learning methods that provide inductive biases to learn question answering. In particular, the following research projects are discussed: (1) Knowledge Acquisition methods: these include semantic and abductive information retrieval for seeking missing knowledge, a method to represent unstructured text corpora as a knowledge graph, and constructing a knowledge base for implicit commonsense reasoning. (2) Learning methods: these include Knowledge Triplet Learning, a method over knowledge graphs; Test-Time Learning, a method to generalize to an unseen out-of-distribution context; WeaQA, a method to learn visual question answering using image captions without strong supervision; WeaSel, weakly supervised method for relative spatial reasoning; and a new paradigm for unsupervised natural language inference. These methods potentially provide a new research direction to overcome the pitfalls of direct supervision.
ContributorsBanerjee, Pratyay (Author) / Baral, Chitta (Thesis advisor) / Yang, Yezhou (Committee member) / Blanco, Eduardo (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)
Created2022