Matching Items (10)
Filtering by

Clear all filters

168821-Thumbnail Image.png
Description
It is not merely an aggregation of static entities that a video clip carries, but alsoa variety of interactions and relations among these entities. Challenges still remain for a video captioning system to generate natural language descriptions focusing on the prominent interest and aligning with the latent aspects beyond observations. This work presents

It is not merely an aggregation of static entities that a video clip carries, but alsoa variety of interactions and relations among these entities. Challenges still remain for a video captioning system to generate natural language descriptions focusing on the prominent interest and aligning with the latent aspects beyond observations. This work presents a Commonsense knowledge Anchored Video cAptioNing (dubbed as CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of video captioning model with a novel paradigm for sentence-level semantic alignment. Specifically, commonsense knowledge is queried to complement per training caption by querying a generic knowledge atlas ATOMIC, and form the commonsense- caption entailment corpus. A BERT based language entailment model trained from this corpus then serves as a commonsense discriminator for the training of video captioning model, and penalizes the model from generating semantically misaligned captions. With extensive empirical evaluations on MSR-VTT, V2C and VATEX datasets, CAVAN consistently improves the quality of generations and shows higher keyword hit rate. Experimental results with ablations validate the effectiveness of CAVAN and reveals that the use of commonsense knowledge contributes to the video caption generation.
ContributorsShao, Huiliang (Author) / Yang, Yezhou (Thesis advisor) / Jayasuriya, Suren (Committee member) / Xiao, Chaowei (Committee member) / Arizona State University (Publisher)
Created2022
193564-Thumbnail Image.png
Description
Manipulator motion planning has conventionally been solved using sampling and optimization-based algorithms that are agnostic to embodiment and environment configurations. However, these algorithms plan on a fixed environment representation approximated using shape primitives, and hence struggle to find solutions for cluttered and dynamic environments. Furthermore, these algorithms fail to produce

Manipulator motion planning has conventionally been solved using sampling and optimization-based algorithms that are agnostic to embodiment and environment configurations. However, these algorithms plan on a fixed environment representation approximated using shape primitives, and hence struggle to find solutions for cluttered and dynamic environments. Furthermore, these algorithms fail to produce solutions for complex unstructured environments under real-time bounds. Neural Motion Planners (NMPs) are an appealing alternative to algorithmic approaches as they can leverage parallel computing for planning while incorporating arbitrary environmental constraints directly from raw sensor observations. Contemporary NMPs successfully transfer to different environment variations, however, fail to generalize across embodiments. This thesis proposes "AnyNMP'', a generalist motion planning policy for zero-shot transfer across different robotic manipulators and environments. The policy is conditioned on semantically segmented 3D pointcloud representation of the workspace thus enabling implicit sim2real transfer. In the proposed approach, templates are formulated for manipulator kinematics and ground truth motion plans are collected for over 3 million procedurally sampled robots in randomized environments. The planning pipeline consists of a state validation model for differentiable collision detection and a sampling based planner for motion generation. AnyNMP has been validated on 5 different commercially available manipulators and showcases successful cross-embodiment planning, achieving an 80% average success rate on baseline benchmarks.
ContributorsRath, Prabin Kumar (Author) / Gopalan, Nakul (Thesis advisor) / Yu, Hongbin (Thesis advisor) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created2024
157215-Thumbnail Image.png
Description
Non-line-of-sight (NLOS) imaging of objects not visible to either the camera or illumina-

tion source is a challenging task with vital applications including surveillance and robotics.

Recent NLOS reconstruction advances have been achieved using time-resolved measure-

ments. Acquiring these time-resolved measurements requires expensive and specialized

detectors and laser sources. In work proposes a data-driven

Non-line-of-sight (NLOS) imaging of objects not visible to either the camera or illumina-

tion source is a challenging task with vital applications including surveillance and robotics.

Recent NLOS reconstruction advances have been achieved using time-resolved measure-

ments. Acquiring these time-resolved measurements requires expensive and specialized

detectors and laser sources. In work proposes a data-driven approach for NLOS 3D local-

ization requiring only a conventional camera and projector. The localisation is performed

using a voxelisation and a regression problem. Accuracy of greater than 90% is achieved

in localizing a NLOS object to a 5cm × 5cm × 5cm volume in real data. By adopting

the regression approach an object of width 10cm to localised to approximately 1.5cm. To

generalize to line-of-sight (LOS) scenes with non-planar surfaces, an adaptive lighting al-

gorithm is adopted. This algorithm, based on radiosity, identifies and illuminates scene

patches in the LOS which most contribute to the NLOS light paths, and can factor in sys-

tem power constraints. Improvements ranging from 6%-15% in accuracy with a non-planar

LOS wall using adaptive lighting is reported, demonstrating the advantage of combining

the physics of light transport with active illumination for data-driven NLOS imaging.
ContributorsChandran, Sreenithy (Author) / Jayasuriya, Suren (Thesis advisor) / Turaga, Pavan (Committee member) / Dasarathy, Gautam (Committee member) / Arizona State University (Publisher)
Created2019
155085-Thumbnail Image.png
Description
High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.

Many video feature extraction algorithms have been purposed, such

High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.

Many video feature extraction algorithms have been purposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as “handcrafted” features as they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolution neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video contents. Typical techniques first extract spatial features by processing raw images using deep convolution architectures designed for static image classifications. Then simple average, concatenation or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual contents and thus need to be represented in a manner considering both semantic and spatio-temporal information.

In this thesis, I propose a novel architecture to learn semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders for capturing longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic video retrieval (word-to-video) retrieval, using the UCF101 action recognition dataset.
ContributorsHu, Sheng-Hung (Author) / Li, Baoxin (Thesis advisor) / Turaga, Pavan (Committee member) / Liang, Jianming (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)
Created2016
155900-Thumbnail Image.png
Description
Compressive sensing theory allows to sense and reconstruct signals/images with lower sampling rate than Nyquist rate. Applications in resource constrained environment stand to benefit from this theory, opening up many possibilities for new applications at the same time. The traditional inference pipeline for computer vision sequence reconstructing the image from

Compressive sensing theory allows to sense and reconstruct signals/images with lower sampling rate than Nyquist rate. Applications in resource constrained environment stand to benefit from this theory, opening up many possibilities for new applications at the same time. The traditional inference pipeline for computer vision sequence reconstructing the image from compressive measurements. However,the reconstruction process is a computationally expensive step that also provides poor results at high compression rate. There have been several successful attempts to perform inference tasks directly on compressive measurements such as activity recognition. In this thesis, I am interested to tackle a more challenging vision problem - Visual question answering (VQA) without reconstructing the compressive images. I investigate the feasibility of this problem with a series of experiments, and I evaluate proposed methods on a VQA dataset and discuss promising results and direction for future work.
ContributorsHuang, Li-Chin (Author) / Turaga, Pavan (Thesis advisor) / Yang, Yezhou (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)
Created2017
155809-Thumbnail Image.png
Description
Light field imaging is limited in its computational processing demands of high

sampling for both spatial and angular dimensions. Single-shot light field cameras

sacrifice spatial resolution to sample angular viewpoints, typically by multiplexing

incoming rays onto a 2D sensor array. While this resolution can be recovered using

compressive sensing, these iterative solutions are slow

Light field imaging is limited in its computational processing demands of high

sampling for both spatial and angular dimensions. Single-shot light field cameras

sacrifice spatial resolution to sample angular viewpoints, typically by multiplexing

incoming rays onto a 2D sensor array. While this resolution can be recovered using

compressive sensing, these iterative solutions are slow in processing a light field. We

present a deep learning approach using a new, two branch network architecture,

consisting jointly of an autoencoder and a 4D CNN, to recover a high resolution

4D light field from a single coded 2D image. This network decreases reconstruction

time significantly while achieving average PSNR values of 26-32 dB on a variety of

light fields. In particular, reconstruction time is decreased from 35 minutes to 6.7

minutes as compared to the dictionary method for equivalent visual quality. These

reconstructions are performed at small sampling/compression ratios as low as 8%,

allowing for cheaper coded light field cameras. We test our network reconstructions

on synthetic light fields, simulated coded measurements of real light fields captured

from a Lytro Illum camera, and real coded images from a custom CMOS diffractive

light field camera. The combination of compressive light field capture with deep

learning allows the potential for real-time light field video acquisition systems in the

future.
ContributorsGupta, Mayank (Author) / Turaga, Pavan (Thesis advisor) / Yang, Yezhou (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)
Created2017
Description
The autonomous vehicle technology has come a long way, but currently, there are no companies that are able to offer fully autonomous ride in any conditions, on any road without any human supervision. These systems should be extensively trained and validated to guarantee safe human transportation. Any small errors in

The autonomous vehicle technology has come a long way, but currently, there are no companies that are able to offer fully autonomous ride in any conditions, on any road without any human supervision. These systems should be extensively trained and validated to guarantee safe human transportation. Any small errors in the system functionality may lead to fatal accidents and may endanger human lives. Deep learning methods are widely used for environment perception and prediction of hazardous situations. These techniques require huge amount of training data with both normal and abnormal samples to enable the vehicle to avoid a dangerous situation.



The goal of this thesis is to generate simulations from real-world tricky collision scenarios for training and testing autonomous vehicles. Dashcam crash videos from the internet can now be utilized to extract valuable collision data and recreate the crash scenarios in a simulator. The problem of extracting 3D vehicle trajectories from videos recorded by an unknown monocular camera source is solved using a modular approach. The framework is divided into two stages: (a) extracting meaningful adversarial trajectories from short crash videos, and (b) developing methods to automatically process and simulate the vehicle trajectories on a vehicle simulator.
ContributorsBashetty, Sai Krishna (Author) / Fainkeos, Georgios (Thesis advisor) / Amor, Heni Ben (Thesis advisor) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)
Created2019
157673-Thumbnail Image.png
Description
In this thesis, I present two new datasets and a modification to the existing models in the form of a novel attention mechanism for Natural Language Inference (NLI). The new datasets have been carefully synthesized from various existing corpora released for different tasks.

The task of NLI is to determine the

In this thesis, I present two new datasets and a modification to the existing models in the form of a novel attention mechanism for Natural Language Inference (NLI). The new datasets have been carefully synthesized from various existing corpora released for different tasks.

The task of NLI is to determine the possibility of a sentence referred to as “Hypothesis” being true given that another sentence referred to as “Premise” is true. In other words, the task is to identify whether the “Premise” entails, contradicts or remains neutral with regards to the “Hypothesis”. NLI is a precursor to solving many Natural Language Processing (NLP) tasks such as Question Answering and Semantic Search. For example, in Question Answering systems, the question is paraphrased to form a declarative statement which is treated as the hypothesis. The options are treated as the premise. The option with the maximum entailment score is considered as the answer. Considering the applications of NLI, the importance of having a strong NLI system can't be stressed enough.

Many large-scale datasets and models have been released in order to advance the field of NLI. While all of these models do get good accuracy on the test sets of the datasets they were trained on, they fail to capture the basic understanding of “Entities” and “Roles”. They often make the mistake of inferring that “John went to the market.” from “Peter went to the market.” failing to capture the notion of “Entities”. In other cases, these models don't understand the difference in the “Roles” played by the same entities in “Premise” and “Hypothesis” sentences and end up wrongly inferring that “Peter drove John to the stadium.” from “John drove Peter to the stadium.”

The lack of understanding of “Roles” can be attributed to the lack of such examples in the various existing datasets. The reason for the existing model’s failure in capturing the notion of “Entities” is not just due to the lack of such examples in the existing NLI datasets. It can also be attributed to the strict use of vector similarity in the “word-to-word” attention mechanism being used in the existing architectures.

To overcome these issues, I present two new datasets to help make the NLI systems capture the notion of “Entities” and “Roles”. The “NER Changed” (NC) dataset and the “Role-Switched” (RS) dataset contains examples of Premise-Hypothesis pairs that require the understanding of “Entities” and “Roles” respectively in order to be able to make correct inferences. This work shows how the existing architectures perform poorly on the “NER Changed” (NC) dataset even after being trained on the new datasets. In order to help the existing architectures, understand the notion of “Entities”, this work proposes a modification to the “word-to-word” attention mechanism. Instead of relying on vector similarity alone, the modified architectures learn to incorporate the “Symbolic Similarity” as well by using the Named-Entity features of the Premise and Hypothesis sentences. The new modified architectures not only perform significantly better than the unmodified architectures on the “NER Changed” (NC) dataset but also performs as well on the existing datasets.
ContributorsShrivastava, Ishan (Author) / Baral, Chitta (Thesis advisor) / Anwar, Saadat (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created2019
158597-Thumbnail Image.png
Description
Robot motion planning requires computing a sequence of waypoints from an initial configuration of the robot to the goal configuration. Solving a motion planning problem optimally is proven to be NP-Complete. Sampling-based motion planners efficiently compute an approximation of the optimal solution. They sample the configuration space uniformly and hence

Robot motion planning requires computing a sequence of waypoints from an initial configuration of the robot to the goal configuration. Solving a motion planning problem optimally is proven to be NP-Complete. Sampling-based motion planners efficiently compute an approximation of the optimal solution. They sample the configuration space uniformly and hence fail to sample regions of the environment that have narrow passages or pinch points. These critical regions are analogous to landmarks from planning literature as the robot is required to pass through them to reach the goal.

This work proposes a deep learning approach that identifies critical regions in the environment and learns a sampling distribution to effectively sample them in high dimensional configuration spaces.

A classification-based approach is used to learn the distributions. The robot degrees of freedom (DOF) limits are binned and a distribution is generated from sampling motion plan solutions. Conditional information like goal configuration and robot location encoded in the network inputs showcase the network learning to bias the identified critical regions towards the goal configuration. Empirical evaluations are performed against the state of the art sampling-based motion planners on a variety of tasks requiring the robot to pass through critical regions. An empirical analysis of robotic systems with three to eight degrees of freedom indicates that this approach effectively improves planning performance.
ContributorsSrinet, Abhyudaya (Author) / Srivastava, Siddharth (Thesis advisor) / Zhang, Yu (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created2020
155378-Thumbnail Image.png
Description
To ensure system integrity, robots need to proactively avoid any unwanted physical perturbation that may cause damage to the underlying hardware. In this thesis work, we investigate a machine learning approach that allows robots to anticipate impending physical perturbations from perceptual cues. In contrast to other approaches that require knowledge

To ensure system integrity, robots need to proactively avoid any unwanted physical perturbation that may cause damage to the underlying hardware. In this thesis work, we investigate a machine learning approach that allows robots to anticipate impending physical perturbations from perceptual cues. In contrast to other approaches that require knowledge about sources of perturbation to be encoded before deployment, our method is based on experiential learning. Robots learn to associate visual cues with subsequent physical perturbations and contacts. In turn, these extracted visual cues are then used to predict potential future perturbations acting on the robot. To this end, we introduce a novel deep network architecture which combines multiple sub- networks for dealing with robot dynamics and perceptual input from the environment. We present a self-supervised approach for training the system that does not require any labeling of training data. Extensive experiments in a human-robot interaction task show that a robot can learn to predict physical contact by a human interaction partner without any prior information or labeling. Furthermore, the network is able to successfully predict physical contact from either depth stream input or traditional video input or using both modalities as input.
ContributorsSur, Indranil (Author) / Amor, Heni B (Thesis advisor) / Fainekos, Georgios (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created2017