Matching Items (3)
Description
Video analysis and understanding have attracted increasing attention in recent years. The research community has devoted considerable effort to, and made progress on, many related visual tasks, such as video action/event recognition, thumbnail frame or video index retrieval, and zero-shot learning. Finding good representative features of videos is an important objective for these tasks.

Thanks to the success of deep neural networks in recent vision tasks, it is natural to consider deep learning methods for extracting better global representations of images and videos. In general, a Convolutional Neural Network (CNN) is used to obtain spatial information, and a Recurrent Neural Network (RNN) is used to capture temporal information.

This dissertation offers a perspective on the challenging problems posed by different kinds of videos, which may require different solutions. It therefore outlines several novel deep learning-based approaches to obtaining representative features for different visual tasks, including zero-shot learning, video retrieval, and video event recognition. To better capture the spatial and temporal information in videos, a Convolutional Neural Network and a Recurrent Neural Network are used jointly in most of the approaches. Experiments are conducted to demonstrate the importance and effectiveness of good representative features for understanding video clips in computer vision. The dissertation concludes with a discussion of possible future work on obtaining better representative features for more challenging video clips.
Contributors: Li, Yikang (Author) / Li, Baoxin (Thesis advisor) / Karam, Lina (Committee member) / LiKamWa, Robert (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
The field of computer vision has achieved tremendous progress in recent years through innovations in deep learning and neural networks. These advances have enabled, to an unprecedented degree, an intelligent agent to understand the world from its visual observations, such as recognizing an object, detecting the object's position, and estimating the distance to the object. This raises the question of how such visual understanding can support the agent's decisions about which actions to take to perform a task. This dissertation studies that question and presents several methods that address the challenges of learning, from the agent's visual inputs, an action policy that lets the agent perform a task well. Specifically, the dissertation starts with learning an action policy from high-dimensional visual observations by improving sample efficiency. The improved sample efficiency is achieved through a denser reward function defined on the visual understanding of the task, and an efficient exploration strategy equipped with a hierarchical policy. It then studies the problem of learning a generalizable action policy. Generalizability is achieved both for a fully observable task, with local environment dynamics captured by visual representations, and for a partially observable task, with global environment dynamics captured by a novel graph representation. Finally, the dissertation explores learning from human-provided priors, such as natural language instructions and demonstration videos, for better generalization ability.
Contributors: Ye, Xin (Author) / Yang, Yezhou (Thesis advisor) / Ren, Yi (Committee member) / Pavlic, Theodore (Committee member) / Fan, Deliang (Committee member) / Srivastava, Siddharth (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
In image classification tasks, images are often corrupted by spatial transformations like translations and rotations. In this work, I utilize an existing method that uses the Fourier series expansion to generate a rotation- and translation-invariant representation of closed contours found in sketches, aiming to attenuate the effects of distribution shift caused by the aforementioned transformations. I use this technique to transform input images into one of two different invariant representations, a Fourier series representation and a corrected raster image representation, prior to passing them to a neural network for classification. The architectures used include convolutional neural networks (CNNs), multi-layer perceptrons (MLPs), and graph neural networks (GNNs). I compare the performance of this method to using data augmentation during training, the standard approach for addressing distribution shift, to see which strategy yields the best performance when evaluated against a test set with rotations and translations applied. I include experiments where the augmentations applied during training both do and do not accurately reflect the transformations encountered at test time. Additionally, I investigate the robustness of both approaches to high-frequency noise. In each experiment, I also compare training efficiency across models. I conduct experiments on three datasets: the MNIST handwritten digit dataset, a custom dataset (QD-3) consisting of three classes of geometric figures from the Quick, Draw! hand-drawn sketch dataset, and another custom dataset (QD-345) featuring sketches from all 345 classes found in Quick, Draw!. On the smaller problem spaces of MNIST and QD-3, the networks utilizing the Fourier-based technique to attenuate distribution shift perform competitively with the standard data augmentation strategy.
On the more complex problem space of QD-345, the networks using the Fourier technique do not achieve the same test performance as correctly applied data augmentation. However, they still outperform instances where train-time augmentations mispredict test-time transformations, and they outperform a naive baseline model where no strategy is used to attenuate distribution shift. Overall, this work provides evidence that strategies which attempt to directly mitigate distribution shift, rather than simply increasing the diversity of the training data, can be successful when certain conditions hold.
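To give a rough sense of how a Fourier expansion can make a closed contour insensitive to rotation and translation, the sketch below computes a classical magnitude-only Fourier descriptor with NumPy. This is an illustrative assumption, not the method used in the thesis: the abstract does not specify the exact formulation, and the function name `fourier_descriptor`, the parameter `n_coeffs`, and the test shape are all hypothetical.

```python
import numpy as np

def fourier_descriptor(points, n_coeffs=8):
    """Magnitude-only Fourier descriptor of a closed contour (illustrative).

    points: (N, 2) array of x, y samples ordered along the contour,
            with N > 2 * n_coeffs.
    """
    # Treat each contour point as a complex number x + iy.
    z = points[:, 0] + 1j * points[:, 1]
    coeffs = np.fft.fft(z) / len(z)
    # coeffs[0] is the centroid, so skipping it removes translation.
    # A rotation multiplies every coefficient by the same unit-modulus
    # factor, so taking magnitudes removes rotation.
    k = np.r_[1:n_coeffs + 1, -n_coeffs:0]  # lowest positive and negative frequencies
    return np.abs(coeffs[k])

# Hypothetical closed contour: a three-lobed curve sampled at 64 points.
t = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
r = 1.0 + 0.3 * np.cos(3.0 * t)
shape = np.stack([r * np.cos(t), r * np.sin(t)], axis=1)

# Rotate by 40 degrees and translate the same contour.
theta = np.deg2rad(40.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
moved = shape @ rotation.T + np.array([5.0, -2.0])

same = np.allclose(fourier_descriptor(shape), fourier_descriptor(moved))
```

Discarding the phase also discards some shape information, so a magnitude-only descriptor is only one possible design. The thesis describes both a Fourier series representation and a corrected raster image representation, which suggests its normalization may retain more than this sketch does.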
Contributors: Watson, Matthew (Author) / Yang, Yezhou (Thesis advisor) / Kerner, Hannah (Committee member) / Yang, Yingzhen (Committee member) / Arizona State University (Publisher)
Created: 2023