This collection includes both ASU Theses and Dissertations, submitted by graduate students, and Barrett, The Honors College theses, submitted by undergraduate students.

Description
High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.

Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as “handcrafted” features because they were deliberately designed rather than learned from data. However, they may fail when dealing with high-level tasks or videos of complex scenes. Following the success of deep convolutional neural networks (CNNs) in extracting global representations for static images, researchers have been using similar techniques to tackle video content. Typical techniques first extract spatial features by processing raw images with deep convolutional architectures designed for static image classification. Simple averaging, concatenation, or classifier-based fusion/pooling methods are then applied to the extracted features. I argue that features extracted in such ways do not capture enough representative information, since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual content and thus need to be represented in a manner that considers both semantic and spatio-temporal information.
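
The point about simple pooling can be illustrated with a small sketch. It is not taken from the thesis: the function name `mean_pool` and the toy two-dimensional per-frame features are assumptions made purely for illustration. The sketch shows that averaging per-frame features gives the same video-level vector whether the frames are played forward or backward, so temporal order is lost.

```python
def mean_pool(frame_features):
    """Average per-frame feature vectors into one video-level vector."""
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[d] for f in frame_features) / n for d in range(dim)]


# Hypothetical per-frame features for a four-frame clip (illustrative values).
forward = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [3.0, 0.0]]
# The same clip played in reverse order.
backward = list(reversed(forward))

pooled_forward = mean_pool(forward)
pooled_backward = mean_pool(backward)

# Both orderings collapse to the same vector: order information is gone.
print(pooled_forward, pooled_backward)
```

A recurrent encoder, by contrast, consumes the frames in sequence, so the two orderings would in general produce different representations.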

In this thesis, I propose a novel architecture to learn a semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion), each followed by a corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoder that captures the longer-term temporal structure of the CNN features. The resulting spatio-temporal representation (a vector) is used to learn a mapping, via a Fully Connected Multilayer Perceptron (FC-MLP), to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic (word-to-video) video retrieval, using the UCF101 action recognition dataset.
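
As a rough sketch of how a video vector mapped into a word-embedding space might support zero-shot classification, the snippet below assigns a video to the class whose word vector is closest by cosine similarity. The abstract does not state the matching rule, so nearest-neighbor cosine matching is an assumption here, and the three-dimensional vectors, the class names, and the helper names `cosine` and `zero_shot_label` are hypothetical stand-ins rather than actual word2vec outputs or thesis code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def zero_shot_label(video_embedding, class_vectors):
    """Return the class name whose word vector is most similar to the video."""
    return max(class_vectors, key=lambda name: cosine(video_embedding, class_vectors[name]))


# Toy word vectors for classes never seen during training (illustrative only).
class_vectors = {
    "archery": [1.0, 0.0, 0.0],
    "bowling": [0.0, 1.0, 0.0],
    "surfing": [0.0, 0.0, 1.0],
}

# Hypothetical output of the video-to-semantic-space mapping for one clip.
video_embedding = [0.1, 0.9, 0.2]

predicted = zero_shot_label(video_embedding, class_vectors)
print(predicted)
```

The same similarity function could be used in the other direction for word-to-video retrieval, by ranking stored video embeddings against a query word vector.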
Contributors: Hu, Sheng-Hung (Author) / Li, Baoxin (Thesis advisor) / Turaga, Pavan (Committee member) / Liang, Jianming (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
Semantic image segmentation has been a key topic in applications involving image processing and computer vision. Owing to the success of, and continuing research in, deep learning, many deep learning-based segmentation architectures have been designed for various tasks. In this thesis, deep-learning architectures have been developed for a specific application in materials science: the segmentation process for the non-destructive study of the microstructure of Aluminum Alloy AA 7075. This process requires the use of various imaging tools and methodologies to obtain the ground-truth information. The image dataset obtained using Transmission X-ray microscopy (TXM) consists of raw 2D image specimens captured from the projections at every beam scan. The segmented 2D ground-truth images are obtained by applying reconstruction and filtering algorithms before using a scientific visualization tool for segmentation. These images represent the corrosive behavior caused by the precipitate and inclusion particles in the Aluminum AA 7075 alloy. The study of the tools that work best for X-ray microscopy-based imaging is still in its early stages.

In this thesis, the underlying concepts behind Convolutional Neural Networks (CNNs) and state-of-the-art semantic segmentation architectures are discussed in detail. The data generation and pre-processing applied to the AA 7075 data are also described, along with the experiments performed on the baseline and four other state-of-the-art segmentation architectures that predict the segmented boundaries from the raw 2D images. A performance analysis based on various factors was also conducted to decide the best techniques and tools for applying semantic image segmentation to X-ray microscopy-based imaging.
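
The abstract does not name the factors used in the performance analysis. As one commonly used possibility, and purely as an assumption, the sketch below computes intersection over union (IoU) between a predicted binary mask and a ground-truth mask. The function name `iou` and the tiny masks are hypothetical and are not data from the thesis.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Two empty masks are treated as a perfect match.
    return intersection / union if union else 1.0


# Illustrative 2x3 masks: 1 marks a segmented pixel, 0 marks background.
predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]

score = iou(predicted_mask, ground_truth)
print(score)
```

A score of 1.0 means the predicted and ground-truth regions coincide exactly, and lower values indicate less overlap.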
Contributors: Barboza, Daniel (Author) / Turaga, Pavan (Thesis advisor) / Chawla, Nikhilesh (Committee member) / Jayasuriya, Suren (Committee member) / Arizona State University (Publisher)
Created: 2020