Matching Items (16)
155963-Thumbnail Image.png
Description
Computer Vision as a field has gone through significant changes in the last decade. The field has seen tremendous success in designing learning systems with hand-crafted features and in using representation learning to extract better features. In this dissertation some novel approaches to representation learning and task learning are studied.

Multiple-instance learning, which is a generalization of supervised learning, is one example of task learning that is discussed. In particular, a novel non-parametric k-NN-based multiple-instance learning method is proposed, which is shown to outperform other existing approaches. This solution is applied effectively to a diabetic retinopathy pathology detection problem.

In the case of representation learning, the generality of neural features is investigated first. This investigation leads to some critical understanding of, and results on, feature generality among datasets. The possibility of learning from a mentor network instead of from labels is then investigated. Distillation of dark knowledge is used to efficiently mentor a small network from a pre-trained large mentor network. These studies help in understanding representation learning with smaller and compressed networks.
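The idea of mentoring a small network from a large one can be illustrated with temperature-softened targets, the usual starting point for "dark knowledge" distillation. The sketch below is a generic illustration, not the dissertation's method: the function names, the temperature value, and the toy logits are all assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, mentor_logits, temperature=4.0):
    """Cross-entropy between the softened mentor and student distributions."""
    p_mentor = softmax(mentor_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pm * math.log(ps) for pm, ps in zip(p_mentor, p_student))

# Toy logits (hypothetical): the mentor strongly prefers class 0.
mentor = [6.0, 2.0, 1.0]
hard = softmax(mentor, temperature=1.0)   # nearly one-hot
soft = softmax(mentor, temperature=4.0)   # exposes relative class similarity
loss_match = distillation_loss(mentor, mentor)
loss_mismatch = distillation_loss([1.0, 2.0, 6.0], mentor)
```

Raising the temperature spreads probability mass onto the non-maximal classes, which is the extra signal a small network can learn from in addition to, or instead of, hard labels.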
Contributors: Venkatesan, Ragav (Author) / Li, Baoxin (Thesis advisor) / Turaga, Pavan (Committee member) / Yang, Yezhou (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created: 2017
156084-Thumbnail Image.png
Description
The performance of most of the visual computing tasks depends on the quality of the features extracted from the raw data. Insightful feature representation increases the performance of many learning algorithms by exposing the underlying explanatory factors of the output for the unobserved input. A good representation should also handle anomalies in the data such as missing samples and noisy input caused by the undesired, external factors of variation. It should also reduce the data redundancy. Over the years, many feature extraction processes have been invented to produce good representations of raw images and videos.

The feature extraction processes can be categorized into three groups. The first group contains processes that are hand-crafted for a specific task. Hand-engineering features requires the knowledge of domain experts and manual labor. However, the feature extraction process is interpretable and explainable. The next group contains the latent-feature extraction processes. While the original feature lies in a high-dimensional space, the relevant factors for a task often lie on a lower-dimensional manifold. Latent-feature extraction employs hidden variables to expose the underlying data properties that cannot be directly measured from the input. It imposes a specific structure, such as sparsity or low rank, on the derived representation through sophisticated optimization techniques. The last category is that of deep features. These are obtained by passing raw input data with minimal pre-processing through a deep network, whose parameters are computed by iteratively minimizing a task-based loss.

In this dissertation, I present four pieces of work where I create and learn suitable data representations. The first task employs hand-crafted features to perform clinically-relevant retrieval of diabetic retinopathy images. The second task uses latent features to perform content-adaptive image enhancement. The third task ranks a pair of images based on their aesthetics. The goal of the last task is to capture localized image artifacts in small datasets with patch-level labels. For the last two tasks, I propose novel deep architectures and show significant improvement over the previous state-of-the-art approaches. A suitable combination of feature representations augmented with an appropriate learning approach can increase performance for most visual computing tasks.
Contributors: Chandakkar, Parag Shridhar (Author) / Li, Baoxin (Thesis advisor) / Yang, Yezhou (Committee member) / Turaga, Pavan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created: 2017
156430-Thumbnail Image.png
Description
Machine learning models convert raw data in the form of video, images, audio, text, etc. into feature representations that are convenient for computational processing. Deep neural networks have proven to be very efficient feature extractors for a variety of machine learning tasks. Generative models based on deep neural networks introduce constraints on the feature space to learn transferable and disentangled representations. Transferable feature representations help in training machine learning models that are robust across different distributions of data. For example, with the application of transferable features in domain adaptation, models trained on a source distribution can be applied to data from a target distribution even though the distributions may be different. In style transfer and image-to-image translation, disentangled representations allow for the separation of style and content when translating images.

This thesis examines learning transferable data representations in novel deep generative models. The Semi-Supervised Adversarial Translator (SAT) utilizes adversarial methods and cross-domain weight sharing in a neural network to extract transferable representations. These transferable interpretations can then be decoded into the original image or a similar image in another domain. The Explicit Disentangling Network (EDN) utilizes generative methods to disentangle images into their core attributes and then segments sets of related attributes. The EDN can separate these attributes by controlling the flow of information using a novel combination of losses and network architecture. This separation of attributes allows precise modifications to specific components of the data representation, boosting the performance of machine learning tasks. The effectiveness of these models is evaluated across domain adaptation, style transfer, and image-to-image translation tasks.
Contributors: Eusebio, Jose Miguel Ang (Author) / Panchanathan, Sethuraman (Thesis advisor) / Davulcu, Hasan (Committee member) / Venkateswara, Hemanth (Committee member) / Arizona State University (Publisher)
Created: 2018
156747-Thumbnail Image.png
Description
Mixture of experts is a machine learning ensemble approach that consists of individual models that are trained to be "experts" on subsets of the data, and a gating network that provides weights to output a combination of the expert predictions. Mixture of experts models do not currently see wide use due to difficulty in training diverse experts and high computational requirements. This work presents modifications of the mixture of experts formulation that use domain knowledge to improve training, and incorporate parameter sharing among experts to reduce computational requirements.
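The gated combination described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the experts and the gate here are hypothetical scalar functions chosen for readability, whereas the models in this work are neural networks with shared parameters.

```python
import math

def softmax(scores):
    """Normalize raw gate scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_predict(x, experts, gate):
    """Weighted sum of expert outputs, with weights from the gating function."""
    weights = softmax(gate(x))
    outputs = [expert(x) for expert in experts]
    prediction = sum(w * o for w, o in zip(weights, outputs))
    return prediction, weights

# Hypothetical experts and gate: the gate favours expert 0 for negative
# inputs and expert 1 for positive inputs.
experts = [lambda x: 2.0 * x, lambda x: x + 10.0]
gate = lambda x: [-x, x]

prediction, weights = mixture_predict(5.0, experts, gate)
```

For a clearly positive input the gate puts almost all weight on the second expert, so the mixture output is close to that expert's prediction.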

First, this work presents an application of mixture of experts models to quality-robust visual recognition. It is first shown that human subjects outperform deep neural networks on classification of distorted images; a model, MixQualNet, that is more robust to distortions is then proposed. The proposed model consists of "experts", each trained on a particular type of image distortion. The final output of the model is a weighted sum of the expert models' outputs, where the weights are determined by a separate gating network. The proposed model also incorporates weight sharing to reduce the number of parameters as well as to increase performance.

Second, an application of mixture of experts to predict visual saliency is presented. A computational saliency model attempts to predict where humans will look in an image. In the proposed model, each expert network is trained to predict saliency for a set of closely related images. The final saliency map is computed as a weighted mixture of the expert networks' outputs, with weights determined by a separate gating network. The proposed model achieves better performance than several other visual saliency models and a baseline non-mixture model.

Finally, this work introduces a saliency model that is a weighted mixture of models trained for different levels of saliency. Levels of saliency include high saliency, which corresponds to regions where almost all subjects look, and low saliency, which corresponds to regions where some, but not all subjects look. The weighted mixture shows improved performance compared with baseline models because of the diversity of the individual model predictions.
Contributors: Dodge, Samuel Fuller (Author) / Karam, Lina (Thesis advisor) / Jayasuriya, Suren (Committee member) / Li, Baoxin (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)
Created: 2018
157174-Thumbnail Image.png
Description
Fraud is defined as the utilization of deception for illegal gain by hiding the true nature of the activity. Organizations worldwide lose around $3.7 trillion in revenue to financial crimes and fraud, which can significantly affect all levels of society. In this dissertation, I focus on credit card fraud in online transactions. Every online transaction comes with a fraud risk, and it is the merchant's liability to detect and stop fraudulent transactions. Merchants utilize various mechanisms to prevent and manage fraud, such as automated fraud detection systems and manual transaction reviews by expert fraud analysts. Many proposed solutions focus mostly on fraud detection accuracy and ignore financial considerations. Also, the highly effective manual review process is overlooked. First, I propose the Profit Optimizing Neural Risk Manager (PONRM), a selective classifier that (a) constitutes optimal collaboration between machine learning models and human expertise under industrial constraints and (b) is cost- and profit-sensitive. I suggest directions on how to characterize fraudulent behavior and assess the risk of a transaction. I show that my framework outperforms cost-sensitive and cost-insensitive baselines on three real-world merchant datasets. While PONRM is able to work with many supervised learners and obtain convincing results, utilizing probability outputs directly from the trained model itself can pose problems, especially in deep learning, as softmax output is not a true uncertainty measure. This phenomenon, together with the wide and rapid adoption of deep learning by practitioners, has brought unintended consequences in many situations, such as the infamous case of Google Photos' racist image recognition algorithm, and thus necessitates quantifying the uncertainty of each prediction.
There have been recent efforts towards quantifying uncertainty in conventional deep learning methods (e.g., dropout as Bayesian approximation); however, their optimal use in decision making is often overlooked and understudied. Thus, I present a mixed-integer programming framework for selective classification, called MIPSC, that investigates and combines model uncertainty and predictive mean to identify optimal classification and rejection regions. I also extend this framework to cost-sensitive settings (MIPCSC), focus on the critical real-world problem of online fraud management, and show that my approach significantly outperforms industry-standard methods in real-world settings.
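A selective classifier with a reject option can be sketched as follows. This is only an illustration of combining a predictive mean with an uncertainty estimate: the thresholds are hand-picked assumptions, the function name is hypothetical, and the sketch does not implement the mixed-integer program that MIPSC uses to find the classification and rejection regions. The input is assumed to be a list of fraud probabilities from repeated stochastic forward passes (for example, with dropout left on at prediction time).

```python
from statistics import mean, pstdev

def selective_decision(sampled_probs, threshold=0.5, max_std=0.1):
    """Return 'fraud', 'legit', or 'review' from repeated stochastic predictions.

    sampled_probs: fraud probabilities from several stochastic forward passes.
    threshold:     decision boundary on the predictive mean (assumed value).
    max_std:       largest spread tolerated before rejecting (assumed value).
    """
    mu = mean(sampled_probs)       # predictive mean
    sigma = pstdev(sampled_probs)  # spread across passes as model uncertainty
    if sigma > max_std:
        return "review"            # reject: hand over to a human analyst
    return "fraud" if mu >= threshold else "legit"

confident_fraud = selective_decision([0.90, 0.92, 0.88])
confident_legit = selective_decision([0.05, 0.10, 0.08])
uncertain = selective_decision([0.10, 0.90, 0.50])
```

Transactions whose predictions disagree across passes are routed to manual review, which mirrors the collaboration between automated models and expert analysts described above.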
Contributors: Yildirim, Mehmet Yigit (Author) / Davulcu, Hasan (Thesis advisor) / Bakkaloglu, Bertan (Committee member) / Huang, Dijiang (Committee member) / Hsiao, Ihan (Committee member) / Arizona State University (Publisher)
Created: 2019
157215-Thumbnail Image.png
Description
Non-line-of-sight (NLOS) imaging of objects not visible to either the camera or illumination source is a challenging task with vital applications including surveillance and robotics. Recent NLOS reconstruction advances have been achieved using time-resolved measurements. Acquiring these time-resolved measurements requires expensive and specialized detectors and laser sources. This work proposes a data-driven approach for NLOS 3D localization requiring only a conventional camera and projector. The localization is posed both as a voxelization problem and as a regression problem. Accuracy of greater than 90% is achieved in localizing an NLOS object to a 5 cm × 5 cm × 5 cm volume in real data. By adopting the regression approach, an object of width 10 cm is localized to approximately 1.5 cm. To generalize to line-of-sight (LOS) scenes with non-planar surfaces, an adaptive lighting algorithm is adopted. This algorithm, based on radiosity, identifies and illuminates scene patches in the LOS which most contribute to the NLOS light paths, and can factor in system power constraints. Improvements ranging from 6% to 15% in accuracy with a non-planar LOS wall using adaptive lighting are reported, demonstrating the advantage of combining the physics of light transport with active illumination for data-driven NLOS imaging.
Contributors: Chandran, Sreenithy (Author) / Jayasuriya, Suren (Thesis advisor) / Turaga, Pavan (Committee member) / Dasarathy, Gautam (Committee member) / Arizona State University (Publisher)
Created: 2019
154885-Thumbnail Image.png
Description
Computational visual aesthetics has recently become an active research area. Existing state-of-the-art methods formulate this as a binary classification task where a given image is predicted to be beautiful or not. In many applications such as image retrieval and enhancement, it is more important to rank images based on their aesthetic quality instead of binary-categorizing them. Furthermore, in such applications, it may be possible that all images belong to the same category. Hence determining the aesthetic ranking of the images is more appropriate. To this end, a novel problem of ranking images with respect to their aesthetic quality is formulated in this work. A new data-set of image pairs with relative labels is constructed by carefully selecting images from the popular AVA data-set. Unlike in aesthetics classification, there is no single threshold which would determine the ranking order of the images across the entire data-set.

This problem is attempted using a deep neural network based approach that is trained on image pairs by incorporating principles from relative learning. Results show that such a relative training procedure allows the network to rank the images with higher accuracy than a state-of-the-art network trained on the same set of images using binary labels. Further analysis of the results shows that a model trained on image pairs learns better aesthetic features than one trained on the same number of individual binary-labelled images.
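Training on image pairs with relative labels is commonly expressed with a pairwise margin loss. The abstract does not state which loss the network uses, so the hinge form, the margin value, and the function names below are assumptions made purely for illustration.

```python
def pairwise_ranking_loss(score_better, score_worse, margin=1.0):
    """Hinge loss that is zero once the preferred image outscores the other
    by at least the margin, and grows linearly otherwise."""
    return max(0.0, margin - (score_better - score_worse))

def ranking_accuracy(scored_pairs):
    """Fraction of pairs in which the preferred image receives the higher score.

    scored_pairs: list of (score_of_preferred, score_of_other) tuples.
    """
    correct = sum(1 for better, worse in scored_pairs if better > worse)
    return correct / len(scored_pairs)

# Hypothetical aesthetic scores produced by a network for three image pairs.
pairs = [(2.0, 0.5), (0.2, 0.5), (1.1, 1.0)]
losses = [pairwise_ranking_loss(b, w) for b, w in pairs]
accuracy = ranking_accuracy(pairs)
```

Because the loss depends only on the difference between the two scores, no single global threshold is needed, which matches the observation that a threshold cannot determine ranking order across the whole data-set.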

Additionally, an attempt is made at enhancing the performance of the system by incorporating saliency-related information. Given an image, humans might fixate their vision on particular parts of the image by which they are subconsciously intrigued. I therefore tried to utilize the saliency information both stand-alone and in combination with the global and local aesthetic features by performing two separate sets of experiments. In both cases, a standard saliency model is chosen and the generated saliency maps are convolved with the images prior to passing them to the network, thus giving higher importance to the salient regions than to the remaining ones. The saliency images generated in this way are used either independently or along with the global and local features to train the network. Empirical results show that the saliency-related aesthetic features might already be learnt by the network as a sub-set of the global features from automatic feature extraction, indicating that the additional saliency module is redundant.
Contributors: Gattupalli, Jaya Vijetha (Author) / Li, Baoxin (Thesis advisor) / Davulcu, Hasan (Committee member) / Liang, Jianming (Committee member) / Arizona State University (Publisher)
Created: 2016
155085-Thumbnail Image.png
Description
High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos.

Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as “handcrafted” features as they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolutional neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video content. Typical techniques first extract spatial features by processing raw images using deep convolutional architectures designed for static image classification. Then simple average, concatenation, or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual content and thus need to be represented in a manner considering both semantic and spatio-temporal information.

In this thesis, I propose a novel architecture to learn a semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders for capturing longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic video retrieval (word-to-video), using the UCF101 action recognition dataset.
Contributors: Hu, Sheng-Hung (Author) / Li, Baoxin (Thesis advisor) / Turaga, Pavan (Committee member) / Liang, Jianming (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)
Created: 2016
155900-Thumbnail Image.png
Description
Compressive sensing theory allows signals and images to be sensed and reconstructed at sampling rates lower than the Nyquist rate. Applications in resource-constrained environments stand to benefit from this theory, which at the same time opens up many possibilities for new applications. The traditional inference pipeline for computer vision first reconstructs the image from compressive measurements. However, the reconstruction process is a computationally expensive step that also provides poor results at high compression rates. There have been several successful attempts to perform inference tasks, such as activity recognition, directly on compressive measurements. In this thesis, I tackle a more challenging vision problem, visual question answering (VQA), without reconstructing the compressive images. I investigate the feasibility of this problem with a series of experiments, evaluate the proposed methods on a VQA dataset, and discuss promising results and directions for future work.
Contributors: Huang, Li-Chin (Author) / Turaga, Pavan (Thesis advisor) / Yang, Yezhou (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)
Created: 2017
155809-Thumbnail Image.png
Description
Light field imaging is limited by the computational processing demands of high sampling in both spatial and angular dimensions. Single-shot light field cameras sacrifice spatial resolution to sample angular viewpoints, typically by multiplexing incoming rays onto a 2D sensor array. While this resolution can be recovered using compressive sensing, these iterative solutions are slow in processing a light field. We present a deep learning approach using a new, two-branch network architecture, consisting jointly of an autoencoder and a 4D CNN, to recover a high-resolution 4D light field from a single coded 2D image. This network decreases reconstruction time significantly while achieving average PSNR values of 26-32 dB on a variety of light fields. In particular, reconstruction time is decreased from 35 minutes to 6.7 minutes as compared to the dictionary method for equivalent visual quality. These reconstructions are performed at small sampling/compression ratios as low as 8%, allowing for cheaper coded light field cameras. We test our network reconstructions on synthetic light fields, simulated coded measurements of real light fields captured from a Lytro Illum camera, and real coded images from a custom CMOS diffractive light field camera. The combination of compressive light field capture with deep learning allows the potential for real-time light field video acquisition systems in the future.
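For readers unfamiliar with the quality metric quoted above, PSNR is defined as 10 log10(peak^2 / MSE). The sketch below computes it for flat lists of pixel values; the function name, the assumption that intensities lie in [0, 1], and the toy values are illustrative and are not taken from the thesis. The last line works out the speed-up implied by the two reconstruction times reported in the abstract.

```python
import math

def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    peak is the maximum possible pixel value (1.0 assumes normalized images).
    """
    mse = sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)

# Toy example: every pixel is off by 0.1, so MSE = 0.01 and PSNR is about 20 dB.
example_db = psnr([0.0, 1.0], [0.1, 0.9])

# Speed-up implied by the reported times (35 minutes vs. 6.7 minutes).
speedup = 35 / 6.7
```

On this scale, the reported 26-32 dB range corresponds to a mean squared error of roughly 0.0025 down to 0.0006 for images normalized to [0, 1].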
Contributors: Gupta, Mayank (Author) / Turaga, Pavan (Thesis advisor) / Yang, Yezhou (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)
Created: 2017