Theses and Dissertations
Filtering by
- All Subjects: deep learning
- Creators: Turaga, Pavan
- Creators: Chakrabarti, Chaitali CC
The eld has seen tremendous success in designing learning systems with hand-crafted
features and in using representation learning to extract better features. In this dissertation
some novel approaches to representation learning and task learning are studied.
Multiple-instance learning which is generalization of supervised learning, is one
example of task learning that is discussed. In particular, a novel non-parametric k-
NN-based multiple-instance learning is proposed, which is shown to outperform other
existing approaches. This solution is applied to a diabetic retinopathy pathology
detection problem eectively.
In cases of representation learning, generality of neural features are investigated
rst. This investigation leads to some critical understanding and results in feature
generality among datasets. The possibility of learning from a mentor network instead
of from labels is then investigated. Distillation of dark knowledge is used to eciently
mentor a small network from a pre-trained large mentor network. These studies help
in understanding representation learning with smaller and compressed networks.
The feature extraction processes can be categorized into three groups. The first group contains processes that are hand-crafted for a specific task. Hand-engineering features requires the knowledge of domain experts and manual labor. However, the feature extraction process is interpretable and explainable. Next group contains the latent-feature extraction processes. While the original feature lies in a high-dimensional space, the relevant factors for a task often lie on a lower dimensional manifold. The latent-feature extraction employs hidden variables to expose the underlying data properties that cannot be directly measured from the input. Latent features seek a specific structure such as sparsity or low-rank into the derived representation through sophisticated optimization techniques. The last category is that of deep features. These are obtained by passing raw input data with minimal pre-processing through a deep network. Its parameters are computed by iteratively minimizing a task-based loss.
In this dissertation, I present four pieces of work where I create and learn suitable data representations. The first task employs hand-crafted features to perform clinically-relevant retrieval of diabetic retinopathy images. The second task uses latent features to perform content-adaptive image enhancement. The third task ranks a pair of images based on their aestheticism. The goal of the last task is to capture localized image artifacts in small datasets with patch-level labels. For both these tasks, I propose novel deep architectures and show significant improvement over the previous state-of-art approaches. A suitable combination of feature representations augmented with an appropriate learning approach can increase performance for most visual computing tasks.
First, this work presents an application of mixture of experts models for quality robust visual recognition. First it is shown that human subjects outperform deep neural networks on classification of distorted images, and then propose a model, MixQualNet, that is more robust to distortions. The proposed model consists of ``experts'' that are trained on a particular type of image distortion. The final output of the model is a weighted sum of the expert models, where the weights are determined by a separate gating network. The proposed model also incorporates weight sharing to reduce the number of parameters, as well as increase performance.
Second, an application of mixture of experts to predict visual saliency is presented. A computational saliency model attempts to predict where humans will look in an image. In the proposed model, each expert network is trained to predict saliency for a set of closely related images. The final saliency map is computed as a weighted mixture of the expert networks' outputs, with weights determined by a separate gating network. The proposed model achieves better performance than several other visual saliency models and a baseline non-mixture model.
Finally, this work introduces a saliency model that is a weighted mixture of models trained for different levels of saliency. Levels of saliency include high saliency, which corresponds to regions where almost all subjects look, and low saliency, which corresponds to regions where some, but not all subjects look. The weighted mixture shows improved performance compared with baseline models because of the diversity of the individual model predictions.
tion source is a challenging task with vital applications including surveillance and robotics.
Recent NLOS reconstruction advances have been achieved using time-resolved measure-
ments. Acquiring these time-resolved measurements requires expensive and specialized
detectors and laser sources. In work proposes a data-driven approach for NLOS 3D local-
ization requiring only a conventional camera and projector. The localisation is performed
using a voxelisation and a regression problem. Accuracy of greater than 90% is achieved
in localizing a NLOS object to a 5cm × 5cm × 5cm volume in real data. By adopting
the regression approach an object of width 10cm to localised to approximately 1.5cm. To
generalize to line-of-sight (LOS) scenes with non-planar surfaces, an adaptive lighting al-
gorithm is adopted. This algorithm, based on radiosity, identifies and illuminates scene
patches in the LOS which most contribute to the NLOS light paths, and can factor in sys-
tem power constraints. Improvements ranging from 6%-15% in accuracy with a non-planar
LOS wall using adaptive lighting is reported, demonstrating the advantage of combining
the physics of light transport with active illumination for data-driven NLOS imaging.
Many video feature extraction algorithms have been purposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as “handcrafted” features as they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolution neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video contents. Typical techniques first extract spatial features by processing raw images using deep convolution architectures designed for static image classifications. Then simple average, concatenation or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual contents and thus need to be represented in a manner considering both semantic and spatio-temporal information.
In this thesis, I propose a novel architecture to learn semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders for capturing longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic video retrieval (word-to-video) retrieval, using the UCF101 action recognition dataset.
sampling for both spatial and angular dimensions. Single-shot light field cameras
sacrifice spatial resolution to sample angular viewpoints, typically by multiplexing
incoming rays onto a 2D sensor array. While this resolution can be recovered using
compressive sensing, these iterative solutions are slow in processing a light field. We
present a deep learning approach using a new, two branch network architecture,
consisting jointly of an autoencoder and a 4D CNN, to recover a high resolution
4D light field from a single coded 2D image. This network decreases reconstruction
time significantly while achieving average PSNR values of 26-32 dB on a variety of
light fields. In particular, reconstruction time is decreased from 35 minutes to 6.7
minutes as compared to the dictionary method for equivalent visual quality. These
reconstructions are performed at small sampling/compression ratios as low as 8%,
allowing for cheaper coded light field cameras. We test our network reconstructions
on synthetic light fields, simulated coded measurements of real light fields captured
from a Lytro Illum camera, and real coded images from a custom CMOS diffractive
light field camera. The combination of compressive light field capture with deep
learning allows the potential for real-time light field video acquisition systems in the
future.
This work addresses the following four problems: (i) Will a blockage occur in the near future? (ii) When will this blockage occur? (iii) What is the type of the blockage? And (iv) what is the direction of the moving blockage? The proposed solution utilizes deep neural networks (DNN) as well as non-machine learning (ML) algorithms. At the heart of the proposed method is identification of special patterns of received signal and sensory data before the blockage occurs (\textit{pre-blockage signatures}) and to infer future blockages utilizing these signatures. To evaluate the proposed approach, first real-world datasets are built for both in-band mmWave system and LiDAR-aided in mmWave systems based on the DeepSense 6G structure. In particular, for in-band mmWave system, two real-world datasets are constructed -- one for indoor scenario and the other for outdoor scenario. Then DNN models are developed to proactively predict the incoming blockages for both scenarios. For LiDAR-aided blockage prediction, a large-scale real-world dataset that includes co-existing LiDAR and mmWave communication measurements is constructed for outdoor scenarios. Then, an efficient LiDAR data denoising (static cluster removal) algorithm is designed to clear the dataset noise. Finally, a non-ML method and a DNN model that proactively predict dynamic link blockages are developed. Experiments using in-band mmWave datasets show that, the proposed approach can successfully predict the occurrence of future dynamic blockages (up to 5 s) with more than 80% accuracy (indoor scenario). Further, for the outdoor scenario with highly-mobile vehicular blockages, the proposed model can predict the exact time of the future blockage with less than 100 ms error for blockages happening within the future 600 ms. Further, our proposed method can predict the size and moving direction of the blockages. For the co-existing LiDAR and mmWave real-world dataset, our LiDAR-aided approach is shown to achieve above 95% accuracy in predicting blockages occurring within 100 ms and more than 80% prediction accuracy for blockages occurring within one second. Further, for the outdoor scenario with highly-mobile vehicular blockages, the proposed model can predict the exact time of the future blockage with less than 150 ms error for blockages happening within one second. In addition, our method achieves above 92% accuracy to classify the type of blockages and above 90% accuracy predicting the blockage moving direction. The proposed solutions can potentially provide an order of magnitude saving in the network latency, thereby highlighting a promising approach for addressing the blockage challenges in mmWave/sub-THz networks.