To ensure system integrity, robots need to proactively avoid any unwanted physical perturbation that may cause damage to the underlying hardware. In this thesis work, we investigate a machine learning approach that allows robots to anticipate impending physical perturbations from perceptual cues. In contrast to other approaches that require knowledge about sources of perturbation to be encoded before deployment, our method is based on experiential learning. Robots learn to associate visual cues with subsequent physical perturbations and contacts. In turn, these extracted visual cues are used to predict potential future perturbations acting on the robot. To this end, we introduce a novel deep network architecture which combines multiple sub-networks for dealing with robot dynamics and perceptual input from the environment. We present a self-supervised approach for training the system that does not require any labeling of training data. Extensive experiments in a human-robot interaction task show that a robot can learn to predict physical contact by a human interaction partner without any prior information or labeling. Furthermore, the network is able to successfully predict physical contact from either depth stream input, traditional video input, or both modalities combined.
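To make the architecture concrete, the following is a minimal sketch of a two-branch network of the kind described above, with one sub-network for the visual stream and one for the robot's proprioceptive state. The module names, dimensions, and PyTorch framing are illustrative assumptions, not the thesis's actual implementation.

# Hypothetical sketch of a two-branch contact-anticipation network.
import torch
import torch.nn as nn

class ContactAnticipationNet(nn.Module):
    def __init__(self, proprio_dim=14, hidden=128):
        super().__init__()
        # Visual branch: a small CNN over depth or RGB frames.
        self.visual = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Dynamics branch: joint positions/velocities, torques, etc.
        self.dynamics = nn.Sequential(nn.Linear(proprio_dim, hidden), nn.ReLU())
        # Fused head predicts probability of contact in the near future.
        self.head = nn.Sequential(
            nn.Linear(32 + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, frame, proprio):
        z = torch.cat([self.visual(frame), self.dynamics(proprio)], dim=1)
        return torch.sigmoid(self.head(z))  # P(contact within some horizon)

# Self-supervised label: a later contact event detected from the robot's own
# torque/force sensing, so no manual annotation of training data is required.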
In recent years, several methods have been proposed to encode sentences into fixed-length continuous vectors called sentence representations or sentence embeddings. With the recent advancements in various deep learning methods applied to Natural Language Processing (NLP), these representations play a crucial role in tasks such as named entity recognition, question answering, and sentence classification.
Traditionally, sentence vector representations are learned from their constituent word representations, also known as word embeddings. Various methods to learn the distributed representation (embedding) of words have been proposed using the notion of Distributional Semantics, i.e. “the meaning of a word is characterized by the company it keeps”. However, the principle of compositionality states that the meaning of a sentence is a function of the meanings of its words and also of the way they are syntactically combined. In various recent methods for sentence representation, syntactic information such as the dependency relations between words has been largely ignored.
In this work, I have explored the effectiveness of sentence representations that are composed from the representations of both the constituent words and the relations between the words in a sentence. The word and relation embeddings are learned based on their context. These general-purpose embeddings can also be used as off-the-shelf semantic and syntactic features for various NLP tasks. Similarity evaluation tasks were performed on two datasets, showing the usefulness of the learned word embeddings. Experiments were conducted on three different sentence classification tasks, showing that our sentence representations outperform the original word-based sentence representations when used with state-of-the-art neural network architectures.
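As a rough illustration of composing a sentence vector from both word and relation embeddings (the exact composition function in this work may differ), consider the sketch below, in which the dependency triples and the two embedding tables are assumed to be given.

# Minimal sketch: sentence vector from word and dependency-relation embeddings.
import numpy as np

def sentence_vector(tokens, dep_triples, word_emb, rel_emb):
    """tokens: list of words; dep_triples: (head, relation, dependent) tuples."""
    w = np.mean([word_emb[t] for t in tokens], axis=0)
    r = np.mean([rel_emb[rel] for _, rel, _ in dep_triples], axis=0)
    # Concatenate the averaged word vector and the averaged relation vector.
    return np.concatenate([w, r])

# Example with toy 3-dimensional embeddings:
word_emb = {"dogs": np.ones(3), "bark": np.zeros(3)}
rel_emb = {"nsubj": np.full(3, 0.5)}
vec = sentence_vector(["dogs", "bark"], [("bark", "nsubj", "dogs")], word_emb, rel_emb)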
Compressive sensing theory makes it possible to sense and reconstruct signals/images at a sampling rate lower than the Nyquist rate. Applications in resource-constrained environments stand to benefit from this theory, which at the same time opens up many possibilities for new applications. The traditional inference pipeline for computer vision first reconstructs the image from the compressive measurements. However, the reconstruction process is a computationally expensive step that also yields poor results at high compression rates. There have been several successful attempts to perform inference tasks, such as activity recognition, directly on compressive measurements. In this thesis, I tackle a more challenging vision problem, Visual Question Answering (VQA), without reconstructing the compressive images. I investigate the feasibility of this problem with a series of experiments, evaluate the proposed methods on a VQA dataset, and discuss promising results and directions for future work.
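For background, the standard compressive-sensing measurement model underlying this setting is sketched below; the measurement matrix and sampling rate shown are illustrative, not those of the evaluated system.

# Background sketch of compressive sensing (standard model, not the thesis's method):
# an image x is sensed through m << n random projections y = Phi @ x.
import numpy as np

n = 32 * 32            # vectorized image size
m = int(0.1 * n)       # ~10% sampling rate, far below the Nyquist requirement
Phi = np.random.randn(m, n) / np.sqrt(m)   # random measurement matrix
x = np.random.rand(n)                      # toy "image"
y = Phi @ x                                # compressive measurements
# Traditional pipelines reconstruct x from y (e.g., via l1 minimization) before
# running vision models; this thesis instead runs inference directly on y.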
Light field imaging is limited by its computational processing demands of high sampling in both the spatial and angular dimensions. Single-shot light field cameras sacrifice spatial resolution to sample angular viewpoints, typically by multiplexing incoming rays onto a 2D sensor array. While this resolution can be recovered using compressive sensing, these iterative solutions are slow in processing a light field. We present a deep learning approach using a new, two-branch network architecture, consisting jointly of an autoencoder and a 4D CNN, to recover a high-resolution 4D light field from a single coded 2D image. This network decreases reconstruction time significantly while achieving average PSNR values of 26-32 dB on a variety of light fields. In particular, reconstruction time is decreased from 35 minutes to 6.7 minutes as compared to the dictionary method for equivalent visual quality. These reconstructions are performed at small sampling/compression ratios as low as 8%, allowing for cheaper coded light field cameras. We test our network reconstructions on synthetic light fields, simulated coded measurements of real light fields captured from a Lytro Illum camera, and real coded images from a custom CMOS diffractive light field camera. The combination of compressive light field capture with deep learning allows the potential for real-time light field video acquisition systems in the future.
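For reference, the PSNR values quoted above follow the standard definition; a minimal helper, assuming 8-bit image data, is:

import numpy as np

def psnr(reference, reconstruction, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((reference.astype(np.float64) - reconstruction.astype(np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)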
LPMLN is a recent probabilistic logic programming language which combines Answer Set Programming (ASP) and Markov Logic. It is a proper extension of answer set programs that allows for reasoning about uncertainty using weighted rules under the stable model semantics, with a weight scheme adopted from Markov Logic. LPMLN has been shown to be related to several formalisms from the knowledge representation (KR) side, such as ASP and P-Log, and from the statistical relational learning (SRL) side, such as Markov Logic Networks (MLN), Problog, and Pearl’s causal models (PCM). Formalisms like ASP, P-Log, Problog, MLN, and PCM have all been shown to be embeddable in LPMLN, which demonstrates the expressivity of the language. Interestingly, LPMLN has also been shown to be reducible to ASP and MLN, which is not only theoretically interesting but also practically important from a computational point of view, in that the reductions yield ways to compute LPMLN programs using ASP and MLN solvers. Additionally, the reductions allow users to compute other formalisms that can be reduced to LPMLN.
This thesis realizes two implementations of LPMLN based on the reductions from LPMLN to ASP and from LPMLN to MLN. It first presents an implementation called LPMLN2ASP that uses standard ASP solvers to compute MAP inference using weak constraints, and marginal and conditional probabilities using stable model enumeration. Next, another implementation called LPMLN2MLN is presented that uses MLN solvers, applying completion to compute the tight fragment of LPMLN programs for MAP inference and for marginal and conditional probabilities. The computation using ASP solvers yields exact inference, as opposed to the approximate inference obtained with MLN solvers. Using these implementations, the usefulness of LPMLN for computing other formalisms is demonstrated by reducing them to LPMLN. The thesis also shows how the implementations outperform the native solvers of some of these formalisms on certain domains. The implementations make use of current state-of-the-art solving technologies in ASP and MLN, and therefore benefit from any theoretical and practical advances in these technologies, thereby also benefiting the computation of other formalisms that can be reduced to LPMLN. Furthermore, the implementations also allow certain SRL formalisms to be computed by ASP solvers, and certain KR formalisms to be computed by MLN solvers.
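For readers unfamiliar with the weight scheme, the standard LPMLN semantics (as I recall it from Lee and Wang's formulation) assigns to an interpretation I of a program Π the weight

    W_Π(I) = exp( Σ_{w:R ∈ Π_I} w )   if I is a stable model of Π_I, and W_Π(I) = 0 otherwise,

where Π_I is the set of weighted rules w:R in Π whose rule R is satisfied by I. The probability of I is then obtained by normalizing: P_Π(I) = W_Π(I) / Σ_J W_Π(J).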
Computer Vision as a field has gone through significant changes in the last decade. The field has seen tremendous success in designing learning systems with hand-crafted features and in using representation learning to extract better features. In this dissertation, some novel approaches to representation learning and task learning are studied.
Multiple-instance learning, which is a generalization of supervised learning, is one example of task learning that is discussed. In particular, a novel non-parametric k-NN-based multiple-instance learning approach is proposed, which is shown to outperform other existing approaches. This solution is applied effectively to a diabetic retinopathy pathology detection problem.
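One common non-parametric, k-NN-style formulation of multiple-instance learning compares bags of instances using the minimal Hausdorff distance, as in citation-kNN-style methods; the sketch below illustrates that flavor of approach and is not necessarily the method proposed in the dissertation.

# Hedged sketch of k-NN multiple-instance learning with minimal Hausdorff distance.
import numpy as np

def min_hausdorff(bag_a, bag_b):
    """Minimum distance between any instance of bag_a and any instance of bag_b."""
    d = np.linalg.norm(bag_a[:, None, :] - bag_b[None, :, :], axis=-1)
    return d.min()

def knn_mil_predict(query_bag, train_bags, train_labels, k=3):
    dists = np.array([min_hausdorff(query_bag, b) for b in train_bags])
    nearest = np.argsort(dists)[:k]
    votes = np.array(train_labels)[nearest]
    return int(round(votes.mean()))  # majority vote over bag labels (0/1)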
In the case of representation learning, the generality of neural features is investigated first. This investigation leads to some critical understanding and results concerning feature generality among datasets. The possibility of learning from a mentor network instead of from labels is then investigated. Distillation of dark knowledge is used to efficiently mentor a small network from a pre-trained large mentor network. These studies help in understanding representation learning with smaller and compressed networks.
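The distillation step referred to above typically follows the standard "dark knowledge" recipe, in which the small network is trained against the mentor's temperature-softened outputs; a minimal sketch (with illustrative hyperparameters) is:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets from the pre-trained mentor (teacher) network.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label cross-entropy on the ground truth, when labels are available.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard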
The performance of most visual computing tasks depends on the quality of the features extracted from the raw data. An insightful feature representation increases the performance of many learning algorithms by exposing the underlying explanatory factors of the output for unobserved inputs. A good representation should also handle anomalies in the data, such as missing samples and noisy input caused by undesired external factors of variation, and it should reduce data redundancy. Over the years, many feature extraction processes have been invented to produce good representations of raw images and videos.
The feature extraction processes can be categorized into three groups. The first group contains processes that are hand-crafted for a specific task. Hand-engineering features requires the knowledge of domain experts and manual labor; however, the resulting feature extraction process is interpretable and explainable. The next group contains the latent-feature extraction processes. While the original features lie in a high-dimensional space, the factors relevant for a task often lie on a lower-dimensional manifold. Latent-feature extraction employs hidden variables to expose underlying data properties that cannot be directly measured from the input, and it imposes a specific structure, such as sparsity or low rank, on the derived representation through sophisticated optimization techniques. The last category is that of deep features, which are obtained by passing raw input data, with minimal pre-processing, through a deep network whose parameters are computed by iteratively minimizing a task-based loss.
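A typical instance of the second category is sparse coding, which seeks a latent code z for an input x under a learned dictionary D by solving

    min_{D, z}  ||x - D z||_2^2 + λ ||z||_1 ,

so that the sparsity of the representation is imposed explicitly through the optimization; low-rank latent models play the analogous role with a nuclear-norm penalty. (This is offered as a standard example of the category, not as the specific method used later in the dissertation.)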
In this dissertation, I present four pieces of work in which I create and learn suitable data representations. The first task employs hand-crafted features to perform clinically relevant retrieval of diabetic retinopathy images. The second task uses latent features to perform content-adaptive image enhancement. The third task ranks a pair of images based on their aesthetic quality. The goal of the last task is to capture localized image artifacts in small datasets with patch-level labels. For both of these last two tasks, I propose novel deep architectures and show significant improvement over previous state-of-the-art approaches. A suitable combination of feature representations, augmented with an appropriate learning approach, can increase performance for most visual computing tasks.
Reinforcement learning (RL) is a powerful methodology for teaching autonomous agents complex behaviors and skills. A critical component in most RL algorithms is the reward function -- a mathematical function that provides numerical estimates for desirable and undesirable states. Typically, the reward function must be hand-designed by a human expert; as a result, the scope of a robot's autonomy and its ability to safely explore and learn in new and unforeseen environments is constrained by the specifics of the designed reward function. In this thesis, I design and implement a stateful collision anticipation model with powerful predictive capability, based upon my research on sequential data modeling and modern recurrent neural networks. I also develop deep reinforcement learning methods whose rewards are generated by self-supervised training and intrinsic signals. The main objective is to work towards the development of resilient robots that can learn to anticipate and avoid damaging interactions by combining visual and proprioceptive cues from internal sensors. The introduced solutions are inspired by pain pathways in humans and animals, because such pathways are known to guide decision-making processes and promote self-preservation. A new "robot dodge ball" benchmark is introduced in order to test the validity of the developed algorithms in dynamic environments.
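As an illustration of how a reward can come from self-supervised, intrinsic signals rather than a hand-designed function, the sketch below penalizes states in which a (hypothetical) contact-anticipation model predicts imminent harmful contact; the exact reward formulation in this thesis may differ.

import torch

def intrinsic_reward(anticipation_model, observation, proprioception,
                     progress_bonus=0.0, pain_weight=1.0):
    # Assumes a batch of one observation; the model outputs P(contact) in [0, 1].
    with torch.no_grad():
        p_contact = anticipation_model(observation, proprioception)
    # Negative "pain" signal for states predicted to lead to damaging contact,
    # plus any task-progress term the agent computes on its own.
    return progress_bonus - pain_weight * p_contact.item()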
In recent years, deep learning systems have outperformed traditional machine learning systems in most domains. There has been a lot of research recently in the field of hand gesture recognition using wearable sensors, due to the numerous advantages these systems have over vision-based ones. However, due to the lack of extensive datasets and the nature of Inertial Measurement Unit (IMU) data, there are difficulties in applying deep learning techniques to them. Although many machine learning models have good accuracy, most of them assume that training data is available for every user, while other works that do not require per-user data have lower accuracies. MirrorGen is a technique that uses wearable sensor data to generate synthetic videos of hand movements, mitigating the traditional challenges of vision-based recognition such as occlusion, lighting restrictions, lack of viewpoint variation, and environmental noise. In addition, MirrorGen allows for user-independent recognition involving minimal human effort during data collection. It also helps leverage the advances in vision-based recognition by using techniques such as optical flow extraction and 3D convolution. Projecting the orientation (IMU) information onto a video helps in gaining position information of the hands. To validate these claims, we perform entropy analysis on various configurations such as raw data, a stick model, a hand model, and real video. The human hand model is found to have an optimal entropy that helps in achieving user-independent recognition, and it also serves as a more pervasive option than purely video-based recognition. An average user-independent recognition accuracy of 99.03% was achieved for a sign language dataset with 59 different users and 20 different signs with 20 repetitions each, for a total of 23k training instances. Moreover, the synthetic videos can be used to augment real videos to improve recognition accuracy.
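The entropy analysis mentioned above can be illustrated with a simple per-frame Shannon entropy over the intensity histogram; the exact measure used in this work may differ.

import numpy as np

def frame_entropy(gray_frame, bins=256):
    """Shannon entropy (in bits) of a grayscale frame's intensity histogram."""
    hist, _ = np.histogram(gray_frame, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))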
Rapid intraoperative diagnosis of brain tumors is of great importance for planning treatment and guiding the surgeon about the extent of resection. Currently, the standard for preliminary intraoperative tissue analysis is frozen section biopsy, which has major limitations such as tissue freezing and cutting artifacts, sampling errors, lack of immediate interaction between the pathologist and the surgeon, and its time-consuming nature.
Handheld, portable confocal laser endomicroscopy (CLE) is being explored in neurosurgery for its ability to image histopathological features of tissue at cellular resolution in real time during brain tumor surgery. Over the course of examining the surgical tumor resection, hundreds to thousands of images may be collected. This high number of images requires significant time and storage for subsequent review, which has motivated several research groups to employ deep convolutional neural networks (DCNNs) to improve CLE's utility during surgery. DCNNs have proven to be useful in natural and medical image analysis tasks such as classification, object detection, and image segmentation.
This thesis proposes using DCNNs for analyzing CLE images of brain tumors. In particular, it explores the practicality of DCNNs in three main tasks. First, off-the-shelf DCNNs were used to classify images into diagnostic and non-diagnostic; further experiments showed that both ensemble modeling and transfer learning improved the classifier’s accuracy in evaluating the diagnostic quality of new images at the test stage. Second, a weakly-supervised learning pipeline was developed for localizing key features of diagnostic CLE images from gliomas. Third, image style transfer was used to improve the diagnostic quality of CLE images from glioma tumors by transforming the histology patterns in CLE images of fluorescein sodium-stained tissue into those in conventional hematoxylin and eosin-stained tissue slides.
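As an illustration of the transfer-learning setup in the first task, the sketch below adapts an off-the-shelf ImageNet-pretrained network to the diagnostic/non-diagnostic decision; the model choice and training details are assumptions, not the thesis's actual configuration.

import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():          # optionally freeze the pretrained backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)   # diagnostic vs. non-diagnostic
# Fine-tune model.fc (and optionally later blocks) on labeled CLE frames;
# ensembling several such fine-tuned networks can further improve accuracy.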
These studies suggest that DCNNs are well suited for the analysis of CLE images. They may assist surgeons by sorting out the non-diagnostic images, highlighting the key regions, and enhancing their appearance through pattern transformation in real time. With recent advances in deep learning such as generative adversarial networks and semi-supervised learning, new research directions should be pursued to uncover further potential of DCNNs in CLE image analysis.