Matching Items (71)
Description
Computer vision as a field has gone through significant changes in the last decade. The field has seen tremendous success in designing learning systems with hand-crafted features and in using representation learning to extract better features. In this dissertation, some novel approaches to representation learning and task learning are studied.

Multiple-instance learning, which is a generalization of supervised learning, is one example of task learning that is discussed. In particular, a novel non-parametric k-NN-based multiple-instance learning approach is proposed, which is shown to outperform other existing approaches. This solution is applied effectively to a diabetic retinopathy pathology detection problem.

In the case of representation learning, the generality of neural features is investigated first. This investigation leads to some critical understanding and results in feature generality among datasets. The possibility of learning from a mentor network instead of from labels is then investigated. Distillation of dark knowledge is used to efficiently mentor a small network from a pre-trained large mentor network. These studies help in understanding representation learning with smaller and compressed networks.
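As a rough illustration of the distillation idea mentioned above, the sketch below softens the outputs of a mentor network with a temperature and scores a student against them. It is a generic, minimal sketch of temperature-based distillation, not the dissertation's method; the function names and the default temperature of 4.0 are assumptions chosen for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, mentor_logits, temperature=4.0):
    """Cross-entropy between the softened mentor and student distributions.

    The temperature value is a hypothetical default, not taken from the source.
    """
    p_mentor = softmax_with_temperature(mentor_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    return -sum(pm * math.log(ps) for pm, ps in zip(p_mentor, p_student))
```

A student whose logits match the mentor's yields the lowest loss for that mentor; minimizing this loss pushes the student toward the mentor's full output distribution and not only its top label.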
Contributors: Venkatesan, Ragav (Author) / Li, Baoxin (Thesis advisor) / Turaga, Pavan (Committee member) / Yang, Yezhou (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
With the rise of the Big Data Era, an exponentially growing amount of network data is being generated at an unprecedented rate across a wide range of high-impact micro and macro areas of research, from protein interaction to social networks. The critical challenge is translating this large-scale network data into actionable information.

A key task in the data translation is the analysis of network connectivity via marked nodes, which is the primary focus of our research. We have developed a framework for analyzing network connectivity via marked nodes in large-scale graphs, utilizing novel algorithms in three interrelated areas: (1) analysis of a single seed node via its ego-centric network (AttriPart algorithm); (2) pathway identification between two seed nodes (K-Simple Shortest Paths Multithreaded and Search Reduced (KSSPR) algorithm); and (3) tree detection, defining the interaction between three or more seed nodes (Shortest Path MST algorithm).

In an effort to address both fundamental and applied research issues, we have developed the LocalForecasting algorithm to explore how network connectivity analysis can be applied to local community evolution and recommender systems. The goal is to apply the LocalForecasting algorithm to various domains, e.g., friend suggestions in social networks or future collaboration in co-authorship networks. This algorithm utilizes link prediction in combination with the AttriPart algorithm to predict future connections in local graph partitions.

Results show that our proposed AttriPart algorithm finds up to 1.6x denser local partitions, while running approximately 43x faster than traditional local partitioning techniques (PageRank-Nibble). In addition, our LocalForecasting algorithm demonstrates a significant improvement in the number of nodes and edges correctly predicted over baseline methods. Furthermore, results for the KSSPR algorithm demonstrate a speed-up of up to 2.5x over the standard k-simple shortest paths algorithm.
Contributors: Freitas, Scott (Author) / Tong, Hanghang (Thesis advisor) / Maciejewski, Ross (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
The performance of most visual computing tasks depends on the quality of the features extracted from the raw data. Insightful feature representation increases the performance of many learning algorithms by exposing the underlying explanatory factors of the output for the unobserved input. A good representation should also handle anomalies in the data, such as missing samples and noisy input caused by undesired, external factors of variation. It should also reduce data redundancy. Over the years, many feature extraction processes have been invented to produce good representations of raw images and videos.

The feature extraction processes can be categorized into three groups. The first group contains processes that are hand-crafted for a specific task. Hand-engineering features requires the knowledge of domain experts and manual labor. However, the feature extraction process is interpretable and explainable. The next group contains the latent-feature extraction processes. While the original feature lies in a high-dimensional space, the relevant factors for a task often lie on a lower-dimensional manifold. Latent-feature extraction employs hidden variables to expose the underlying data properties that cannot be directly measured from the input. Latent-feature methods impose a specific structure, such as sparsity or low rank, on the derived representation through sophisticated optimization techniques. The last category is that of deep features. These are obtained by passing raw input data with minimal pre-processing through a deep network, whose parameters are computed by iteratively minimizing a task-based loss.

In this dissertation, I present four pieces of work where I create and learn suitable data representations. The first task employs hand-crafted features to perform clinically relevant retrieval of diabetic retinopathy images. The second task uses latent features to perform content-adaptive image enhancement. The third task ranks a pair of images based on their aesthetics. The goal of the last task is to capture localized image artifacts in small datasets with patch-level labels. For both of these last two tasks, I propose novel deep architectures and show significant improvement over the previous state-of-the-art approaches. A suitable combination of feature representations augmented with an appropriate learning approach can increase performance for most visual computing tasks.
Contributors: Chandakkar, Parag Shridhar (Author) / Li, Baoxin (Thesis advisor) / Yang, Yezhou (Committee member) / Turaga, Pavan (Committee member) / Davulcu, Hasan (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Topological methods for data analysis present opportunities for enforcing certain invariances of broad interest in computer vision, including view-point in activity analysis, articulation in shape analysis, and measurement invariance in non-linear dynamical modeling. The increasing success of these methods is attributed to the complementary information that topology provides, as well as the availability of tools for computing topological summaries such as persistence diagrams. However, persistence diagrams are multi-sets of points, and hence it is not straightforward to fuse them with features used for contemporary machine learning tools like deep nets. In this paper, theoretically well-grounded approaches to develop novel perturbation-robust topological representations are presented, with the long-term view of making them amenable to fusion with contemporary learning architectures. The proposed representation lives on a Grassmann manifold and hence can be efficiently used in machine learning pipelines.

The efficacy of the proposed descriptor was explored on three applications: view-invariant activity analysis, 3D shape analysis, and non-linear dynamical modeling. Favorable results are obtained in both high-level recognition performance and reduced time complexity when compared to other baseline methods.
Contributors: Thopalli, Kowshik (Author) / Turaga, Pavan Kumar (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Image Understanding is a long-established discipline in computer vision, which encompasses a body of advanced image processing techniques that are used to locate (“where”), characterize and recognize (“what”) objects, regions, and their attributes in the image. However, the notion of “understanding” (and the goal of artificially intelligent machines) goes beyond factual recall of the recognized components and includes reasoning and thinking beyond what can be seen (or perceived). Understanding is often evaluated by asking questions of increasing difficulty. Thus, the expected functionalities of an intelligent Image Understanding system can be expressed in terms of the functionalities that are required to answer questions about an image. Answering questions about images requires primarily three components: Image Understanding, question (natural language) understanding, and reasoning based on knowledge. Any question asking beyond what can be directly seen requires modeling of commonsense (or background/ontological/factual) knowledge and reasoning.

Knowledge and reasoning have seen scarce use in image understanding applications. In this thesis, we demonstrate the utility of incorporating background knowledge and using explicit reasoning in image understanding applications. We first present a comprehensive survey of the previous work that utilized background knowledge and reasoning in understanding images. This survey outlines the limited use of commonsense knowledge in high-level applications. We then present a set of vision and reasoning-based methods to solve several applications and show that these approaches benefit in terms of accuracy and interpretability from the explicit use of knowledge and reasoning. We propose novel knowledge representations of images, knowledge acquisition methods, and a new implementation of an efficient probabilistic logical reasoning engine that can utilize publicly available commonsense knowledge to solve applications such as visual question answering and image puzzles. Additionally, we identify the need for new datasets that explicitly require external commonsense knowledge to solve. We propose the new task of Image Riddles, which requires a combination of vision and reasoning based on ontological knowledge, and we collect a sufficiently large dataset to serve as an ideal testbed for vision and reasoning research. Lastly, we propose end-to-end deep architectures that can combine vision, knowledge and reasoning modules together and achieve large performance boosts over state-of-the-art methods.
Contributors: Aditya, Somak (Author) / Baral, Chitta (Thesis advisor) / Yang, Yezhou (Thesis advisor) / Aloimonos, Yiannis (Committee member) / Lee, Joohyung (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Network mining has been attracting a lot of research attention because of the prevalence of networks. As the world is becoming increasingly connected and correlated, networks arising from inter-dependent application domains are often collected from different sources, forming the so-called multi-sourced networks. Examples of such multi-sourced networks include critical infrastructure networks, multi-platform social networks, cross-domain collaboration networks, and many more. Compared with single-sourced networks, multi-sourced networks bear more complex structures and therefore could potentially contain more valuable information.

This thesis proposes a multi-layered HITS (Hyperlink-Induced Topic Search) algorithm to perform the ranking task on multi-sourced networks. Specifically, each node in the network receives an authority score and a hub score for evaluating the value of the node itself and the value of its outgoing links, respectively. Based on a recent multi-layered network model, which allows a more flexible dependency structure across different sources (i.e., layers), the proposed algorithm leverages both within-layer smoothness and cross-layer consistency. This essentially allows nodes from different layers to be ranked accordingly. The multi-layered HITS is formulated as a regularized optimization problem with a non-negativity constraint and solved by an iterative update process. Extensive experimental evaluations demonstrate the effectiveness and explainability of the proposed algorithm.
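For readers unfamiliar with authority and hub scores, the sketch below shows the classic single-layer HITS iteration on a small directed graph. It does not implement the multi-layered, regularized formulation proposed in the thesis; the function name, the edge-list input format, and the iteration count are assumptions made for illustration.

```python
def hits(edges, num_iterations=50):
    """Classic single-layer HITS on a list of directed (source, target) edges.

    Returns two dicts: authority scores and hub scores, each L2-normalized.
    """
    nodes = sorted({node for edge in edges for node in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(num_iterations):
        # Authority of a node: sum of hub scores of the nodes that link to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]

        # Hub score of a node: sum of authority scores of the nodes it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]

        # Normalize so the scores stay bounded across iterations.
        auth_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        hub_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {n: v / auth_norm for n, v in new_auth.items()}
        hub = {n: v / hub_norm for n, v in new_hub.items()}

    return auth, hub


# Hypothetical toy graph: "a" links to "c" and "d", "b" links to "c".
authority, hub_score = hits([("a", "c"), ("b", "c"), ("a", "d")])
```

In this toy graph, "c" ends up with the highest authority because two nodes point to it, and "a" ends up as the strongest hub because it points to both authorities.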
Contributors: Yu, Haichao (Author) / Tong, Hanghang (Thesis advisor) / He, Jingrui (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Handwritten documents have gained popularity in various domains including education and business. A key task in analyzing a complex document is to distinguish between various content types such as text, math, graphics, tables and so on. For example, one such content type could be a region of the document containing a mathematical expression; in this case, the label would be math. This differentiation facilitates the performance of specific recognition tasks depending on the content type. We hypothesize that the recognition accuracy of the subsequent tasks, such as text, math, and shape recognition, will increase, further leading to a better analysis of the document.

Content detection on handwritten documents assigns a particular class to a homogeneous portion of the document. To complete this task, a set of handwritten solutions was digitally collected from middle school students located in two different geographical regions in 2017 and 2018. This research discusses the methods to collect, pre-process and detect content type in the collected handwritten documents. A total of 4049 documents were extracted in image and JSON formats and were labelled using object-labelling software with the tags text, math, diagram, cross out, table, graph, tick mark, arrow, and doodle. The labelled images were fed to TensorFlow’s object detection API to train a neural network model. We show our results from two neural network models: Faster Region-based Convolutional Neural Network (Faster R-CNN) and the Single Shot Detection (SSD) model.
Contributors: Faizaan, Shaik Mohammed (Author) / VanLehn, Kurt (Thesis advisor) / Cheema, Salman Shaukat (Thesis advisor) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
In recent years, deep learning systems have outperformed traditional machine learning systems in most domains. There has been a lot of research recently in the field of hand gesture recognition using wearable sensors due to the numerous advantages these systems have over vision-based ones. However, due to the lack of extensive datasets and the nature of the Inertial Measurement Unit (IMU) data, there are difficulties in applying deep learning techniques to them. Although many machine learning models have good accuracy, most of them assume that training data is available for every user, while other works that do not require user data have lower accuracies. MirrorGen is a technique that uses wearable sensor data to generate synthetic videos of hand movements, and it mitigates the traditional challenges of vision-based recognition such as occlusion, lighting restrictions, lack of viewpoint variations, and environmental noise. In addition, MirrorGen allows for user-independent recognition involving minimal human effort during data collection. It also helps leverage the advances in vision-based recognition by using techniques like optical flow extraction and 3D convolution. Projecting the orientation (IMU) information to a video helps in gaining position information of the hands. To validate these claims, we perform entropy analysis on various configurations such as raw data, stick model, hand model and real video. The human hand model is found to have an optimal entropy that helps in achieving user-independent recognition. It also serves as a pervasive option as opposed to video-based recognition. An average user-independent recognition accuracy of 99.03% was achieved for a sign language dataset with 59 different users and 20 different signs with 20 repetitions each, for a total of 23k training instances. Moreover, synthetic videos can be used to augment real videos to improve recognition accuracy.
Contributors: Ramesh, Arun Srivatsa (Author) / Gupta, Sandeep K S (Thesis advisor) / Banerjee, Ayan (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Mixed reality mobile platforms co-locate virtual objects with physical spaces, creating immersive user experiences. To create visual harmony between virtual and physical spaces, the virtual scene must be accurately illuminated with realistic physical lighting. To this end, a system was designed that Generates Light Estimation Across Mixed-reality (GLEAM) devices to continually sense realistic lighting of a physical scene in all directions. GLEAM optionally operates across multiple mobile mixed-reality devices to leverage collaborative multi-viewpoint sensing for improved estimation. The system implements policies that prioritize resolution, coverage, or update interval of the illumination estimation depending on the situational needs of the virtual scene and physical environment.

To evaluate the runtime performance and perceptual efficacy of the system, GLEAM was implemented on the Unity 3D Game Engine. The implementation was deployed on Android and iOS devices. On these implementations, GLEAM can prioritize dynamic estimation with update intervals as low as 15 ms or prioritize high spatial quality with update intervals of 200 ms. User studies across 99 participants and 26 scene comparisons reported a preference towards GLEAM over other lighting techniques in 66.67% of the presented augmented scenes and indifference in 12.57% of the scenes. A controlled lighting user study on 18 participants revealed a general preference for policies that strike a balance between resolution and update rate.
Contributors: Prakash, Siddhant (Author) / LiKamWa, Robert (Thesis advisor) / Yang, Yezhou (Thesis advisor) / Hansford, Dianne (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Virtual digital assistants are automated software systems which assist humans by understanding natural languages such as English, either in voice or textual form. In recent times, many digital applications have shifted towards providing a user experience through a natural language interface. The change is brought about by the ease with which virtual digital assistants such as Google Assistant and Amazon Alexa can be integrated into an application. These assistants make use of a Natural Language Understanding (NLU) system which acts as an interface to translate unstructured natural language data into a structured form. Such an NLU system uses an intent-finding algorithm which gives a high-level idea or meaning of a user query, a step termed intent classification. The intent classification step identifies the action(s) that a user wants the assistant to perform. It is followed by an entity recognition step, which identifies the entities in the utterance on which the intended action is performed. This step can be viewed as a sequence labeling task which maps an input word sequence into a corresponding sequence of slot labels. This step is also termed slot filling.

In this thesis, we improve intent classification and slot filling in virtual voice agents through automatic data augmentation. Spoken Language Understanding systems face the issue of data sparsity. The reason behind this is that it is hard for human-created training samples to represent all the patterns in the language. Due to the lack of relevant data, deep learning methods are unable to generalize the Spoken Language Understanding model. This thesis expounds a way to overcome the issue of data sparsity in deep learning approaches to Spoken Language Understanding tasks. We describe the limitations of current intent classifiers and how the proposed algorithm uses existing knowledge bases to overcome those limitations. The method helps in creating a more robust intent classifier and slot filling system.
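To make the structured output of intent classification and slot filling concrete, the sketch below shows one annotated utterance and a helper that groups per-token slot labels into slot values. The BIO tagging scheme, the intent name `book_flight`, and the slot names are hypothetical choices for illustration and are not taken from the thesis.

```python
def extract_slots(tokens, tags):
    """Group BIO-tagged tokens into (slot_label, text) pairs.

    "B-x" starts a slot of type x, "I-x" continues it, and "O" marks a
    token outside any slot. An "I-x" tag that does not continue the
    current slot is ignored in this simple sketch.
    """
    slots = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_label is not None:
                slots.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            if current_label is not None:
                slots.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        slots.append((current_label, " ".join(current_tokens)))
    return slots


# Hypothetical annotated utterance: one intent label plus one slot tag per token.
example = {
    "intent": "book_flight",
    "tokens": ["book", "a", "flight", "to", "new", "york", "tomorrow"],
    "tags": ["O", "O", "O", "O", "B-destination", "I-destination", "B-date"],
}

slots = extract_slots(example["tokens"], example["tags"])
# slots == [("destination", "new york"), ("date", "tomorrow")]
```

The intent is a single label for the whole utterance, while slot filling assigns one tag per token, which is why the latter is treated as sequence labeling.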
Contributors: Garg, Prashant (Author) / Baral, Chitta (Thesis advisor) / Kumar, Hemanth (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created: 2018