Multimodal Representation Learning for Visual Reasoning and Text-to-Image Translation

Description

Multimodal Representation Learning is a multi-disciplinary research field which aims to integrate information from multiple communicative modalities in a meaningful manner to help solve some downstream task. These modalities can be visual, acoustic, linguistic, haptic, etc. The interpretation of ‘meaningful integration of information from different modalities’ remains modality- and task-dependent. The downstream task can range from understanding one modality in the presence of information from other modalities to translating input from one modality to another. In this thesis, the utility of multimodal representation learning was investigated for both kinds of task: understanding one modality given corresponding information in other modalities, specifically Image Understanding for Visual Reasoning, and translating from one modality to another, specifically Text-to-Image Translation.

Visual Reasoning has been an active area of research in computer vision. It encompasses advanced image processing and artificial intelligence techniques to locate, characterize and recognize objects, regions and their attributes in an image in order to comprehend the image itself. One way of building a visual reasoning system is to ask the system to answer questions about the image that require attribute identification, counting, comparison, multi-step attention, and reasoning. An intelligent system is thought to have a proper grasp of the image if it can answer such questions correctly and provide valid reasoning for its answers. In this work, how such a system can be built by learning a multimodal representation between the image and the questions was investigated. How background knowledge, specifically scene-graph information, can be incorporated into existing image understanding models when it is available was also demonstrated.

Multimodal learning provides an intuitive way of learning a joint representation between different modalities. Such a joint representation can be used to translate from one modality to the other. It also opens the way to learning a shared representation between these varied modalities and to giving meaning to what this shared representation should capture. In this work, using the surrogate task of text-to-image translation, neural network based architectures for learning a shared representation between these two modalities were investigated. It is also proposed that such a shared representation is capable of capturing parts of different modalities that are equivalent in some sense. Specifically, given an image and a semantic description of certain objects present in the image, a shared representation between the text and image modalities that captures the parts of the image mentioned in the text was demonstrated. This capability was showcased on a publicly available dataset.

Contributors: Saha, Rudra (Author) / Yang, Yezhou (Thesis advisor) / Singh, Maneesh Kumar (Committee member) / Baral, Chitta (Committee member) / Arizona State University (Publisher)

Created: 2018

Identifying critical regions for robot planning using convolutional neural networks

Description

In this thesis, a new approach to learning-based planning is presented where critical regions of an environment with low probability measure are learned from a given set of motion plans. Critical regions are learned using convolutional neural networks (CNN) to improve sampling processes for motion planning (MP).

In addition to an identification network, a new sampling-based motion planner, Learn and Link, is introduced. This planner leverages critical regions to overcome the limitations of uniform sampling while still maintaining guarantees of correctness inherent to sampling-based algorithms. Learn and Link is evaluated against planners from the Open Motion Planning Library (OMPL) on an extensive suite of challenging navigation planning problems. This work shows that critical areas of an environment are learnable, and can be used by Learn and Link to solve MP problems with far less planning time than existing sampling-based planners.
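One way to picture how critical regions might bias a sampler without giving up uniform coverage is a simple mixture: with some probability draw from a learned region, otherwise draw uniformly from the whole space. The sketch below is only an illustration under that assumption. It is not the Learn and Link algorithm, and the names `sample_config`, `bounds`, `critical_regions`, and `bias` are hypothetical, with critical regions simplified to axis-aligned boxes.

```python
import random

def sample_config(bounds, critical_regions, bias, rng):
    """Draw one configuration from a mixture sampler (illustrative assumption).

    bounds           -- sequence of (low, high) pairs, one per dimension
    critical_regions -- list of boxes, each shaped like `bounds`
    bias             -- probability of sampling inside a critical region
    rng              -- a random.Random instance
    """
    if critical_regions and rng.random() < bias:
        box = rng.choice(critical_regions)   # focus on a learned region
    else:
        box = bounds                         # fall back to uniform sampling
    return tuple(rng.uniform(low, high) for low, high in box)

# Hypothetical 2-D workspace with one narrow doorway marked as critical.
workspace = ((0.0, 10.0), (0.0, 10.0))
doorway = ((4.5, 5.5), (4.8, 5.2))
rng = random.Random(0)
samples = [sample_config(workspace, [doorway], 0.5, rng) for _ in range(200)]
```

Because the uniform branch is taken with non-zero probability whenever `bias` is below 1, every part of the space can still be sampled, which is the usual argument for why such mixtures keep the guarantees of uniform sampling-based planners.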

Contributors: Molina, Daniel, M.S (Author) / Srivastava, Siddharth (Thesis advisor) / Li, Baoxin (Committee member) / Zhang, Yu (Committee member) / Arizona State University (Publisher)

Created: 2019

Decision Support for Crew Scheduling using Automated Planning

Description

Allocating tasks for a day's or week's schedule is known to be a challenging problem. The problem becomes many times harder in multi-agent settings. A planner or group of planners who decide such a task association schedule must have a comprehensive perspective on (1) the entire array of tasks to be scheduled, (2) constraints such as the importance and order of tasks, and (3) the individual abilities of the operators. One example of this kind of scheduling is the crew scheduling done for astronauts who will spend time at the International Space Station (ISS). The schedule for the crew of the ISS is decided before the mission starts. Human planners take part in the decision-making process to determine the timing of activities for multiple days for multiple crew members at the ISS. Given the unpredictability of individual assignments and limitations identified with the various operators, deciding upon a satisfactory timetable is a challenging task. The objective of the current work is to develop an automated decision assistant that would assist human planners in coming up with an acceptable task schedule for the crew. At the same time, the decision assistant will also ensure that human planners are always in the driver's seat throughout this process of decision-making.

The decision assistant will make use of automated planning technology to assist human planners. The guidelines of Naturalistic Decision Making (NDM) and Human-in-the-Loop decision making were followed to make sure that the human is always in the driver's seat. The use cases considered are standard situations which come up during decision-making in crew scheduling. The effectiveness of automated decision assistance was evaluated by setting it up for domain experts on a comparable domain of scheduling courses for master's students. The results of the user study evaluating the effectiveness of automated decision support were subsequently published.

Contributors: Mishra, Aditya Prasad (Author) / Kambhampati, Subbarao (Thesis advisor) / Chiou, Erin (Committee member) / Demakethepalli Venkateswara, Hemanth Kumar (Committee member) / Arizona State University (Publisher)

Created: 2019

Analyzing the effect of an "open learner model" represented through a feedback system in a teachable agent system

Description

For this master's thesis, an open learner model is integrated with Quinn, a teachable robotic agent developed at Arizona State University. This system is represented as a feedback system, which aims to improve a student’s understanding of a subject. It also helps to understand the effect of the learner model when it is represented by performance of the teachable agent. The feedback system represents performance of the teachable agent, and not of a student. Data in the feedback system is thus updated according to a student's understanding of the subject. This provides students an opportunity to enhance their understanding of a subject by analyzing their performance. To test the effectiveness of the feedback system, student understanding in two different conditions is analyzed. In the first condition a feedback report is not provided to the students, while in the second condition the feedback report is provided in the form of the agent’s performance.

Contributors: Upadhyay, Abha (Author) / Walker, Erin (Thesis advisor) / Nelson, Brian (Committee member) / Amresh, Ashish (Committee member) / Arizona State University (Publisher)

Created: 2016

Approximate neural networks for speech applications in resource-constrained environments

Description

Speech recognition and keyword detection are becoming increasingly popular applications for mobile systems. While deep neural network (DNN) implementations of these systems have very good performance, they have large memory and compute resource requirements, making their implementation on a mobile device quite challenging. In this thesis, techniques to reduce the memory and computation cost of keyword detection and speech recognition networks (or DNNs) are presented.

The first technique is based on representing all weights and biases by a small number of bits and mapping all nodal computations into fixed-point ones with minimal degradation in accuracy. Experiments conducted on the Resource Management (RM) database show that for the keyword detection neural network, representing the weights by 5 bits results in a 6-fold reduction in memory compared to a floating-point implementation with very little loss in performance. Similarly, for the speech recognition neural network, representing the weights by 6 bits results in a 5-fold reduction in memory while maintaining an error rate similar to a floating-point implementation. Additional reduction in memory is achieved by a technique called weight pruning, where the weights are classified as sensitive and insensitive and the sensitive weights are represented with higher precision. A combination of these two techniques helps reduce the memory footprint by 81% and 84% for the speech recognition and keyword detection networks, respectively.
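To make the low-bit weight representation concrete, the following is a minimal sketch of uniform symmetric quantization. It is an illustration only, not the quantization scheme used in the thesis. The function names and the 32-bit floating-point baseline are assumptions, although that baseline is at least consistent with the reported figures, since 32/5 is about 6 and 32/6 is about 5.

```python
def quantize_weights(weights, num_bits):
    """Map floats to signed integer codes using num_bits bits (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    levels = 2 ** (num_bits - 1) - 1          # e.g. 15 positive levels for 5 bits
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [max(-levels, min(levels, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from the integer codes."""
    return [c * scale for c in codes]

def memory_reduction(num_bits, baseline_bits=32):
    """Ratio of baseline storage to quantized storage per weight."""
    return baseline_bits / num_bits

# Hypothetical weights, quantized to 5 bits.
weights = [0.8, -0.31, 0.05, -1.2, 0.47]
codes, scale = quantize_weights(weights, 5)
approx = dequantize(codes, scale)
```

Each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`), which is the source of the small accuracy loss that such schemes trade for memory.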

Further reduction in memory size is achieved by judiciously dropping connections for large blocks of weights. The corresponding technique, termed coarse-grain sparsification, introduces hardware-aware sparsity during DNN training, which leads to efficient weight memory compression and significant reduction in the number of computations during classification without loss of accuracy. Keyword detection and speech recognition DNNs trained with 75% of the weights dropped and classified with 5-6 bit weight precision effectively reduced the weight memory requirement by ~95% compared to a fully-connected network with double precision, while showing similar performance in keyword detection accuracy and word error rate.
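Dropping weights in whole blocks rather than individually can be sketched as a block-structured mask over the weight matrix. The example below is a rough illustration under stated assumptions: the blocks to drop are chosen at random with a fixed seed, and the names `block_mask`, `block`, and `drop_fraction` are hypothetical. The abstract does not say how blocks are selected.

```python
import random

def block_mask(rows, cols, block, drop_fraction, seed=0):
    """Build a 0/1 mask that zeroes whole block x block tiles (assumed scheme)."""
    if rows % block or cols % block:
        raise ValueError("matrix dimensions must be multiples of the block size")
    tiles = [(i, j) for i in range(rows // block) for j in range(cols // block)]
    rng = random.Random(seed)
    dropped = set(rng.sample(tiles, int(round(drop_fraction * len(tiles)))))
    return [
        [0 if (r // block, c // block) in dropped else 1 for c in range(cols)]
        for r in range(rows)
    ]

# An 8x8 weight matrix split into 2x2 tiles, with 75% of the tiles dropped.
mask = block_mask(8, 8, 2, 0.75)
kept = sum(sum(row) for row in mask)
```

Because only whole tiles survive, a hardware implementation needs to store one index per kept tile instead of one per kept weight, which is what makes this kind of sparsity cheap to exploit.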

Contributors: Arunachalam, Sairam (Author) / Chakrabarti, Chaitali (Thesis advisor) / Seo, Jae-Sun (Thesis advisor) / Cao, Yu (Committee member) / Arizona State University (Publisher)

Created: 2016

Fixed verse generation using neural word embeddings

Description

For the past three decades, the design of an effective strategy for generating poetry that matches human creative capabilities and complexities has been an elusive goal in artificial intelligence (AI) and natural language generation (NLG) research, and among linguistic creativity researchers in particular. This thesis presents a novel approach to fixed verse poetry generation using neural word embeddings. During the course of generation, a two-layered poetry classifier is developed. The first layer uses a lexicon-based method to classify poems into types based on form and structure, and the second layer uses a supervised classification method to classify poems into subtypes based on content with an accuracy of 92%. The system then uses a two-layer neural network to generate poetry based on word similarities and word movements in a 50-dimensional vector space.

The verses generated by the system are evaluated using rhyme, rhythm, syllable counts and stress patterns. These computational features of language are considered for generating haikus, limericks and iambic pentameter verses. The generated poems are evaluated using a Turing test on both experts and non-experts. The user study finds that only 38% of the computer-generated poems were correctly identified by non-experts, while 65% of the computer-generated poems were correctly identified by experts. Although the system does not pass the Turing test, the results suggest an improvement of over 17% when compared to previous methods which use Turing tests to evaluate poetry generators.

Contributors: Magge, Arjun (Author) / Syrotiuk, Violet R. (Thesis advisor) / Baral, Chitta (Committee member) / Hogue, Cynthia (Committee member) / Bazzi, Rida (Committee member) / Arizona State University (Publisher)

Created: 2016

Real time estimation and prediction of similarity in human activity using factor oracle algorithm

Description

Human motion is defined as an amalgamation of several physical traits such as bipedal locomotion, posture and manual dexterity, and mental expectation. In addition to the “positive” body form defined by these traits, casting light on the body produces a “negative” of the body: its shadow. We often use “silhouette” interchangeably with “shadow” to emphasize indifference to interior features. In a manner of speaking, the shadow is an alter ego that imitates the individual.

The principal value of shadow is its non-invasive behaviour of reflecting precisely the actions of the individual it is attached to. Nonetheless we can still think of the body’s shadow not as the body but its alter ego.

Based on this premise, my thesis creates an experiential system that extracts the data related to the contour of your human shape and gives it a texture and life of its own, so as to emulate your movements and postures, and to be your extension. In technical terms, my thesis extracts an abstraction from a pre-indexed database, which could be generated from an offline data set or in real time, to complement the actions of a user in front of a low-cost optical motion capture device like the Microsoft Kinect. This notion could be the system’s interpretation of the action, which creates modularized art through the abstraction’s ‘similarity’ to the live action.

Through my research, I have developed a stable system that tackles various connotations associated with shadows and the need to determine the ideal features that contribute to the relevance of the actions performed. The implication of Factor Oracle [3] pattern interpretation is tested with a feature bin of videos. The system is also flexible enough to use several Nearest Neighbour search methods and a machine learning module to derive the same output. The overall purpose is to establish this in real time and provide constant feedback to the user. This can be expanded to handle larger dynamic data.
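For readers unfamiliar with the factor oracle, the sketch below shows the standard online construction over a sequence of discrete symbols. How the thesis maps motion features to symbols is not stated in the abstract, so the toy integer sequence here is purely an assumption (it could stand for cluster indices of pose features), and the function names are illustrative.

```python
def build_factor_oracle(sequence):
    """Build a factor oracle incrementally, one symbol at a time."""
    transitions = [{}]   # transitions[state][symbol] -> next state
    supply = [-1]        # supply (suffix) link per state; -1 for the initial state
    for i, symbol in enumerate(sequence):
        transitions.append({})
        transitions[i][symbol] = i + 1            # internal transition i -> i+1
        k = supply[i]
        while k > -1 and symbol not in transitions[k]:
            transitions[k][symbol] = i + 1        # external (forward) transition
            k = supply[k]
        supply.append(0 if k == -1 else transitions[k][symbol])
    return transitions, supply

def accepts(transitions, query):
    """Return True if the oracle can read `query` starting from state 0."""
    state = 0
    for symbol in query:
        if symbol not in transitions[state]:
            return False
        state = transitions[state][symbol]
    return True

# Hypothetical stream of quantized pose symbols.
stream = [0, 1, 1, 1, 0, 0, 1]
oracle, links = build_factor_oracle(stream)
```

The oracle has one state per input symbol plus an initial state and accepts every contiguous subsequence of the input. It may also accept a few sequences that never occurred, which is the price of its compact, incrementally built structure.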

In addition to estimating human actions, my thesis attempts to test various Nearest Neighbour search methods in real time depending upon the data stream. This provides a basis to understand varying parameters that complement human activity recognition and feature matching in real time.

Contributors: Seshasayee, Sudarshan Prashanth (Author) / Sha, Xin Wei (Thesis advisor) / Turaga, Pavan (Thesis advisor) / Tinapple, David A (Committee member) / Arizona State University (Publisher)

Created: 2016

Attack detection for cyber systems and probabilistic state estimation in partially observable cyber environments

Description

Detecting cyber-attacks in cyber systems is essential for protecting cyber infrastructures. Such attacks are very difficult to detect because of the high complexity of cyber systems. The accuracy of attack detection depends heavily on the completeness of the collected sensor information. In this thesis, two approaches are presented: one for detecting attacks in completely observable cyber systems, and the other for estimating types of states in partially observable cyber systems as a step towards attack detection. These two approaches are illustrated using three large data sets of network traffic because the packet-level information of the network traffic data provides details about the cyber systems.

The approach to attack detection in cyber systems is based on a multimodal artificial neural network (MANN) using the collected network traffic data from completely observable cyber systems for training and testing. Since the training of the MANN is computationally intensive, an efficient feature selection algorithm using a genetic algorithm is developed and incorporated in this approach to reduce the computational overhead.

In order to detect attacks in cyber systems in partially observable environments, an approach to estimating the types of states in partially observable cyber systems, which is the first phase of attack detection in such environments, is presented. The types of states of such cyber systems are useful for detecting cyber-attacks in them. This approach involves the use of a convolutional neural network (CNN) and unsupervised learning with the elbow method and the k-means clustering algorithm.

Contributors: Guha, Sayantan (Author) / Yau, Stephen S. (Thesis advisor) / Ahn, Gail-Joon (Committee member) / Huang, Dijiang (Committee member) / Arizona State University (Publisher)

Created: 2016

A computational approach to relative image aesthetics

Description

Computational visual aesthetics has recently become an active research area. Existing state-of-the-art methods formulate this as a binary classification task where a given image is predicted to be beautiful or not. In many applications such as image retrieval and enhancement, it is more important to rank images based on their aesthetic quality than to categorize them into two classes. Furthermore, in such applications, it may be possible that all images belong to the same category. Hence determining the aesthetic ranking of the images is more appropriate. To this end, a novel problem of ranking images with respect to their aesthetic quality is formulated in this work. A new data-set of image pairs with relative labels is constructed by carefully selecting images from the popular AVA data-set. Unlike in aesthetics classification, there is no single threshold which would determine the ranking order of the images across the entire data-set.

This problem is attempted using a deep neural network based approach that is trained on image pairs by incorporating principles from relative learning. Results show that such a relative training procedure allows the network to rank the images with a higher accuracy than a state-of-the-art network trained on the same set of images using binary labels. Further analysis of the results shows that a model trained on image pairs learns better aesthetic features than one trained on the same number of individual binary-labelled images.
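Training on pairs with relative labels is commonly expressed through a pairwise ranking loss. The sketch below uses a margin-based hinge loss as one such choice; the abstract does not specify the loss actually used, so both the loss form and the names `pairwise_hinge_loss` and `pairwise_accuracy` are assumptions made for illustration.

```python
def pairwise_hinge_loss(score_better, score_worse, margin=1.0):
    """Penalise a pair unless the better image outscores the worse by `margin`."""
    return max(0.0, margin - (score_better - score_worse))

def pairwise_accuracy(scored_pairs):
    """Fraction of (better, worse) score pairs that are ordered correctly."""
    correct = sum(1 for better, worse in scored_pairs if better > worse)
    return correct / len(scored_pairs)

# Hypothetical network scores for four labelled image pairs,
# each given as (score of the preferred image, score of the other image).
pairs = [(2.0, 0.5), (0.5, 2.0), (1.0, 0.0), (3.0, 1.0)]
losses = [pairwise_hinge_loss(b, w) for b, w in pairs]
accuracy = pairwise_accuracy(pairs)
```

A loss of this kind depends only on score differences within a pair, which fits the observation that no single threshold orders images across the whole data-set.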

Additionally, an attempt is made at enhancing the performance of the system by incorporating saliency-related information. Given an image, humans might fixate their vision on particular parts of the image to which they are subconsciously drawn. I therefore tried to utilize the saliency information both stand-alone and in combination with the global and local aesthetic features by performing two separate sets of experiments. In both cases, a standard saliency model is chosen and the generated saliency maps are convolved with the images prior to passing them to the network, thus giving higher importance to the salient regions than to the remaining ones. The saliency-images generated in this way are either used independently or along with the global and local features to train the network. Empirical results show that the saliency-related aesthetic features might already be learnt by the network as a sub-set of the global features from automatic feature extraction, suggesting that the additional saliency module is redundant.

Contributors: Gattupalli, Jaya Vijetha (Author) / Li, Baoxin (Thesis advisor) / Davulcu, Hasan (Committee member) / Liang, Jianming (Committee member) / Arizona State University (Publisher)

Created: 2016

Bi-manual learning for a basketball playing robot

Description

Sports activities have been a cornerstone in the evolution of humankind through the ages, from the ancient Roman empire to the Olympics in the 21st century. These activities have been used as a benchmark to evaluate how humans have progressed through the sands of time. In the 21st century, machines, with the help of powerful computing and relatively new computing paradigms, have made a good case for taking up the mantle. Even though machines have been able to perform complex tasks and maneuvers, they have struggled to match the dexterity, coordination, manipulability and acuteness displayed by humans. Bi-manual tasks are more complex and bring additional variables like coordination into the task, making it harder to evaluate.

A task capable of demonstrating the above skillset would be a good measure of the progress in the field of robotic technology. Therefore, a dual-armed robot has been built and taught to handle the ball and make the basket successfully, thus demonstrating the capability of using both arms. A combination of machine learning techniques, reinforcement learning, and imitation learning has been used along with advanced optimization algorithms to accomplish the task.

Contributors: Kalige, Nikhil (Author) / Amor, Heni Ben (Thesis advisor) / Shrivastava, Aviral (Committee member) / Zhang, Yu (Committee member) / Arizona State University (Publisher)

Created: 2016
