Multimodal Representation Learning for Visual Reasoning and Text-to-Image Translation

Description

Multimodal representation learning is a multidisciplinary research field that aims to integrate information from multiple communicative modalities in a meaningful manner to help solve a downstream task. These modalities can be visual, acoustic, linguistic, haptic, etc. What counts as ‘meaningful integration of information from different modalities’ remains modality and task dependent. The downstream task can range from understanding one modality in the presence of information from other modalities to translating input from one modality to another. This thesis investigates the utility of multimodal representation learning both for understanding one modality given corresponding information in other modalities, namely image understanding for visual reasoning, and for translating from one modality to another, specifically text-to-image translation.

Visual reasoning has been an active area of research in computer vision. It encompasses advanced image processing and artificial intelligence techniques to locate, characterize, and recognize objects, regions, and their attributes in an image in order to comprehend the image itself. One way of building a visual reasoning system is to ask the system questions about the image that require attribute identification, counting, comparison, multi-step attention, and reasoning. An intelligent system is thought to have a proper grasp of the image if it can answer such questions correctly and provide valid reasoning for its answers. This work investigates how such a system can be built by learning a multimodal representation between the image and the questions. It also demonstrates how background knowledge, specifically scene-graph information, can be incorporated into existing image understanding models when it is available.

Multimodal learning provides an intuitive way of learning a joint representation between different modalities. Such a joint representation can be used to translate from one modality to another. It also opens the way to learning a shared representation between these modalities and to specifying what that shared representation should capture. In this work, neural network-based architectures for learning a shared representation between text and images were investigated, using text-to-image translation as a surrogate task. The work also proposes that such a shared representation can capture parts of different modalities that are equivalent in some sense. Specifically, given an image and a semantic description of certain objects present in the image, a shared representation between the text and image modalities was shown to capture the parts of the image mentioned in the text. This capability was showcased on a publicly available dataset.

Contributors: Saha, Rudra (Author) / Yang, Yezhou (Thesis advisor) / Singh, Maneesh Kumar (Committee member) / Baral, Chitta (Committee member) / Arizona State University (Publisher)

Created: 2018

Advances in Motion Estimators for Applications in Computer Vision

Description

Motion estimation is a core task in computer vision and many applications utilize optical flow methods as fundamental tools to analyze motion in images and videos. Optical flow is the apparent motion of objects in image sequences that results from relative motion between the objects and the imaging perspective. Today, optical flow fields are utilized to solve problems in various areas such as object detection and tracking, interpolation, visual odometry, etc. In this dissertation, three problems from different areas of computer vision and the solutions that make use of modified optical flow methods are explained.

The contributions of this dissertation are approaches and frameworks that introduce i) a new optical flow-based interpolation method to achieve minimally divergent velocimetry data, ii) a framework that improves the accuracy of change detection algorithms in synthetic aperture radar (SAR) images, and iii) a set of new methods to integrate Proton Magnetic Resonance Spectroscopy (1H-MRSI) data into three-dimensional (3D) neuronavigation systems for tumor biopsies.

In the first application an optical flow-based approach for the interpolation of minimally divergent velocimetry data is proposed. The velocimetry data of incompressible fluids contain signals that describe the flow velocity. The approach uses the additional flow velocity information to guide the interpolation process towards reduced divergence in the interpolated data.
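The notion of "divergence" in velocimetry data can be made concrete with a small numerical sketch. The code below is not the dissertation's interpolation method; it only illustrates, under the assumption of a 2D velocity field sampled on a regular grid, how the divergence that the interpolation tries to keep small could be measured. The function name `divergence_2d` and the test fields are hypothetical.

```python
import numpy as np

def divergence_2d(u, v, dx=1.0, dy=1.0):
    """Finite-difference divergence du/dx + dv/dy of a 2D velocity field.

    u, v: arrays of shape (ny, nx) holding the x- and y-velocity components.
    dx, dy: grid spacing along x (axis 1) and y (axis 0).
    """
    du_dx = np.gradient(u, dx, axis=1)
    dv_dy = np.gradient(v, dy, axis=0)
    return du_dx + dv_dy

# Illustrative fields on a small grid (not data from the dissertation).
y, x = np.mgrid[0:5, 0:5].astype(float)
rotation = divergence_2d(-y, x)   # solid-body rotation is divergence-free
expansion = divergence_2d(x, y)   # radial expansion has constant divergence
```

For an incompressible fluid the true field is divergence-free, so an interpolated field whose divergence stays near zero, as in the rotation case, is physically more plausible than one that behaves like the expansion case.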

In the second application, a framework consisting mainly of optical flow methods, together with other image processing and computer vision techniques, is proposed to improve object extraction from synthetic aperture radar images. The framework distinguishes actual motion from motion detected because of misregistration in SAR image sets, which can lead to more accurate and meaningful change detection and improved object extraction from SAR datasets.

In the third application, a set of new methods is proposed that aims to improve upon the current state of the art in neuronavigation through the use of detailed three-dimensional (3D) 1H-MRSI data. The result is a progressive form of online MRSI-guided neuronavigation that is demonstrated through phantom validation and clinical application.

Contributors: Kanberoglu, Berkay (Author) / Frakes, David (Thesis advisor) / Turaga, Pavan (Thesis advisor) / Spanias, Andreas (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created: 2018

Computational Modeling and Analysis of Symmetry in Human Movements

Description

Human movement is a complex process influenced by physiological and psychological factors. The execution of movement varies from person to person, and the number of possible strategies for completing a specific movement task is almost infinite. Different choices of strategies can be perceived by humans as having different degrees of quality, and the quality can be defined with regard to aesthetic, athletic, or health-related ratings. It is useful to measure and track the quality of a person's movements for various applications, especially with the prevalence of low-cost and portable cameras and sensors today. Furthermore, based on such measurements, feedback systems can be designed for people to practice their movements towards certain goals. In this dissertation, I introduce symmetry as a family of measures for movement quality, and utilize recent advances in computer vision and differential geometry to model and analyze different types of symmetry in human movements. Movements are modeled as trajectories on different types of manifolds, according to the representations of movements from sensor data. The benefit of such a universal framework is that it can accommodate different existing and future features that describe human movements. The theory and tools developed in this dissertation will also be useful in other scientific areas to analyze symmetry from high-dimensional signals.

Contributors: Wang, Qiao (Author) / Turaga, Pavan (Thesis advisor) / Spanias, Andreas (Committee member) / Srivastava, Anuj (Committee member) / Sha, Xin Wei (Committee member) / Arizona State University (Publisher)

Created: 2018

Generating Light Estimation for Mixed-reality Devices through Collaborative Visual Sensing

Description

Mixed reality mobile platforms co-locate virtual objects with physical spaces, creating immersive user experiences. To create visual harmony between virtual and physical spaces, the virtual scene must be accurately illuminated with realistic physical lighting. To this end, a system was designed that Generates Light Estimation Across Mixed-reality (GLEAM) devices to continually sense realistic lighting of a physical scene in all directions. GLEAM optionally operates across multiple mobile mixed-reality devices to leverage collaborative multi-viewpoint sensing for improved estimation. The system implements policies that prioritize resolution, coverage, or update interval of the illumination estimation depending on the situational needs of the virtual scene and physical environment.

To evaluate the runtime performance and perceptual efficacy of the system, GLEAM was implemented on the Unity 3D Game Engine. The implementation was deployed on Android and iOS devices. On these implementations, GLEAM can prioritize dynamic estimation with update intervals as low as 15 ms or prioritize high spatial quality with update intervals of 200 ms. User studies across 99 participants and 26 scene comparisons reported a preference towards GLEAM over other lighting techniques in 66.67% of the presented augmented scenes and indifference in 12.57% of the scenes. A controlled lighting user study on 18 participants revealed a general preference for policies that strike a balance between resolution and update rate.

Contributors: Prakash, Siddhant (Author) / LiKamWa, Robert (Thesis advisor) / Yang, Yezhou (Thesis advisor) / Hansford, Dianne (Committee member) / Arizona State University (Publisher)

Created: 2018

Improving Intent Classification By Automatic Data Augmentation Using Word Sense Disambiguation

Description

Virtual digital assistants are automated software systems that assist humans by understanding natural languages such as English, in either voice or textual form. In recent times, many digital applications have shifted towards providing a user experience through a natural language interface. The change has been brought about by the ease with which virtual digital assistants such as Google Assistant and Amazon Alexa can be integrated into an application. These assistants make use of a Natural Language Understanding (NLU) system, which acts as an interface that translates unstructured natural language data into a structured form. Such an NLU system uses an intent-finding algorithm that gives a high-level idea or meaning of a user query, a step termed intent classification. The intent classification step identifies the action(s) that a user wants the assistant to perform. It is followed by an entity recognition step, which identifies the entities in the utterance on which the intended action is performed. This step can be viewed as a sequence labeling task that maps an input word sequence to a corresponding sequence of slot labels, and it is also termed slot filling.
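The structured form that intent classification and slot filling produce can be illustrated with a short sketch. The example below assumes the common BIO tagging convention for slot labels (B- begins a slot, I- continues it, O is outside any slot); the abstract does not state which labeling scheme the thesis uses, and the class `LabeledUtterance`, the function `extract_entities`, and the intent and slot names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class LabeledUtterance:
    tokens: List[str]   # input word sequence
    intent: str         # output of the intent classification step
    slots: List[str]    # one slot label per token (BIO scheme assumed)

def extract_entities(utt: LabeledUtterance) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (slot_name, text) pairs."""
    entities = []
    name, words = None, []
    for token, tag in zip(utt.tokens, utt.slots):
        if tag.startswith("I-") and name == tag[2:]:
            words.append(token)          # continue the current slot
            continue
        if name is not None:             # close the slot that just ended
            entities.append((name, " ".join(words)))
        if tag.startswith("B-"):
            name, words = tag[2:], [token]
        else:
            name, words = None, []
    if name is not None:
        entities.append((name, " ".join(words)))
    return entities

# Illustrative utterance, not taken from the thesis data.
example = LabeledUtterance(
    tokens=["book", "a", "flight", "from", "phoenix", "to", "new", "york"],
    intent="book_flight",
    slots=["O", "O", "O", "O", "B-from_city", "O", "B-to_city", "I-to_city"],
)
entities = extract_entities(example)
```

Here the intent names the action to perform, while the extracted entities supply the arguments on which that action operates.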

In this thesis, we improve intent classification and slot filling in virtual voice agents through automatic data augmentation. Spoken Language Understanding systems face the issue of data sparsity, because it is hard for human-created training samples to represent all the patterns in a language. Owing to the lack of relevant data, deep learning methods are unable to generalize the Spoken Language Understanding model. This thesis presents a way to overcome data sparsity in deep learning approaches to Spoken Language Understanding tasks. It describes the limitations of current intent classifiers and how the proposed algorithm uses existing knowledge bases to overcome them. The method helps create a more robust intent classifier and slot filling system.

Contributors: Garg, Prashant (Author) / Baral, Chitta (Thesis advisor) / Kumar, Hemanth (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created: 2018

Adaptive Lighting for Data-Driven Non-Line-Of-Sight 3D Localization

Description

Non-line-of-sight (NLOS) imaging of objects not visible to either the camera or illumination source is a challenging task with vital applications including surveillance and robotics. Recent NLOS reconstruction advances have been achieved using time-resolved measurements. Acquiring these time-resolved measurements requires expensive and specialized detectors and laser sources. This work proposes a data-driven approach for NLOS 3D localization requiring only a conventional camera and projector. The localization is posed both as a voxelization problem and as a regression problem. Accuracy of greater than 90% is achieved in localizing an NLOS object to a 5 cm × 5 cm × 5 cm volume in real data. By adopting the regression approach, an object of width 10 cm is localized to approximately 1.5 cm. To generalize to line-of-sight (LOS) scenes with non-planar surfaces, an adaptive lighting algorithm is adopted. This algorithm, based on radiosity, identifies and illuminates scene patches in the LOS which most contribute to the NLOS light paths, and can factor in system power constraints. Improvements ranging from 6% to 15% in accuracy with a non-planar LOS wall using adaptive lighting are reported, demonstrating the advantage of combining the physics of light transport with active illumination for data-driven NLOS imaging.
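The voxelization view of localization can be sketched as follows. This is only an illustration of how a 3D position might be mapped to a 5 cm voxel and then to a single class label; the thesis's actual grid extent, coordinate origin, and label encoding are not given in the abstract, so `origin_m`, `grid_shape`, and both function names are assumptions.

```python
import numpy as np

VOXEL_SIZE_M = 0.05  # 5 cm voxel edge, as stated in the abstract

def position_to_voxel(position_m, origin_m, grid_shape):
    """Map a 3D position (metres) to integer voxel indices.

    origin_m and grid_shape describe a hypothetical hidden-scene volume.
    """
    offset = np.asarray(position_m, dtype=float) - np.asarray(origin_m, dtype=float)
    index = np.floor(offset / VOXEL_SIZE_M).astype(int)
    if np.any(index < 0) or np.any(index >= np.asarray(grid_shape)):
        raise ValueError("position lies outside the voxel grid")
    return tuple(int(i) for i in index)

def voxel_to_class(index, grid_shape):
    """Flatten voxel indices to one class label for a classifier."""
    return int(np.ravel_multi_index(index, grid_shape))

# Illustrative values: an 8 x 8 x 8 grid (40 cm cube) anchored at the origin.
grid = (8, 8, 8)
voxel = position_to_voxel((0.12, 0.03, 0.26), (0.0, 0.0, 0.0), grid)
label = voxel_to_class(voxel, grid)
```

Under this framing a classifier predicts `label` from the camera image, whereas the regression formulation would predict the continuous position directly.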

Contributors: Chandran, Sreenithy (Author) / Jayasuriya, Suren (Thesis advisor) / Turaga, Pavan (Committee member) / Dasarathy, Gautam (Committee member) / Arizona State University (Publisher)

Created: 2019

Confocal Laser Endomicroscopy Image Analysis with Deep Convolutional Neural Networks

Description

Rapid intraoperative diagnosis of brain tumors is of great importance for planning treatment and guiding the surgeon about the extent of resection. Currently, the standard for preliminary intraoperative tissue analysis is frozen section biopsy, which has major limitations such as tissue freezing and cutting artifacts, sampling errors, lack of immediate interaction between the pathologist and the surgeon, and being time-consuming.

Handheld, portable confocal laser endomicroscopy (CLE) is being explored in neurosurgery for its ability to image histopathological features of tissue at cellular resolution in real time during brain tumor surgery. Over the course of examination of the surgical tumor resection, hundreds to thousands of images may be collected. The high number of images requires significant time and storage load for subsequent reviewing, which motivated several research groups to employ deep convolutional neural networks (DCNNs) to improve its utility during surgery. DCNNs have proven to be useful in natural and medical image analysis tasks such as classification, object detection, and image segmentation.

This thesis proposes using DCNNs for analyzing CLE images of brain tumors. In particular, it explores the practicality of DCNNs in three main tasks. First, off-the-shelf DCNNs were used to classify images as diagnostic or non-diagnostic. Further experiments showed that both ensemble modeling and transfer learning improved the classifier’s accuracy in evaluating the diagnostic quality of new images at the test stage. Second, a weakly-supervised learning pipeline was developed for localizing key features of diagnostic CLE images from gliomas. Third, image style transfer was used to improve the diagnostic quality of CLE images from glioma tumors by transforming the histology patterns in CLE images of fluorescein sodium-stained tissue into those of conventional hematoxylin and eosin-stained tissue slides.

These studies suggest that DCNNs are well suited to the analysis of CLE images. They may assist surgeons in sorting out non-diagnostic images, highlighting the key regions, and enhancing their appearance through pattern transformation in real time. With recent advances in deep learning such as generative adversarial networks and semi-supervised learning, new research directions should be pursued to uncover further potential of DCNNs in CLE image analysis.

Contributors: Izady Yazdanabadi, Mohammadhassan (Author) / Preul, Mark (Thesis advisor) / Yang, Yezhou (Thesis advisor) / Nakaji, Peter (Committee member) / Vernon, Brent (Committee member) / Arizona State University (Publisher)

Created: 2019

Towards Addressing Key Visual Processing Challenges in Social Media Computing

Description

Visual processing in social media platforms is a key step in gathering and understanding information in the era of the Internet and big data. Online data is rich in content, but its processing faces many challenges, including varying scales for objects of interest, unreliable and/or missing labels, the inadequacy of single-modal data, and difficulty in analyzing high-dimensional data. Towards facilitating the processing and understanding of online data, this dissertation primarily focuses on three challenges that I feel are of great practical importance: handling scale differences in computer vision tasks, such as facial component detection and face retrieval; developing efficient classifiers using partially labeled data and noisy data; and employing multi-modal models and feature selection to improve multi-view data analysis. For the first challenge, I propose a scale-insensitive algorithm to detect facial landmarks quickly and accurately. For the second challenge, I propose two algorithms that can be used to learn from partially labeled data and noisy data, respectively. For the third challenge, I propose a new framework that incorporates feature selection modules into LDA models.

Contributors: Zhou, Xu (Author) / Li, Baoxin (Thesis advisor) / Hsiao, Sharon (Committee member) / Davulcu, Hasan (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created: 2018

Time-Varying Modeling of Glottal Source and Vocal Tract and Sequential Bayesian Estimation of Model Parameters for Speech Synthesis

Description

Speech is generated by articulators acting on a phonatory source. Identifying this phonatory source and recovering the articulatory geometry are individually challenging and ill-posed problems, called speech separation and articulatory inversion, respectively. There exists a trade-off between decomposition and recovered articulatory geometry due to multiple possible mappings between an articulatory configuration and the speech produced. However, if measurements are obtained only from a microphone sensor, they lack any invasive insight and add a further challenge to an already difficult problem.

A joint non-invasive estimation strategy that couples articulatory and phonatory knowledge would lead to better articulatory speech synthesis. In this thesis, a joint estimation strategy for speech separation and articulatory geometry recovery is studied. Unlike previous periodic/aperiodic decomposition methods that use stationary speech models within a frame, the proposed model presents a non-stationary speech decomposition method. A parametric glottal source model and an articulatory vocal tract response are represented in a dynamic state space formulation. The unknown parameters of the speech generation components are estimated using sequential Monte Carlo methods under some specific assumptions. The proposed approach is compared with other glottal inverse filtering methods, including iterative adaptive inverse filtering, state-space inverse filtering, and the quasi-closed phase method.
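The general idea of sequential Monte Carlo estimation in a state space model can be sketched with a bootstrap particle filter. The sketch below is not the thesis's glottal source or vocal tract model: it assumes a toy scalar state that follows a random walk and is observed in Gaussian noise, and the function name, noise levels, and particle count are all hypothetical choices made for illustration.

```python
import numpy as np

def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=0.1, obs_std=0.5, seed=0):
    """Track a scalar random-walk state from noisy observations.

    Toy model (an assumption, not the thesis model):
        x_t = x_{t-1} + N(0, process_std^2)
        y_t = x_t     + N(0, obs_std^2)
    Returns the posterior-mean estimate of x_t at every time step.
    """
    rng = np.random.default_rng(seed)
    particles = rng.normal(0.0, 1.0, size=n_particles)  # samples from the prior
    estimates = []
    for y in observations:
        # 1) Propagate each particle through the state transition.
        particles = particles + rng.normal(0.0, process_std, size=n_particles)
        # 2) Weight particles by the observation likelihood (log domain).
        log_w = -0.5 * ((y - particles) / obs_std) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(float(np.sum(w * particles)))
        # 3) Resample (multinomial) to counter weight degeneracy.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = particles[idx]
    return np.array(estimates)

# Illustrative run: a constant observation of 1.0 over 50 steps.
estimates = bootstrap_particle_filter(np.full(50, 1.0))
```

The same propagate, weight, and resample loop applies when the state is a vector of time-varying model parameters, which is what makes the approach attractive for non-stationary signals.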

Contributors: Venkataramani, Adarsh Akkshai (Author) / Papandreou-Suppappola, Antonia (Thesis advisor) / Bliss, Daniel W (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)

Created: 2018

Topological Descriptors for Parkinson's Disease Classification and Regression Analysis

Description

At present, the vast majority of human subjects with neurological disease are still diagnosed through in-person assessments and qualitative analysis of patient data. In this paper, we propose to use Topological Data Analysis (TDA) together with machine learning tools to automate the process of Parkinson’s disease classification and severity assessment. An automated, stable, and accurate method to evaluate Parkinson’s would be significant in streamlining diagnoses of patients and providing families more time for corrective measures. We propose a methodology that incorporates TDA into the analysis of Parkinson’s disease postural shift data through the representation of persistence images. Topological descriptors of a system have proven to be invariant to small changes in data and have been shown to perform well in discrimination tasks. The contributions of the paper are twofold. We propose a method to 1) distinguish healthy subjects from those afflicted by disease and 2) assess the severity of disease. We explore the use of the proposed method in an application involving a Parkinson’s disease dataset comprising healthy-elderly, healthy-young, and Parkinson’s disease patients.

Contributors: Rahman, Farhan Nadir (Co-author) / Nawar, Afra (Co-author) / Turaga, Pavan (Thesis director) / Krishnamurthi, Narayanan (Committee member) / Electrical Engineering Program (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created: 2020-05
