
Towards Supporting Visual Question and Answering Applications

Description

Visual Question Answering (VQA) is a new research area that draws on computer vision, natural language processing, and other subfields of artificial intelligence such as knowledge representation. The fundamental task is to take as input one image and one text question related to that image, and to generate a textual answer to the question. There are two key research problems in VQA: image understanding and question answering. My research focuses on developing solutions to these two problems.

In image understanding, one important research area is semantic segmentation, which takes an image as input and outputs a label for each pixel. Because labeling a useful training set requires much manual work, typical training sets for such supervised approaches are small. There are also approaches with relaxed labeling requirements, called weakly supervised semantic segmentation, where only image-level labels are needed. With the development of social media, more and more user-uploaded images are available online. Such user-generated content often comes with labels such as tags and may be coarsely labeled by various tools. To use this information for computer vision tasks, I propose a new graphical model that considers neighborhood information and its interactions to obtain pixel-level labels for images with only incomplete image-level labels. The method was evaluated on both synthetic and real images.

In question answering, my research centers on best answer prediction and addresses two main research topics: feature design and model construction. On feature design, most existing work discusses how to design effective features for answer quality or best answer prediction. However, little work considers how to design features that capture the relationships among the answers to a given question. To fill this research gap, I designed new features that help improve prediction performance. On modeling, to exploit the structure of the feature space, I proposed an innovative learning-to-rank model based on the hierarchical lasso. Experimental comparisons with the state of the art in the best answer prediction literature have confirmed that the proposed methods are effective and suitable for solving the research task.

Contributors: Tian, Qiongjie (Author) / Li, Baoxin (Thesis advisor) / Tong, Hanghang (Committee member) / Davulcu, Hasan (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created: 2017

Compressive Visual Question Answering

Description

Compressive sensing theory allows signals and images to be sensed and reconstructed at sampling rates lower than the Nyquist rate. Applications in resource-constrained environments stand to benefit from this theory, which also opens up many possibilities for new applications. The traditional inference pipeline for computer vision first reconstructs the image from compressive measurements. However, the reconstruction process is a computationally expensive step that also provides poor results at high compression rates. There have been several successful attempts to perform inference tasks, such as activity recognition, directly on compressive measurements. In this thesis, I tackle a more challenging vision problem: visual question answering (VQA) without reconstructing the compressive images. I investigate the feasibility of this problem with a series of experiments, evaluate the proposed methods on a VQA dataset, and discuss promising results and directions for future work.
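To make the idea of compressive measurements concrete, the following is a minimal sketch in Python of the standard linear sensing model y = Φx, in which an n-pixel signal is reduced to m < n measurements. The random Gaussian matrix, the 25% measurement ratio, and the toy 64-pixel patch are illustrative assumptions and are not taken from the thesis; the sketch covers only the sensing step, not the VQA inference that the thesis performs on such measurements.

```python
import random


def measurement_matrix(m, n, seed=0):
    """Random Gaussian sensing matrix: m measurements of an n-pixel signal (assumed design)."""
    rng = random.Random(seed)
    scale = 1.0 / m ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]


def measure(phi, x):
    """Compute the compressive measurements y = phi @ x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


# Hypothetical flattened 8x8 image patch (64 pixels).
n = 64
x = [float(i % 8) for i in range(n)]

# Keep 25% as many measurements as pixels (illustrative ratio).
ratio = 0.25
m = int(ratio * n)

phi = measurement_matrix(m, n)
y = measure(phi, x)
print(len(x), "pixels ->", len(y), "measurements")
```

A downstream model that works directly on y, rather than on a reconstruction of x, would skip the expensive reconstruction step the abstract describes.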

Contributors: Huang, Li-Chin (Author) / Turaga, Pavan (Thesis advisor) / Yang, Yezhou (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)

Created: 2017

Compressive Light Field Reconstruction using Deep Learning

Description

Light field imaging is limited by the computational processing demands of high sampling in both the spatial and angular dimensions. Single-shot light field cameras sacrifice spatial resolution to sample angular viewpoints, typically by multiplexing incoming rays onto a 2D sensor array. While this resolution can be recovered using compressive sensing, these iterative solutions are slow in processing a light field. We present a deep learning approach using a new two-branch network architecture, consisting jointly of an autoencoder and a 4D CNN, to recover a high-resolution 4D light field from a single coded 2D image. This network decreases reconstruction time significantly while achieving average PSNR values of 26-32 dB on a variety of light fields. In particular, reconstruction time is decreased from 35 minutes to 6.7 minutes as compared to the dictionary method for equivalent visual quality. These reconstructions are performed at small sampling/compression ratios as low as 8%, allowing for cheaper coded light field cameras. We test our network reconstructions on synthetic light fields, simulated coded measurements of real light fields captured from a Lytro Illum camera, and real coded images from a custom CMOS diffractive light field camera. The combination of compressive light field capture with deep learning allows the potential for real-time light field video acquisition systems in the future.
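The PSNR figures quoted above follow the standard definition PSNR = 10 log10(peak^2 / MSE). The sketch below is an illustrative Python implementation of that definition only; the pixel values are made up, and the assumption that intensities are normalized to a peak of 1.0 is ours, not the thesis's.

```python
import math


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical normalized pixel values; every reconstructed pixel is off by 0.1.
reference = [0.0, 0.5, 1.0, 0.25]
reconstruction = [0.1, 0.4, 0.9, 0.35]
print(round(psnr(reference, reconstruction), 2), "dB")
```

With a uniform error of 0.1 the mean squared error is 0.01, which gives roughly 20 dB; the 26-32 dB range reported in the abstract corresponds to smaller average errors.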

Contributors: Gupta, Mayank (Author) / Turaga, Pavan (Thesis advisor) / Yang, Yezhou (Committee member) / Li, Baoxin (Committee member) / Arizona State University (Publisher)

Created: 2017

Word and Relation Embedding for Sentence Representation

Description

In recent years, several methods have been proposed to encode sentences into fixed-length continuous vectors called sentence representations or sentence embeddings. With recent advances in deep learning methods applied to Natural Language Processing (NLP), these representations play a crucial role in tasks such as named entity recognition, question answering, and sentence classification.

Traditionally, a sentence's vector representation is learned from the representations of its constituent words, also known as word embeddings. Various methods for learning distributed representations (embeddings) of words have been proposed using the notion of distributional semantics, i.e., "the meaning of a word is characterized by the company it keeps". However, the principle of compositionality states that the meaning of a sentence is a function of the meanings of its words and also of the way they are syntactically combined. In many recent methods for sentence representation, syntactic information such as the dependencies or relations between words has been largely ignored.

In this work, I explore the effectiveness of sentence representations that are composed of the representations of both the constituent words and the relations between the words in a sentence. The word and relation embeddings are learned based on their context. These general-purpose embeddings can also be used as off-the-shelf semantic and syntactic features for various NLP tasks. Similarity evaluation tasks were performed on two datasets, showing the usefulness of the learned word embeddings. Experiments were conducted on three different sentence classification tasks, showing that the proposed sentence representations outperform the original word-based sentence representations when used with state-of-the-art neural network architectures.
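One simple way to picture a sentence representation built from both word and relation embeddings is sketched below in Python. The abstract does not specify the composition function, so the choice here (average the word vectors, average the relation vectors, and concatenate the two) is purely an assumption for illustration, as are the toy embedding tables and the dependency labels.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sentence_vector(words, relations, word_emb, rel_emb):
    """Concatenate the mean word embedding with the mean relation embedding (assumed composition)."""
    word_part = mean_vector([word_emb[w] for w in words])
    rel_part = mean_vector([rel_emb[r] for r in relations])
    return word_part + rel_part


# Hypothetical 2-d word embeddings and 3-d relation embeddings.
word_emb = {"dogs": [1.0, 0.0], "chase": [0.0, 1.0], "cats": [1.0, 1.0]}
rel_emb = {"nsubj": [0.5, 0.5, 0.0], "dobj": [0.0, 0.5, 0.5]}

# "dogs chase cats" with its two dependency relations.
vec = sentence_vector(["dogs", "chase", "cats"], ["nsubj", "dobj"], word_emb, rel_emb)
print(len(vec), vec)
```

The resulting fixed-length vector could then be fed to any sentence classifier; a word-only baseline would simply drop the relation part.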

Contributors: Rath, Trideep (Author) / Baral, Chitta (Thesis advisor) / Li, Baoxin (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created: 2017

Toward Reliable Graph Matching: from Deterministic Optimization to Combinatorial Learning

Description

Graph matching is a fundamental but notoriously difficult problem due to its NP-hard nature, and serves as a cornerstone for a series of applications in machine learning and computer vision, such as image matching, dynamic routing, and drug design, to name a few. Although high-performance graph matching solvers have been investigated extensively, it remains challenging to tackle the matching problem in real-world scenarios with severe graph uncertainty (e.g., noise, outliers, and misleading or ambiguous links). The main focus of this dissertation is to investigate the essence of, and propose solutions to, graph matching with higher reliability under such uncertainty. To this end, the research was conducted from three perspectives related to reliable graph matching: modeling, optimization, and learning. For modeling, graph matching is extended from the typical quadratic assignment problem to a more generic mathematical model by introducing a specific family of separable functions, achieving higher capacity and reliability. In terms of optimization, a novel, highly gradient-efficient determinant-based regularization technique is proposed, showing high robustness against outliers. Then the learning paradigm for graph matching under intrinsic combinatorial characteristics is explored. First, a study is conducted on how to fill the gap between the discrete problem and its continuous approximation under a deep learning framework. The dissertation then investigates the necessity of a more reliable latent topology of graphs for matching and proposes an effective and flexible framework to obtain it. The findings of this dissertation include theoretical study and several novel algorithms, with extensive experiments demonstrating their effectiveness.
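As a rough illustration of the quadratic assignment view of graph matching mentioned above, the Python sketch below scores a node correspondence by the number of edges it preserves and finds the best correspondence by exhaustive search. This is a textbook brute-force baseline on two made-up three-node graphs, not any of the dissertation's methods; its factorial cost is exactly why the NP-hard problem calls for the relaxations and learned solvers the abstract describes.

```python
from itertools import permutations


def matching_score(a, b, perm):
    """Quadratic assignment objective: sum of a[i][j] * b[perm[i]][perm[j]] over all node pairs."""
    n = len(a)
    return sum(a[i][j] * b[perm[i]][perm[j]] for i in range(n) for j in range(n))


def brute_force_match(a, b):
    """Try every permutation of nodes; only feasible for very small graphs."""
    n = len(a)
    return max(permutations(range(n)), key=lambda p: matching_score(a, b, p))


# Hypothetical adjacency matrices: a path 0-1-2, and the same path with node 0 as its centre.
a = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
b = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]

best = brute_force_match(a, b)
print(best, matching_score(a, b, best))
```

In the optimal correspondence the centre node of the first graph (node 1) is mapped to the centre node of the second (node 0), and both edges are preserved.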

Contributors: Yu, Tianshu (Author) / Li, Baoxin (Thesis advisor) / Wang, Yalin (Committee member) / Yang, Yezhou (Committee member) / Yang, Yingzhen (Committee member) / Arizona State University (Publisher)

Created: 2021

Vaccinia Virus’ E3 Protein Inhibits Cellular Recognition of Canonical dsRNA and ZRNA

Description

Poxviruses such as monkeypox virus (MPXV) cause emerging zoonotic diseases. Compared to MPXV, vaccinia virus (VACV) has reduced pathogenicity in humans and can be used as a partially protective vaccine against MPXV. While most orthopoxviruses have E3 protein homologues with highly similar N-termini, the MPXV homologue, F3, has a start codon mutation leading to an N-terminal truncation of 37 amino acids. The VACV protein E3 has a dsRNA binding domain in its C-terminus, which must be intact for pathogenicity in murine models and replication in cultured cells. The N-terminus of E3 contains a Z-form nucleic acid (ZNA) binding domain and is also required for pathogenicity in murine models. Poxviruses produce RNA transcripts that extend beyond the transcribed gene and can form double-stranded RNA (dsRNA). The innate immune system easily recognizes dsRNA through proteins such as protein kinase R (PKR). When a vaccinia virus with a wild-type E3 protein (VACV WT) was compared to one with an E3 N-terminal truncation of 37 amino acids (VACV E3Δ37N), phenotypic differences appeared in several cell lines. In HeLa cells and certain murine embryonic fibroblasts (MEFs), dsRNA recognition pathways such as PKR become activated during VACV E3Δ37N infections but not during VACV WT infections. However, MPXV does not activate PKR in HeLa or MEF cells. Additional investigation determined that MPXV produces less dsRNA than VACV. VACV E3Δ37N was made more similar to MPXV by selecting mutants that produce less dsRNA. By producing less dsRNA, VACV E3Δ37N no longer activated PKR in HeLa or MEF cells, thus restoring the wild-type phenotype. Furthermore, in other cell lines such as L929 (also a murine fibroblast), VACV E3Δ37N infection, but not VACV WT infection, leads to activation of DNA-dependent activator of IFN-regulatory factors (DAI) and induction of necroptotic cell death. The same low-dsRNA mutants demonstrate that DAI activation and necroptotic induction are independent of classical dsRNA. Finally, investigations of spread in an animal model and of replication in cell lines where both the PKR and DAI pathways are intact determined that inhibition of both pathways is required for VACV E3Δ37N to replicate.

Contributors: Cotsmire, Samantha (Author) / Jacobs, Bertram L (Thesis advisor) / Varsani, Arvind (Committee member) / Hogue, Brenda (Committee member) / Haydel, Shelley (Committee member) / Arizona State University (Publisher)

Created: 2021

Towards Energy-efficient Visual Navigation: Sensor Quantization and Event-based Vision Pipelines

Description

Visual navigation is a useful and important task for a variety of applications. As the prevalence of robots increases, there is a growing need for energy-efficient navigation methods as well. Many aspects of efficient visual navigation algorithms have been implemented in the literature, but there is a lack of work evaluating the efficiency of the image sensors. In this thesis, two methods are evaluated: adaptive image sensor quantization for traditional camera pipelines and new event-based sensors for low-power computer vision. The first contribution of this thesis is an evaluation of varying levels of linear and logarithmic sensor quantization on the task of visual simultaneous localization and mapping (SLAM). This unconventional method can provide efficiency benefits, with a trade-off between task accuracy and energy efficiency. A new sensor quantization method, gradient-based quantization, is introduced to improve the accuracy of the task. This method lowers the bit level only in parts of the image that are less likely to be important to the SLAM algorithm, since lower bit levels yield better energy efficiency but worse task accuracy. The third contribution is an evaluation of the efficiency and accuracy of event-based camera intensity representations for the task of optical flow. The results of learning-based optical flow are provided for each of five different reconstruction methods, along with ablation studies. Lastly, the challenges of an event feature-based SLAM system are presented, with results demonstrating the necessity of high-quality, high-resolution event data. The work in this thesis provides studies useful for examining trade-offs in an efficient visual navigation system with traditional and event vision sensors. The results of this thesis also suggest multiple directions for future work.
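The difference between linear and logarithmic quantization, and the idea of spending more bits only where the image gradient is large, can be sketched as follows in Python. The formulas, the 8-bit input range, the gradient threshold, and the bit levels are generic illustrative assumptions; the thesis's actual quantizers and its gradient-based rule may differ.

```python
import math


def quantize_linear(pixel, bits, max_value=255):
    """Map a pixel in [0, max_value] to one of 2**bits equally spaced codes."""
    levels = 2 ** bits
    step = (max_value + 1) / levels
    return min(int(pixel / step), levels - 1)


def quantize_log(pixel, bits, max_value=255):
    """Map a pixel to one of 2**bits codes spaced evenly in log(1 + pixel)."""
    levels = 2 ** bits
    scaled = math.log1p(pixel) / math.log1p(max_value)  # in [0, 1]
    return min(int(scaled * levels), levels - 1)


def choose_bits(row, threshold=16, high_bits=8, low_bits=2):
    """Assumed rule: keep high bit depth only where the horizontal gradient is large."""
    bits = []
    for i, p in enumerate(row):
        left = row[i - 1] if i > 0 else p
        right = row[i + 1] if i < len(row) - 1 else p
        bits.append(high_bits if abs(right - left) >= threshold else low_bits)
    return bits


# Hypothetical image row with a single strong edge in the middle.
row = [10, 10, 10, 200, 200, 200]
print(choose_bits(row))
print(quantize_linear(10, 4), quantize_log(10, 4))
```

In this toy rule only the pixels adjacent to the edge keep 8 bits, while flat regions drop to 2 bits. The logarithmic quantizer also assigns more codes to dark pixels than the linear one does at the same bit level.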

Contributors: Christie, Olivia Catherine (Author) / Jayasuriya, Suren (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)

Created: 2022

Topology Processing of Retinotopic Maps

Description

A retinotopic map, the map between visual inputs on the retina and neuronal activation in the brain's visual areas, is one of the central topics in visual neuroscience. For human observers, the map is typically obtained by analyzing functional magnetic resonance imaging (fMRI) signals of cortical responses to slowly moving visual stimuli on the retina. Biological evidence shows that retinotopic mapping is topology-preserving, or topological (i.e., it keeps neighboring relationships after processing by the human brain), within each visual region. Unfortunately, due to the limited spatial resolution and signal-to-noise ratio of fMRI, state-of-the-art retinotopic maps are not topological. The aim of this work was to model the topology-preserving condition mathematically, fix non-topological retinotopic maps with numerical methods, and improve the quality of retinotopic maps. Imposing the topological condition benefits several applications. With topological retinotopic maps, one may gain better insight into human retinotopic maps, including better quantification of the cortical magnification factor, more precise descriptions of retinotopic maps, and potentially better examination methods in ophthalmology clinics.

Contributors: Tu, Yanshuai (Author) / Wang, Yalin (Thesis advisor) / Lu, Zhong-Lin (Committee member) / Crook, Sharon (Committee member) / Yang, Yezhou (Committee member) / Zhang, Yu (Committee member) / Arizona State University (Publisher)

Created: 2022

Novel Applications of Wastewater-based Epidemiology for Assessing Population Nutrition, Infectious Disease, and Chronic Illness

Description

Traditional public health strategies for assessing human behavior, exposure, and activity are considered resource-exhaustive, time-consuming, and expensive, warranting a need for alternative methods to enhance data acquisition and subsequent interventions. This dissertation critically evaluated the use of wastewater-based epidemiology (WBE) as an inclusive and non-invasive tool for conducting near real-time population health assessments. A rigorous literature review was performed to gauge the current landscape of WBE for monitoring biomarkers indicative of diet, as well as exposure to estrogen-mimicking endocrine disrupting (EED) chemicals via ingestion. Wastewater-derived measurements of phytoestrogens from August 2017 through July 2019 (n = 156 samples) in a small sewer catchment revealed seasonal patterns, with the highest average per capita consumption rates in January through March of each year (2018: 7.0 ± 2.0 mg d⁻¹; 2019: 8.2 ± 2.3 mg d⁻¹) and statistically significant differences (p = 0.01) between fall and winter (3.4 ± 1.2 vs. 6.1 ± 2.9 mg d⁻¹; p ≤ 0.01) and between spring and summer (5.6 ± 2.1 vs. 3.4 ± 1.5 mg d⁻¹; p ≤ 0.01). Additional investigations, including a human gut microbial composition analysis of community wastewater, were performed to support a methodological framework for future implementation of WBE to assess population-level dietary behavior. In response to the COVID-19 global pandemic, a high-frequency, high-resolution sample collection approach with public data sharing was implemented throughout the City of Tempe, Arizona, and samples were analyzed for SARS-CoV-2 (E gene) from April 2020 through March 2021 (n = 1,556 samples). Results indicate early warning capability during the first wave (June 2020) compared to newly reported clinical cases (8.5 ± 2.1 days), later transitioning to a slight lagging indicator in December/January 2020-21 (-2.0 ± 1.4 days). A viral hotspot within a larger catchment area was detected, prompting targeted interventions that successfully mitigated community spread and reinforcing the importance of sample collection within the sewer infrastructure. I conclude that, by working in tandem with traditional approaches, WBE can inform a comprehensive understanding of population health, and that the methods and strategies implemented in this work are recommended for future expansion to produce timely, actionable data in support of public health.

Contributors: Bowes, Devin Ashley (Author) / Halden, Rolf U (Thesis advisor) / Krajmalnik-Brown, Rosa (Thesis advisor) / Conroy-Ben, Otakuye (Committee member) / Varsani, Arvind (Committee member) / Whisner, Corrie (Committee member) / Arizona State University (Publisher)

Created: 2022

ARGOS: Adaptive Recognition and Gated Operation System for Real-time Vision Applications

Description

Deep neural network-based methods have been proven to achieve outstanding performance on object detection and classification tasks. Deep neural networks follow the "deeper model with deeper confidence" belief to gain higher recognition accuracy. However, reducing these networks' computational costs remains a challenge, which impedes their deployment on embedded devices. For instance, the intersection management of Connected Autonomous Vehicles (CAVs) requires running computationally intensive object recognition algorithms on low-power traffic cameras. This dissertation studies the effect of a dynamic hardware and software approach to address this issue. Characteristics of real-world applications can facilitate this dynamic adjustment and reduce the computation. Specifically, this dissertation starts with a dynamic hardware approach that adjusts itself based on the difficulty of the input and extracts deeper features if needed. Next, an adaptive learning mechanism is studied that uses features extracted from previous inputs to improve system performance. Finally, a system (ARGOS) is proposed and evaluated that can run on embedded systems while maintaining the desired accuracy. This system adopts shallow features at inference time, but it can switch to deep features if a higher accuracy is desired. To improve performance, ARGOS distills temporal knowledge from the deep features into the shallow system. Moreover, ARGOS reduces the computation further by focusing on regions of interest. Response time and mean average precision are used to evaluate the proposed ARGOS system.
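The gating idea (use shallow features by default and fall back to deep features when more accuracy is needed) might look like the Python sketch below. The confidence-threshold rule, the function names, and the two toy models are hypothetical stand-ins chosen for illustration; the abstract does not describe how ARGOS actually decides when to switch.

```python
def gated_predict(x, shallow_model, deep_model, confidence_threshold=0.8):
    """Run the cheap model first; call the expensive model only when confidence is low."""
    label, confidence = shallow_model(x)
    if confidence >= confidence_threshold:
        return label, "shallow"
    label, _ = deep_model(x)
    return label, "deep"


# Hypothetical stand-ins for real networks: each returns (label, confidence).
def toy_shallow(x):
    return ("car", 0.9) if x > 0.5 else ("unknown", 0.3)


def toy_deep(x):
    return ("car", 0.99) if x > 0.5 else ("pedestrian", 0.95)


print(gated_predict(0.9, toy_shallow, toy_deep))
print(gated_predict(0.1, toy_shallow, toy_deep))
```

Under this rule the deep model runs only on inputs the shallow model is unsure about, which is where the computational savings would come from.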

Contributors: Farhadi, Mohammad (Author) / Yang, Yezhou (Thesis advisor) / Vrudhula, Sarma (Committee member) / Wu, Carole-Jean (Committee member) / Ren, Yi (Committee member) / Arizona State University (Publisher)

Created: 2022
