Search Content

Context recognition methods using audio signals for human-machine interaction

Description

Audio signals, such as speech and ambient sounds convey rich information pertaining to a user’s activity, mood or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. This is challenging due to its subjective nature, hence, requiring sophisticated techniques. This dissertation presents…

Audio signals, such as speech and ambient sounds convey rich information pertaining to a user’s activity, mood or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. This is challenging due to its subjective nature, hence, requiring sophisticated techniques. This dissertation presents a set of computational methods, that generalize well across different conditions, for speech-based applications involving emotion recognition and keyword detection, and ambient sounds-based applications such as lifelogging.

The expression and perception of emotions varies across speakers and cultures, thus, determining features and classification methods that generalize well to different conditions is strongly desired. A latent topic models-based method is proposed to learn supra-segmental features from low-level acoustic descriptors. The derived features outperform state-of-the-art approaches over multiple databases. Cross-corpus studies are conducted to determine the ability of these features to generalize well across different databases. The proposed method is also applied to derive features from facial expressions; a multi-modal fusion overcomes the deficiencies of a speech only approach and further improves the recognition performance.

Besides affecting the acoustic properties of speech, emotions have a strong influence over speech articulation kinematics. A learning approach, which constrains a classifier trained over acoustic descriptors, to also model articulatory data is proposed here. This method requires articulatory information only during the training stage, thus overcoming the challenges inherent to large-scale data collection, while simultaneously exploiting the correlations between articulation kinematics and acoustic descriptors to improve the accuracy of emotion recognition systems.

Identifying context from ambient sounds in a lifelogging scenario requires feature extraction, segmentation and annotation techniques capable of efficiently handling long duration audio recordings; a complete framework for such applications is presented. The performance is evaluated on real world data and accompanied by a prototypical Android-based user interface.

The proposed methods are also assessed in terms of computation and implementation complexity. Software and field programmable gate array based implementations are considered for emotion recognition, while virtual platforms are used to model the complexities of lifelogging. The derived metrics are used to determine the feasibility of these methods for applications requiring real-time capabilities and low power consumption.

ContributorsShah, Mohit (Author) / Spanias, Andreas (Thesis advisor) / Chakrabarti, Chaitali (Thesis advisor) / Berisha, Visar (Committee member) / Turaga, Pavan (Committee member) / Arizona State University (Publisher)

Created2015

Data-Driven Representation Learning in Multimodal Feature Fusion

Description

Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is at the core to achieve improved model robustness and inferencing performance.…

Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is at the core to achieve improved model robustness and inferencing performance. This dissertation focuses on the representation learning approaches as the fusion strategy. Specifically, the objective is to learn the shared latent representation which jointly exploit the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction.

We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described to support both multiple sensors and descriptors for activity recognition. Targeted to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision etc. Utilizing the MKL formulation, next we describe an auto-context algorithm for learning image context via the fusion with low-level descriptors. Furthermore, a principled fusion algorithm using deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems.

In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently, special design of the learning architecture is needed. In order to improve the temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare. Another model coupled with triplet ranking loss as metric learning framework is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having a lower computational complexity. Finally, in order to perform community detection on multilayer graphs, a fusion algorithm is described to derive node embedding from word embedding techniques and also exploit the complementary relational information contained in each layer of the graph.

ContributorsSong, Huan (Author) / Spanias, Andreas (Thesis advisor) / Thiagarajan, Jayaraman (Committee member) / Berisha, Visar (Committee member) / Tepedelenlioğlu, Cihan (Committee member) / Arizona State University (Publisher)

Created2018

Joint Optimization of Quantization and Structured Sparsity for Compressed Deep Neural Networks

Description

Deep neural networks (DNN) have shown tremendous success in various cognitive tasks, such as image classification, speech recognition, etc. However, their usage on resource-constrained edge devices has been limited due to high computation and large memory requirement.

To overcome these challenges, recent works have extensively investigated model compression techniques such…

Deep neural networks (DNN) have shown tremendous success in various cognitive tasks, such as image classification, speech recognition, etc. However, their usage on resource-constrained edge devices has been limited due to high computation and large memory requirement.

To overcome these challenges, recent works have extensively investigated model compression techniques such as element-wise sparsity, structured sparsity and quantization. While most of these works have applied these compression techniques in isolation, there have been very few studies on application of quantization and structured sparsity together on a DNN model.

This thesis co-optimizes structured sparsity and quantization constraints on DNN models during training. Specifically, it obtains optimal setting of 2-bit weight and 2-bit activation coupled with 4X structured compression by performing combined exploration of quantization and structured compression settings. The optimal DNN model achieves 50X weight memory reduction compared to floating-point uncompressed DNN. This memory saving is significant since applying only structured sparsity constraints achieves 2X memory savings and only quantization constraints achieves 16X memory savings. The algorithm has been validated on both high and low capacity DNNs and on wide-sparse and deep-sparse DNN models. Experiments demonstrated that deep-sparse DNN outperforms shallow-dense DNN with varying level of memory savings depending on DNN precision and sparsity levels. This work further proposed a Pareto-optimal approach to systematically extract optimal DNN models from a huge set of sparse and dense DNN models. The resulting 11 optimal designs were further evaluated by considering overall DNN memory which includes activation memory and weight memory. It was found that there is only a small change in the memory footprint of the optimal designs corresponding to the low sparsity DNNs. However, activation memory cannot be ignored for high sparsity DNNs.

ContributorsSrivastava, Gaurav (Author) / Seo, Jae-Sun (Thesis advisor) / Chakrabarti, Chaitali (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created2018

Producing Acoustic-Prosodic Entrainment in a Robotic Learning Companion to Build Learner Rapport

Description

With advances in automatic speech recognition, spoken dialogue systems are assuming increasingly social roles. There is a growing need for these systems to be socially responsive, capable of building rapport with users. In human-human interactions, rapport is critical to patient-doctor communication, conflict resolution, educational interactions, and social engagement. Rapport between…

With advances in automatic speech recognition, spoken dialogue systems are assuming increasingly social roles. There is a growing need for these systems to be socially responsive, capable of building rapport with users. In human-human interactions, rapport is critical to patient-doctor communication, conflict resolution, educational interactions, and social engagement. Rapport between people promotes successful collaboration, motivation, and task success. Dialogue systems which can build rapport with their user may produce similar effects, personalizing interactions to create better outcomes.

This dissertation focuses on how dialogue systems can build rapport utilizing acoustic-prosodic entrainment. Acoustic-prosodic entrainment occurs when individuals adapt their acoustic-prosodic features of speech, such as tone of voice or loudness, to one another over the course of a conversation. Correlated with liking and task success, a dialogue system which entrains may enhance rapport. Entrainment, however, is very challenging to model. People entrain on different features in many ways and how to design entrainment to build rapport is unclear. The first goal of this dissertation is to explore how acoustic-prosodic entrainment can be modeled to build rapport.

Towards this goal, this work presents a series of studies comparing, evaluating, and iterating on the design of entrainment, motivated and informed by human-human dialogue. These models of entrainment are implemented in the dialogue system of a robotic learning companion. Learning companions are educational agents that engage students socially to increase motivation and facilitate learning. As a learning companion’s ability to be socially responsive increases, so do vital learning outcomes. A second goal of this dissertation is to explore the effects of entrainment on concrete outcomes such as learning in interactions with robotic learning companions.

This dissertation results in contributions both technical and theoretical. Technical contributions include a robust and modular dialogue system capable of producing prosodic entrainment and other socially-responsive behavior. One of the first systems of its kind, the results demonstrate that an entraining, social learning companion can positively build rapport and increase learning. This dissertation provides support for exploring phenomena like entrainment to enhance factors such as rapport and learning and provides a platform with which to explore these phenomena in future work.

ContributorsLubold, Nichola Anne (Author) / Walker, Erin (Thesis advisor) / Pon-Barry, Heather (Thesis advisor) / Litman, Diane (Committee member) / VanLehn, Kurt (Committee member) / Berisha, Visar (Committee member) / Arizona State University (Publisher)

Created2018

Helix: A First Game Retrospective

Description

The original version of Helix, the one I pitched when first deciding to make a video game
for my thesis, is an action-platformer, with the intent of metroidvania-style progression
and an interconnected world map.

The current version of Helix is a turn based role-playing game, with the intent of roguelike
gameplay and a dark…

The original version of Helix, the one I pitched when first deciding to make a video game
for my thesis, is an action-platformer, with the intent of metroidvania-style progression
and an interconnected world map.

The current version of Helix is a turn based role-playing game, with the intent of roguelike
gameplay and a dark fantasy theme. We will first be exploring the challenges that came
with programming my own game - not quite from scratch, but also without a prebuilt
engine - then transition into game design and how Helix has evolved from its original form
to what we see today.

ContributorsDiscipulo, Isaiah K (Author) / Meuth, Ryan (Thesis director) / Kobayashi, Yoshihiro (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2020-05

RecyclePlus: A Sustainability iOS Application

Description

RecyclePlus is an iOS mobile application that allows users to be knowledgeable in the realms of sustainability. It gives encourages users to be environmental responsible by providing them access to recycling information. In particular, it allows users to search up certain materials and learn about its recyclability and how to…

RecyclePlus is an iOS mobile application that allows users to be knowledgeable in the realms of sustainability. It gives encourages users to be environmental responsible by providing them access to recycling information. In particular, it allows users to search up certain materials and learn about its recyclability and how to properly dispose of the material. Some searches will show locations of facilities near users that collect certain materials and dispose of the materials properly. This is a full stack software project that explores open source software and APIs, UI/UX design, and iOS development.

ContributorsTran, Nikki (Author) / Ganesh, Tirupalavanam (Thesis director) / Meuth, Ryan (Committee member) / Watts College of Public Service & Community Solut (Contributor) / Department of Information Systems (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2020-05

Blockchain Design and Simulation

Description

This paper details the specification and implementation of a single-machine blockchain simulator. It also includes a brief introduction on the history & underlying concepts of blockchain, with explanations on features such as decentralization, openness, trustlessness, and consensus. The introduction features a brief overview of public interest and current implementations of…

This paper details the specification and implementation of a single-machine blockchain simulator. It also includes a brief introduction on the history & underlying concepts of blockchain, with explanations on features such as decentralization, openness, trustlessness, and consensus. The introduction features a brief overview of public interest and current implementations of blockchain before stating potential use cases for blockchain simulation software. The paper then gives a brief literature review of blockchain's role, both as a disruptive technology and a foundational technology. The literature review also addresses the potential and difficulties regarding the use of blockchain in Internet of Things (IoT) networks, and also describes the limitations of blockchain in general regarding computational intensity, storage capacity, and network architecture. Next, the paper gives the specification for a generic blockchain structure, with summaries on the behaviors and purposes of transactions, blocks, nodes, miners, public & private key cryptography, signature validation, and hashing. Finally, the author gives an overview of their specific implementation of the blockchain using C/C++ and OpenSSL. The overview includes a brief description of all the classes and data structures involved in the implementation, including their function and behavior. While the implementation meets the requirements set forward in the specification, the results are more qualitative and intuitive, as time constraints did not allow for quantitative measurements of the network simulation. The paper concludes by discussing potential applications for the simulator, and the possibility for future hardware implementations of blockchain.

ContributorsRauschenbach, Timothy Rex (Author) / Vrudhula, Sarma (Thesis director) / Nakamura, Mutsumi (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2017-12

Fielding an Autonomous Cobot in a University Environment: Engineering and Evaluation

Description

Many researchers aspire to create robotics systems that assist humans in common office tasks, especially by taking over delivery and messaging tasks. For meaningful interactions to take place, a mobile robot must be able to identify the humans it interacts with and communicate successfully with them. It must also be…

Many researchers aspire to create robotics systems that assist humans in common office tasks, especially by taking over delivery and messaging tasks. For meaningful interactions to take place, a mobile robot must be able to identify the humans it interacts with and communicate successfully with them. It must also be able to successfully navigate the office environment. While mobile robots are well suited for navigating and interacting with elements inside a deterministic office environment, attempting to interact with human beings in an office environment remains a challenge due to the limits on the amount of cost-efficient compute power onboard the robot. In this work, I propose the use of remote cloud services to offload intensive interaction tasks. I detail the interactions required in an office environment and discuss the challenges faced when implementing a human-robot interaction platform in a stochastic office environment. I also experiment with cloud services for facial recognition, speech recognition, and environment navigation and discuss my results. As part of my thesis, I have implemented a human-robot interaction system utilizing cloud APIs into a mobile robot, enabling it to navigate the office environment, identify humans within the environment, and communicate with these humans.

ContributorsDSouza, Daniel Anand (Author) / Kambhampati, Subbarao (Thesis director) / Zhang, Yu (Committee member) / Computer Science and Engineering Program (Contributor) / School of Computing, Informatics, and Decision Systems Engineering (Contributor) / Barrett, The Honors College (Contributor)

Created2017-05

Leye (Lie) Detector \u2014 A Study of Lie Detection using Eye Tracking, Facial Gestures, and EEG

Description

Lie detection is used prominently in contemporary society for many purposes such as for pre-employment screenings, granting security clearances, and determining if criminals or potential subjects may or may not be lying, but by no means is not limited to that scope. However, lie detection has been criticized for being…

Lie detection is used prominently in contemporary society for many purposes such as for pre-employment screenings, granting security clearances, and determining if criminals or potential subjects may or may not be lying, but by no means is not limited to that scope. However, lie detection has been criticized for being subjective, unreliable, inaccurate, and susceptible to deliberate manipulation. Furthermore, critics also believe that the administrator of the test also influences the outcome as well. As a result, the polygraph machine, the contemporary device used for lie detection, has come under scrutiny when used as evidence in the courts. The purpose of this study is to use three entirely different tools and concepts to determine whether eye tracking systems, electroencephalogram (EEG), and Facial Expression Emotion Analysis (FACET) are reliable tools for lie detection. This study found that certain constructs such as where the left eye is looking at in regard to its usual position and engagement levels in eye tracking and EEG respectively could distinguish between truths and lies. However, the FACET proved the most reliable tool out of the three by providing not just one distinguishing variable but seven, all related to emotions derived from movements in the facial muscles during the present study. The emotions associated with the FACET that were documented to possess the ability to distinguish between truthful and lying responses were joy, anger, fear, confusion, and frustration. In addition, an overall measure of the subject's neutral and positive emotional expression were found to be distinctive factors. The implications of this study and future directions are discussed.

ContributorsSeto, Raymond Hua (Author) / Atkinson, Robert (Thesis director) / Runger, George (Committee member) / W. P. Carey School of Business (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2017-05

Visual Analytics and the Impact of Inter-Country Trade on Violence

Description

Global violent conflict has become an increasing problem in recent decades, especially in the African continent. Civil wars, terrorism, riots, and political violence has wrought havoc not only on civilian lives, but also on economic foundations. Trade networks are a way to measure these economic foundations. To summarize trade networks…

Global violent conflict has become an increasing problem in recent decades, especially in the African continent. Civil wars, terrorism, riots, and political violence has wrought havoc not only on civilian lives, but also on economic foundations. Trade networks are a way to measure these economic foundations. To summarize trade networks clustering coefficient as well as trade quantity/value summation measures are used. To understand effects of global trade on violent conflict, Pearson product-moment correlations are utilized. This work details a comparison of African national economies and violent conflict events using clustering coefficient, trade summation measures and Pearson correlation coefficient.

ContributorsKadambi, Sagarika Sanjay (Author) / Maciejewski, Ross (Thesis director) / Shutters, Shade (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2017-05

Filtering by