Matching Items (80)
- Genre: Academic theses
- Creators: Turaga, Pavan
- Resource Type: Text
Audio signals, such as speech and ambient sounds, convey rich information pertaining to a user’s activity, mood or intent. Enabling machines to understand this contextual information is necessary to bridge the gap in human-machine interaction. This is challenging because of the subjective nature of such information, and it therefore requires sophisticated techniques. This dissertation presents a set of computational methods that generalize well across different conditions, for speech-based applications involving emotion recognition and keyword detection, and for ambient sound-based applications such as lifelogging.
The expression and perception of emotions vary across speakers and cultures; features and classification methods that generalize well to different conditions are therefore strongly desired. A latent topic model-based method is proposed to learn supra-segmental features from low-level acoustic descriptors. The derived features outperform state-of-the-art approaches over multiple databases. Cross-corpus studies are conducted to determine the ability of these features to generalize well across different databases. The proposed method is also applied to derive features from facial expressions; a multi-modal fusion overcomes the deficiencies of a speech-only approach and further improves the recognition performance.
Besides affecting the acoustic properties of speech, emotions have a strong influence over speech articulation kinematics. A learning approach is proposed that constrains a classifier trained over acoustic descriptors to also model articulatory data. This method requires articulatory information only during the training stage, thus overcoming the challenges inherent to large-scale data collection, while simultaneously exploiting the correlations between articulation kinematics and acoustic descriptors to improve the accuracy of emotion recognition systems.
Identifying context from ambient sounds in a lifelogging scenario requires feature extraction, segmentation and annotation techniques capable of efficiently handling long-duration audio recordings; a complete framework for such applications is presented. The performance is evaluated on real-world data and accompanied by a prototypical Android-based user interface.
The proposed methods are also assessed in terms of computational and implementation complexity. Software and field-programmable gate array (FPGA) based implementations are considered for emotion recognition, while virtual platforms are used to model the complexities of lifelogging. The derived metrics are used to determine the feasibility of these methods for applications requiring real-time capabilities and low power consumption.
Image segmentation is of great importance and value in many applications. In computer vision, image segmentation is the process of locating objects and boundaries within images. The segmentation result may provide more meaningful image data. Generally, image segmentation algorithms fall into two fundamental categories: discontinuity-based and similarity-based. Discontinuity-based methods locate abrupt changes in image intensity, as are often seen at edges or boundaries. Similarity-based methods subdivide an image into regions that satisfy predefined criteria. The algorithm used in this thesis belongs to the second category.
This study addresses the problem of particle image segmentation by measuring the similarity between a sampled region and an adjacent region, based on the Bhattacharyya distance and an image feature extraction technique that uses the distribution of local binary patterns and pattern contrasts. A boundary smoothing process is developed to improve the accuracy of the segmentation. The novel particle image segmentation algorithm is tested using four different cases of particle image velocimetry (PIV) images. The experimental segmentation results partition the objects with an error rate within 10 percent. Ground-truth segmentation data, obtained by manually segmenting an image from each case, are used to calculate the error rate of the segmentations.
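As a rough illustration of the region-similarity test described above, the following sketch compares two feature histograms (for example, local binary pattern histograms of a sampled region and of a neighbouring region) with the Bhattacharyya distance. It assumes the common form D = -ln(sum_i sqrt(p_i q_i)); the thesis may use a different variant, and the function names are hypothetical.

```python
import math


def normalize(hist):
    """Scale a non-negative histogram so that its bins sum to 1."""
    total = float(sum(hist))
    if total <= 0:
        raise ValueError("histogram must have positive mass")
    return [h / total for h in hist]


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two histograms of equal length.

    Assumed form: D = -ln(BC), where BC = sum_i sqrt(p_i * q_i) is the
    Bhattacharyya coefficient of the normalized histograms.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    p, q = normalize(p), normalize(q)
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    bc = min(bc, 1.0)  # guard against rounding slightly above 1
    if bc == 0.0:
        return math.inf  # no overlapping bins at all
    return max(0.0, -math.log(bc))


# Hypothetical 4-bin texture histograms of a sampled and an adjacent region.
sampled = [10, 30, 40, 20]
adjacent = [12, 28, 38, 22]
print(bhattacharyya_distance(sampled, adjacent))
```

A small distance suggests the two regions share a texture distribution and could be merged; the threshold for that decision is a design choice not stated in the abstract.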
Advancements in mobile technologies have significantly enhanced the capabilities of mobile devices to serve as powerful platforms for sensing, processing, and visualization. Surges in sensing technology and the abundance of data have enabled the use of these portable devices for real-time data analysis and decision-making in digital signal processing (DSP) applications. Most current efforts in DSP education focus on building tools to facilitate understanding of the mathematical principles. However, there is a disconnect between real-world data processing problems and the material presented in a DSP course. Sophisticated mobile interfaces and apps can potentially play a crucial role in providing students with hands-on experience of modern DSP applications. In this work, a new paradigm of DSP learning is explored by building an interactive, easy-to-use health monitoring application for use in DSP courses. This is motivated by the increasing commercial interest in employing mobile phones for real-time health monitoring tasks. The idea is to exploit the computational abilities of the Android platform to build m-Health modules with sensor interfaces. In particular, appropriate sensing modalities have been identified, and a suite of software functionalities has been developed. Within the existing framework of the AJDSP app, a graphical programming environment, interfaces to on-board and external sensor hardware have also been developed to acquire and process physiological data. The set of sensor signals that can be monitored includes electrocardiogram (ECG), photoplethysmogram (PPG), accelerometer signal, and galvanic skin response (GSR). The proposed m-Health modules can be used to estimate parameters such as heart rate, oxygen saturation, step count, and heart rate variability. A set of laboratory exercises has been designed to demonstrate the use of these modules in DSP courses.
The app was evaluated through several workshops involving graduate and undergraduate students in signal processing majors at Arizona State University. The usefulness of the software modules in enhancing student understanding of signals, sensors and DSP systems was analyzed. Student opinions about the app and the proposed m-Health modules evidenced the merits of integrating tools for mobile sensing and processing in a DSP curriculum, and of familiarizing students with challenges in modern data-driven applications.
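To make the kind of parameter estimation mentioned in this abstract concrete, here is a minimal sketch of how heart rate and one simple heart rate variability statistic could be derived from beat timestamps (for example, detected ECG R-peaks or PPG pulse peaks). It is not the AJDSP implementation: peak detection is assumed to have been done already, the function names are hypothetical, and the variability measure shown (standard deviation of beat-to-beat intervals) is only one of several in common use.

```python
from statistics import mean, pstdev


def beat_intervals(peak_times_s):
    """Beat-to-beat intervals in seconds from sorted peak timestamps."""
    return [b - a for a, b in zip(peak_times_s, peak_times_s[1:])]


def heart_rate_bpm(peak_times_s):
    """Average heart rate in beats per minute."""
    intervals = beat_intervals(peak_times_s)
    if not intervals:
        raise ValueError("need at least two peaks")
    return 60.0 / mean(intervals)


def interval_std_ms(peak_times_s):
    """Standard deviation of beat-to-beat intervals, in milliseconds."""
    intervals = beat_intervals(peak_times_s)
    if not intervals:
        raise ValueError("need at least two peaks")
    return 1000.0 * pstdev(intervals)


# Hypothetical peak timestamps in seconds.
peaks = [0.00, 0.82, 1.61, 2.45, 3.24]
print(heart_rate_bpm(peaks), interval_std_ms(peaks))
```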
One of the main challenges in planetary robotics is to traverse the shortest path through a set of waypoints. The shortest distance between any two waypoints is a direct linear traversal. Often, there are physical restrictions that prevent a rover from traversing straight to a waypoint. Thus, knowledge of the terrain is needed prior to traversal. The Digital Terrain Model (DTM) provides information about the terrain along with waypoints for the rover to traverse. However, traversing a set of waypoints linearly is burdensome, as the rover would constantly need to modify its orientation as it successively approaches waypoints. Although there are various solutions to this problem, this research paper proposes spline-based smooth traversal as a quick and easy-to-implement way to traverse a set of waypoints. In addition, a rover was used to compare the smoothness of linear traversal with that of spline interpolation. The data collected showed that spline traversals had a lower rate of change in velocity over time, indicating that the rover moved more smoothly than along linear paths.
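The idea of replacing straight-line legs with a smooth curve through the waypoints can be sketched as follows. The abstract does not say which spline family was used; this example assumes a uniform Catmull-Rom spline, chosen here only because it passes through every waypoint, and all names are hypothetical.

```python
def catmull_rom_point(p0, p1, p2, p3, t):
    """Point on the uniform Catmull-Rom segment between p1 and p2, t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return tuple(
        0.5 * (2 * b
               + (-a + c) * t
               + (2 * a - 5 * b + 4 * c - d) * t2
               + (-a + 3 * b - 3 * c + d) * t3)
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def smooth_path(waypoints, samples_per_segment=10):
    """Sample a smooth path that passes through every waypoint in order."""
    if len(waypoints) < 2:
        raise ValueError("need at least two waypoints")
    # Duplicate the end points so the first and last segments are defined.
    pts = [waypoints[0]] + list(waypoints) + [waypoints[-1]]
    path = []
    for i in range(len(pts) - 3):
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            path.append(catmull_rom_point(pts[i], pts[i + 1], pts[i + 2], pts[i + 3], t))
    path.append(tuple(float(c) for c in waypoints[-1]))
    return path


# Hypothetical 2-D waypoints, e.g. taken from a terrain model.
waypoints = [(0, 0), (1, 2), (3, 3), (4, 0)]
for x, y in smooth_path(waypoints, samples_per_segment=4):
    print(round(x, 3), round(y, 3))
```

Because heading changes gradually along the sampled curve, a rover following it does not need to stop and re-orient at each waypoint, which is the behaviour the abstract contrasts with linear traversal.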
Continuous monitoring of sensor data from smart phones to identify human activities and gestures puts a heavy load on the smart phone's power consumption. In this research study, the non-Euclidean geometry of the rich sensor data obtained from the user's smart phone is utilized to perform compressive analysis and efficient classification of human activities by employing machine learning techniques. We are interested in the generalization of classical tools for signal approximation to newer spaces, such as rotation data, which is best studied in a non-Euclidean setting, and its application to activity analysis. Because the rotation data space is non-linear, feature extraction on it places a heavier load on the smart phone's processor and memory than feature extraction in Euclidean space; indexing and compaction of the acquired sensor data are therefore performed prior to feature extraction, to reduce CPU overhead and thereby extend battery life with little loss in activity recognition accuracy. The sensor data, represented as unit quaternions, provide a more intrinsic representation of the smart phone's orientation than Euler angles (which suffer from the gimbal lock problem) or the computationally intensive rotation matrices. Classification algorithms are employed to classify these manifold sequences in the non-Euclidean space. By performing customized indexing (using the K-means algorithm) of the evolved manifold sequences before feature extraction, considerable savings in the smart phone's battery life are achieved.
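The following sketch illustrates why rotation data call for a non-Euclidean treatment, and what indexing against a codebook might look like. It assumes orientations are given as unit quaternions (w, x, y, z), measures distance by the rotation angle between them, and maps each sample to its nearest codeword. The codebook here is hand-picked for illustration; the thesis learns its index with K-means, whose details are not given in the abstract.

```python
import math


def normalize(q):
    """Return q scaled to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("zero quaternion has no orientation")
    return tuple(c / norm for c in q)


def geodesic_distance(q1, q2):
    """Rotation angle in radians between two orientations.

    The absolute value of the dot product is used because q and -q
    represent the same rotation.
    """
    q1, q2 = normalize(q1), normalize(q2)
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    return 2.0 * math.acos(min(1.0, dot))


def nearest_index(q, codebook):
    """Index of the codeword closest to q under the geodesic distance."""
    return min(range(len(codebook)), key=lambda i: geodesic_distance(q, codebook[i]))


identity = (1.0, 0.0, 0.0, 0.0)
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
codebook = [identity, quarter_turn_z]  # hypothetical two-entry index

sample = (0.75, 0.0, 0.0, 0.66)  # noisy reading near the quarter turn
print(geodesic_distance(identity, quarter_turn_z))  # close to pi / 2
print(nearest_index(sample, codebook))
```

Replacing each sample by a small integer index is what allows later feature extraction to run on compact data instead of raw manifold-valued sequences.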
As a promising solution to the problem of acquiring and storing large amounts of image and video data, spatial-multiplexing camera architectures have received a lot of attention in the recent past. Such architectures have the attractive feature of combining the two-step process of acquisition and compression of pixel measurements in a conventional camera into a single step. A popular variant is the single-pixel camera, which obtains measurements of the scene using a pseudo-random measurement matrix. Advances in compressive sensing (CS) theory in the past decade have supplied the tools that, in theory, allow near-perfect reconstruction of an image from these measurements even at sub-Nyquist sampling rates. However, current state-of-the-art reconstruction algorithms suffer from two drawbacks: they are (1) computationally very expensive and (2) incapable of yielding high-fidelity reconstructions at high compression ratios. In computer vision, the final goal is usually to perform an inference task using the acquired images, not signal recovery. With this motivation, this thesis considers the possibility of inference directly from compressed measurements, thereby obviating the need for expensive reconstruction algorithms. Non-linear features are often used for inference tasks in computer vision. However, it is currently unclear how to extract such features from compressed measurements. Instead, using the theoretical basis provided by the Johnson-Lindenstrauss lemma, discriminative features are derived using smashed correlation filters, and it is shown that reconstruction-free inference is indeed possible at high compression ratios with only a marginal loss in accuracy. As a specific inference problem in computer vision, face recognition is considered, mainly beyond the visible spectrum, such as in the short-wave infrared (SWIR) region, where sensors are expensive.
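The property that makes reconstruction-free inference plausible can be demonstrated in a few lines: a random projection approximately preserves inner products, so a correlation score computed on compressed measurements tracks the score on the original signal. The sketch below is only an illustration of that Johnson-Lindenstrauss-style behaviour. It assumes a Gaussian measurement matrix with variance 1/m and a plain template correlation; it is not the thesis's filter design, and all names are hypothetical.

```python
import math
import random


def measurement_matrix(m, n, seed=0):
    """m x n matrix with i.i.d. Gaussian entries of variance 1/m (assumed design)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n)] for _ in range(m)]


def compress(phi, x):
    """Compressed measurements y = phi @ x."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def smashed_score(phi, y, template):
    """Correlate measurements with a template projected by the same matrix."""
    return dot(y, compress(phi, template))


n, m = 256, 200
phi = measurement_matrix(m, n)

# Two orthogonal unit-norm signals standing in for image classes.
signal = [1.0 / 16.0] * n
other = [(1.0 if i % 2 == 0 else -1.0) / 16.0 for i in range(n)]

y = compress(phi, signal)
print("matching template:", smashed_score(phi, y, signal))    # near 1
print("mismatched template:", smashed_score(phi, y, other))   # near 0
```

The scores fluctuate around the uncompressed inner products with a spread that shrinks as the number of measurements m grows, which is why accuracy degrades gracefully rather than abruptly as compression increases.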
Computational thinking, the fundamental way of thinking in computer science, including the information sourcing and problem solving behind programming, is considered vital to children who live in a digital era. Most current educational games designed to teach children about coding either rely on external curricular materials or are too complicated to work well with young children. In this thesis project, Guardy, an iOS tower defense game, was developed to help children over 8 years old learn about and practice using basic concepts in programming. The game is built with SpriteKit, a graphics rendering and animation infrastructure in Apple’s integrated development environment Xcode, which simplifies switching among game scenes and animating game sprites during development. In a typical game, players arrange a sequence of operations to destroy incoming enemy minions. Basic coding concepts such as looping, sequencing, conditionals, and classification are integrated in different levels. In later levels, players are required to type in commands and put them in order to keep playing the game. To reduce the difficulty of usability testing, a method combining questionnaires and observation was applied with two groups of college students: one with no programming experience and one familiar with coding. The results show that Guardy has the potential to help children learn programming and practice computational thinking.
When dancers are granted agency over music, as in interactive dance systems, the actors are most often concerned with the problem of creating a staged performance for an audience. However, as is reflected by the above quote, the practice of Argentine tango social dance is most concerned with participants' internal experience and their relationship to the broader tango community. In this dissertation I explore creative approaches to enrich the sense of connection, that is, the experience of oneness with a partner and complete immersion in music and dance, for Argentine tango dancers by providing agency over musical activities through the use of interactive technology. Specifically, I create an interactive dance system that allows tango dancers to affect and create music via their movements in the context of social dance. The motivations for this work are multifold: 1) to intensify the embodied experience of the interplay between dance and music, individual and partner, couple and community, 2) to create a shared experience of the conventions of tango dance, and 3) to innovate Argentine tango social dance practice for the purposes of education and increasing musicality in dancers.
The performance of most visual computing tasks depends on the quality of the features extracted from the raw data. An insightful feature representation increases the performance of many learning algorithms by exposing the underlying explanatory factors of the output for the unobserved input. A good representation should also handle anomalies in the data, such as missing samples and noisy input caused by undesired external factors of variation. It should also reduce data redundancy. Over the years, many feature extraction processes have been invented to produce good representations of raw images and videos.
The feature extraction processes can be categorized into three groups. The first group contains processes that are hand-crafted for a specific task. Hand-engineering features requires the knowledge of domain experts and manual labor. However, the feature extraction process is interpretable and explainable. The next group contains the latent-feature extraction processes. While the original feature lies in a high-dimensional space, the relevant factors for a task often lie on a lower-dimensional manifold. Latent-feature extraction employs hidden variables to expose the underlying data properties that cannot be directly measured from the input. Latent-feature methods impose a specific structure, such as sparsity or low rank, on the derived representation through sophisticated optimization techniques. The last category is that of deep features. These are obtained by passing raw input data with minimal pre-processing through a deep network, whose parameters are computed by iteratively minimizing a task-based loss.
In this dissertation, I present four pieces of work where I create and learn suitable data representations. The first task employs hand-crafted features to perform clinically relevant retrieval of diabetic retinopathy images. The second task uses latent features to perform content-adaptive image enhancement. The third task ranks a pair of images based on their aesthetics. The goal of the last task is to capture localized image artifacts in small datasets with patch-level labels. For both of these last two tasks, I propose novel deep architectures and show significant improvement over the previous state-of-the-art approaches. A suitable combination of feature representations augmented with an appropriate learning approach can increase performance for most visual computing tasks.
Computer Vision as a field has gone through significant changes in the last decade. The field has seen tremendous success in designing learning systems with hand-crafted features and in using representation learning to extract better features. In this dissertation some novel approaches to representation learning and task learning are studied.
Multiple-instance learning, which is a generalization of supervised learning, is one example of task learning that is discussed. In particular, a novel non-parametric k-NN-based multiple-instance learning method is proposed, which is shown to outperform other existing approaches. This solution is applied effectively to a diabetic retinopathy pathology detection problem.
In the case of representation learning, the generality of neural features is investigated first. This investigation leads to some critical understanding and results on feature generality among datasets. The possibility of learning from a mentor network instead of from labels is then investigated. Distillation of dark knowledge is used to efficiently mentor a small network from a pre-trained large mentor network. These studies help in understanding representation learning with smaller and compressed networks.
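As a hedged illustration of learning from a mentor network's outputs rather than from labels, the sketch below computes a temperature-softened softmax and the cross-entropy between the mentor's soft targets and the small network's predictions. This is the commonly used form of knowledge distillation; the dissertation's actual mentoring scheme may differ (for example, by also matching intermediate representations), and the function names and temperature value are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; a higher temperature flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, mentor_logits, temperature=4.0):
    """Cross-entropy between softened mentor outputs and softened student outputs."""
    soft_targets = softmax(mentor_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(soft_targets, student_probs))


# Hypothetical logits for a 3-class problem.
mentor = [6.0, 2.0, -1.0]
student = [3.0, 2.5, 0.0]
print(softmax(mentor, 1.0))  # sharp distribution
print(softmax(mentor, 4.0))  # softened: relative class similarities become visible
print(distillation_loss(student, mentor))
```

The softened targets carry the mentor's relative confidence in the wrong classes, which is the "dark knowledge" that hard labels discard.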