Matching Items (40)
Description

Facial Expression Recognition using Convolutional Neural Networks has been actively researched over the last decade due to its many applications in the human-computer interaction domain. Because Convolutional Neural Networks have an exceptional ability to learn, they outperform methods that rely on handcrafted features. Although state-of-the-art models achieve high accuracy on lab-controlled images, they still struggle with expressions captured in the wild. Wild expressions are recorded in real-world settings and are therefore natural, and in-the-wild databases pose many challenges such as occlusion and variations in lighting conditions and head poses. In this work, I address these challenges and propose a new model consisting of a hybrid Convolutional Neural Network with a Fusion Layer. The Fusion Layer combines knowledge obtained from two different domains for enhanced feature extraction from in-the-wild images. I tested my network on two publicly available in-the-wild datasets, RAF-DB and AffectNet, and then evaluated the trained model on the CK+ dataset in a cross-database study. I show that my model achieves results comparable to state-of-the-art methods and argue that it performs well on such datasets because it learns features from two different domains rather than a single domain. Finally, I present a real-time facial expression recognition system in which images are captured with a laptop camera and passed to the model to obtain an expression label, indicating that the proposed model has low processing time and can produce output almost instantly.
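The abstract does not detail the network, but a minimal PyTorch sketch of the general idea (two convolutional branches whose features are merged by a fusion layer before classification) is given below; the layer sizes, the reuse of one input for both branches, and the seven-class output are illustrative assumptions, not the thesis architecture.

import torch
import torch.nn as nn

class FusionFER(nn.Module):
    """Illustrative two-branch CNN with a fusion layer (not the thesis model)."""
    def __init__(self, num_classes=7):
        super().__init__()
        # Branch A: e.g., features learned from in-the-wild face images (assumed)
        self.branch_a = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        # Branch B: e.g., features transferred from a second domain (assumed)
        self.branch_b = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        # Fusion layer: concatenate both feature vectors and project them jointly
        self.fusion = nn.Sequential(nn.Linear(64 + 64, 128), nn.ReLU())
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        fa = self.branch_a(x).flatten(1)
        fb = self.branch_b(x).flatten(1)
        fused = self.fusion(torch.cat([fa, fb], dim=1))
        return self.classifier(fused)

logits = FusionFER()(torch.randn(1, 3, 224, 224))  # -> shape (1, 7)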
Contributors
Chhabra, Sachin (Author) / Li, Baoxin (Thesis advisor) / Venkateswara, Hemanth (Committee member) / Srivastava, Siddharth (Committee member) / Arizona State University (Publisher)
Created
2019
Description

A video clip carries not merely an aggregation of static entities but also a variety of interactions and relations among those entities. Challenges remain for a video captioning system to generate natural-language descriptions that focus on the prominent interest and align with latent aspects beyond direct observation. This work presents a Commonsense knowledge Anchored Video cAptioNing (CAVAN) approach. CAVAN exploits inferential commonsense knowledge to assist the training of a video captioning model through a novel paradigm for sentence-level semantic alignment. Specifically, each training caption is complemented with commonsense knowledge queried from the generic knowledge atlas ATOMIC, forming a commonsense-caption entailment corpus. A BERT-based language entailment model trained on this corpus then serves as a commonsense discriminator during the training of the video captioning model, penalizing it for generating semantically misaligned captions. Extensive empirical evaluations on the MSR-VTT, V2C and VATEX datasets show that CAVAN consistently improves the quality of generated captions and achieves a higher keyword hit rate. Ablation results validate the effectiveness of CAVAN and reveal that the use of commonsense knowledge contributes to video caption generation.
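A highly simplified sketch of the training-time idea described above: an entailment probability from a commonsense discriminator is turned into a penalty added to the usual captioning cross-entropy. The interface and weighting below are hypothetical, not CAVAN's actual implementation.

import torch
import torch.nn.functional as F

def caption_loss_with_commonsense(caption_logits, target_ids, entail_prob, alpha=0.5):
    """Cross-entropy captioning loss plus a penalty when the commonsense
    discriminator judges the generated caption unlikely to be entailed.

    caption_logits: (batch, seq_len, vocab) scores from the captioning model
    target_ids:     (batch, seq_len) ground-truth token ids
    entail_prob:    (batch,) probability from a BERT-style entailment model that
                    the caption is consistent with the queried commonsense
                    (hypothetical interface)
    """
    ce = F.cross_entropy(caption_logits.transpose(1, 2), target_ids)
    # Penalize semantically misaligned captions: low entailment -> high penalty
    penalty = -torch.log(entail_prob.clamp(min=1e-6)).mean()
    return ce + alpha * penalty

# toy usage
logits = torch.randn(2, 10, 1000)
targets = torch.randint(0, 1000, (2, 10))
entail = torch.tensor([0.9, 0.4])
loss = caption_loss_with_commonsense(logits, targets, entail)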
Contributors
Shao, Huiliang (Author) / Yang, Yezhou (Thesis advisor) / Jayasuriya, Suren (Committee member) / Xiao, Chaowei (Committee member) / Arizona State University (Publisher)
Created
2022
Description

The impact of Artificial Intelligence (AI) on daily life has increased significantly. AI is making strides into critical areas such as healthcare, as well as into entertainment and leisure, and deep neural networks have been pivotal in making these advancements possible. However, a well-known problem with deep neural networks is the lack of explanations for the choices they make. To address this, several methods have been proposed; one example is ranking individual features by how influential they are in the decision-making process. In contrast, a newer class of methods focuses on Concept Activation Vectors (CAVs), which extract higher-level concepts from the trained model and thus capture information as a mixture of several features rather than a single one. The goal of this thesis is to employ concepts in a novel domain: explaining how a deep learning model uses computer vision to classify music into genres. Given the advances in deep-learning-based image classification, it is now standard practice to convert an audio clip into a spectrogram and use that spectrogram as the image input to the model; a pre-trained model can then classify the spectrogram images (representing songs) into musical genres. The proposed explanation system, “Why Pop?”, answers questions about the classification process such as which parts of the spectrogram influence the model the most, what concepts were extracted, and how they differ across classes. These explanations help the user gain insights into the model’s learning, biases, and decision-making process.
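A minimal sketch of the audio-to-spectrogram step described above, using librosa; the sampling rate, mel-band count, and normalization are assumptions rather than the thesis's exact preprocessing.

import numpy as np
import librosa

def audio_to_mel_spectrogram(path, sr=22050, n_mels=128):
    """Convert an audio clip into a log-scaled mel spectrogram 'image'."""
    y, sr = librosa.load(path, sr=sr, mono=True)            # load the clip
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)            # log scale for contrast
    # Normalize to [0, 1] so it can be treated as a grayscale image input
    return (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min() + 1e-8)

# spec = audio_to_mel_spectrogram("song_clip.wav")  # shape: (n_mels, time_frames)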
Contributors
Sharma, Shubham (Author) / Bryan, Chris (Thesis advisor) / McDaniel, Troy (Committee member) / Sarwat, Mohamed (Committee member) / Arizona State University (Publisher)
Created
2022
Description

Automated driving systems (ADS) have come a long way since their inception. These systems rely heavily on stochastic deep learning techniques for perception, planning, and prediction, as it is impossible to construct every possible driving scenario to generate driving policies. Moreover, they need to be trained and validated extensively on typical and abnormal driving situations before they can be trusted with human life. However, most publicly available driving datasets contain only typical driving behaviors. On the other hand, a plethora of videos available on the internet capture abnormal driving scenarios, but they are unusable for ADS training or testing because they lack important information such as camera calibration parameters and annotated vehicle trajectories. This thesis proposes a new toolbox, DeepCrashTest-V2, that is capable of reconstructing high-quality simulations from monocular dashcam videos found on the internet. The toolbox not only estimates crucial parameters such as camera calibration, ego-motion, and surrounding road-user trajectories, but also creates a virtual world in Car Learning to Act (CARLA) using data from OpenStreetMap to simulate the estimated trajectories. The toolbox is open source and available as a Python package on GitHub at https://github.com/C-Aniruddh/deepcrashtest_v2.
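For illustration of the ego-motion component only, a classical ORB feature-matching plus essential-matrix baseline in OpenCV is sketched below; this is not DeepCrashTest-V2's estimator (see the GitHub repository for the actual toolbox), and the intrinsic matrix K is assumed to be already known.

import cv2
import numpy as np

def relative_pose(frame1, frame2, K):
    """Estimate relative camera rotation/translation between two dashcam frames.

    A classical ORB + essential-matrix baseline, shown only to illustrate the
    ego-motion step; the toolbox's own estimator may differ. K is the 3x3
    camera intrinsic matrix (itself a quantity the toolbox estimates).
    """
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(frame1, None)
    k2, d2 = orb.detectAndCompute(frame2, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([k2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # rotation matrix and unit translation direction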
Contributors
Chandratre, Aniruddh Vinay (Author) / Fainekos, Georgios (Thesis advisor) / Ben Amor, Hani (Thesis advisor) / Pedrielli, Giulia (Committee member) / Arizona State University (Publisher)
Created
2022
Description

There are relatively few available construction equipment detector models that use deep learning architectures; many of those that exist use older object detection architectures such as plain Convolutional Neural Networks (CNNs), Region-Based Convolutional Neural Networks (R-CNNs), and the first version of You Only Look Once (YOLO). It can be challenging to deploy these models in practice for tracking construction equipment while working on site. This thesis provides a clear guide on how to train and evaluate deep learning models for detecting construction equipment on site, using two YOLO architectures, YOLOv5s and YOLOR, to detect three classes of construction equipment: excavators, dump trucks, and loaders. The thesis also provides a simple solution for deploying the trained models. Additionally, it describes a specialized, high-quality dataset of three thousand pictures created to train these models on real data, considering a typical worksite scene and various motions, perspectives, and angles of construction equipment on site. The results show that after 150 epochs of training, YOLOR-P6 achieves the best mAP at 0.981, while the YOLOv5s mAP is 0.936. However, YOLOv5s had the fastest and shortest training time on a Tesla P100 GPU as the processing unit in a Google Colab notebook: YOLOv5s needed 4 hours and 52 minutes, whereas YOLOR-P6 needed 14 hours and 35 minutes to finish training. The final findings of this study show that YOLOv5s is the most efficient model for building an artificial intelligence model to detect construction equipment because of the size of its weights file relative to other YOLO versions (14.4 MB for YOLOv5s vs. 288 MB for YOLOR-P6), which greatly affects the performance of the processing unit used to predict construction equipment on site. In addition, the constructed database is published as a public dataset on the Roboflow platform, which can serve as a foundation for future research and for improvement with newer deep learning architectures.
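As a hedged illustration of the deployment step mentioned above, a custom YOLOv5 checkpoint can be loaded through torch.hub for on-site inference; the checkpoint path, confidence threshold, and image name below are assumptions, not the thesis's released artifacts.

import torch

# Load a custom-trained YOLOv5 checkpoint via torch.hub (the 'best.pt' path is
# an assumption for illustration, not the thesis's released weights).
model = torch.hub.load('ultralytics/yolov5', 'custom', path='best.pt')
model.conf = 0.4  # confidence threshold (assumed value)

results = model('site_camera_frame.jpg')   # run detection on one site image
results.print()                            # e.g. counts per class
detections = results.pandas().xyxy[0]      # bounding boxes as a DataFrame
print(detections[['name', 'confidence', 'xmin', 'ymin', 'xmax', 'ymax']])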
Contributors
Sabek, Mohamed Mamdooh (Author) / Parrish, Kristen (Thesis advisor) / Czerniawski, Thomas (Committee member) / Ayer, Steven K (Committee member) / Arizona State University (Publisher)
Created
2022
Description

Artificial Intelligence, one of the hottest research topics today, is mostly driven by data; there is no doubt that data is king in the age of AI. However, natural high-quality data is precious and rare, and obtaining enough eligible data to support AI tasks almost always requires data processing. To make matters worse, data preprocessing tasks are often dull and heavy, requiring substantial human labor; statistics show that 70%-80% of data scientists' time is spent on the data integration process. Among various causes, schema changes, which commonly occur in data warehouses, are one significant obstacle that impedes the automation of the end-to-end data integration process. Traditional data integration applications rely on data processing operators such as join, union, and aggregation. These operations are fragile and easily interrupted by schema changes; whenever schema changes happen, the applications require human effort to resolve the interruptions and downtime. Industry and data scientists alike need a new mechanism for handling schema changes in data integration tasks. This work proposes a new direction for data integration applications based on deep learning models. The data integration problem is defined in the scenario of integrating tabular data with natural schema changes, using a cell-based data abstraction. In addition, data augmentation and adversarial learning are investigated to boost the model's robustness to schema changes. The approach is tested on two real-world data integration scenarios, and the results demonstrate its effectiveness.
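The abstract does not spell out the cell-based abstraction; one common form of it, shown below as an illustrative sketch rather than the thesis's exact design, turns every table cell into a (table, row, column name, value) record so that column reordering or additions do not break downstream processing.

import pandas as pd

def table_to_cells(df: pd.DataFrame, table_id: str):
    """Flatten a table into cell records: (table, row, column name, value).

    A cell-level view like this is one way to stay robust to schema changes
    such as column reordering or added columns (illustrative only).
    """
    cells = []
    for row_idx, row in df.iterrows():
        for col in df.columns:
            cells.append({"table": table_id, "row": row_idx,
                          "column": str(col), "value": row[col]})
    return cells

old = pd.DataFrame({"name": ["a"], "price": [10]})
new = pd.DataFrame({"price": [10], "name": ["a"], "currency": ["USD"]})  # schema changed
print(table_to_cells(old, "t1"))
print(table_to_cells(new, "t1"))  # original cells survive; new column adds cells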
Contributors
Wang, Zijie (Author) / Zou, Jia (Thesis advisor) / Baral, Chitta (Committee member) / Candan, K. Selcuk (Committee member) / Arizona State University (Publisher)
Created
2021
Description

Machine learning (ML) and deep learning (DL) have become an intrinsic part of multiple fields, and their ability to solve complex problems makes machine learning seem like a panacea. In the last few years there has been an explosion of data generation, which has greatly improved machine learning models, but this comes at the cost of high computation, which invariably increases power usage and hardware cost. In this thesis I explore applications of ML techniques in two completely different fields, arts, media and theater on the one hand and urban climate research on the other, using low-cost, low-power edge devices. The multi-modal chatbot uses different machine learning techniques, natural language processing (NLP) and computer vision (CV), to understand the user's inputs and accordingly perform in the play and interact with the audience. The system is also equipped with other interactive hardware such as movable LED systems; together they provide an experiential theatrical play tailored to each user. I discuss how I used edge devices to build this AI system, which has created a new genre of theatrical play. I then discuss MaRTiny, an AI-based bio-meteorological system that calculates mean radiant temperature (MRT), an important parameter for urban climate research. It is also equipped with a vision system that performs machine learning tasks such as pedestrian and shade detection. The entire system costs around $200 and can potentially replace an existing setup worth $20,000. I further discuss how I used machine learning methods to overcome the inaccuracies in the MRT values produced by the system. Although these projects belong to two very different fields, both are implemented on edge devices and use similar ML techniques. In this thesis I detail the techniques shared between the two projects and how they can be used in several other applications on edge devices.
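The abstract does not specify which detector runs on the edge device; as a lightweight illustrative baseline for the pedestrian-detection task, OpenCV's classical HOG person detector could be used as follows (the actual MaRTiny vision system may use a different model, and the image name is an assumption).

import cv2

# Lightweight pedestrian detection suitable for a low-power edge device.
# Illustrative baseline only, not necessarily the detector MaRTiny uses.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

frame = cv2.imread("street_view.jpg")
frame = cv2.resize(frame, (640, 480))          # downscale to keep inference cheap
boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)

for (x, y, w, h) in boxes:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
print(f"Detected {len(boxes)} pedestrians")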
Contributors
Kulkarni, Karthik Kashinath (Author) / Jayasuriya, Suren (Thesis advisor) / Middel, Ariane (Thesis advisor) / Yu, Hongbin (Committee member) / Arizona State University (Publisher)
Created
2021
Description

This thesis work presents two separate studies.

The first study assesses standing balance under various two-dimensional (2D) compliant environments, simulated using a dual-axis robotic platform, and under different vision conditions. Directional virtual time-to-contact (VTC) measures were introduced to better characterize postural balance from both temporal and spatial aspects and to enable prediction of fall-relevant directions. Twenty healthy young adults were recruited to perform quiet standing tasks on the platform. Conventional stability measures, namely center-of-pressure (COP) path length and COP area, were also adopted for comparison with the proposed VTC. The results indicated that postural balance was adversely impacted, evidenced by significant decreases in VTC and increases in COP path length and area, as ground compliance increased and/or in the absence of vision (ps < 0.001). Interaction effects between environment and vision were observed in the VTC and COP path length measures (ps ≤ 0.05), but not in COP area (p = 0.103). The estimated likelihood of falls in the anterior-posterior (AP) and medio-lateral (ML) directions converged to nearly 50% (almost independent of the foot setting) as the experimental condition became significantly more challenging.

The second study introduces a deep learning approach using a convolutional neural network (CNN) to predict the environment from instantaneous observations of sway during balance tasks. COP data were collected from fourteen subjects while standing on the 2D compliant environments. Different window sizes for data segmentation were examined to identify the minimal length needed for reliable prediction, and commonly used machine learning models were tested to compare their effectiveness with that of the presented CNN model. The CNN achieved above 94.5% overall prediction accuracy even with 2.5-second windows, which the traditional machine learning models could not achieve (ps < 0.05). Increasing the data length beyond 2.5 seconds slightly improved the accuracy of the CNN but substantially increased training time (60% longer). Importantly, averaged normalized confusion matrices revealed that the CNN is much more capable of differentiating the mid-level environmental condition. Together, these two studies provide new perspectives on human postural balance that cannot be obtained from conventional stability analyses, and their outcomes contribute to the advancement of human-interactive robots and devices for fall prevention and rehabilitation.
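As a sketch of one conventional measure mentioned above (COP path length, the cumulative distance traveled by the center of pressure) and of the fixed-length windowing used for the CNN input, a minimal NumPy implementation follows; the sampling rate and window handling are assumptions.

import numpy as np

def cop_path_length(cop_xy: np.ndarray) -> float:
    """Total distance traveled by the center of pressure.

    cop_xy: array of shape (n_samples, 2) with AP/ML coordinates (e.g., in mm).
    """
    steps = np.diff(cop_xy, axis=0)                    # per-sample displacement
    return float(np.sum(np.linalg.norm(steps, axis=1)))

def segment_windows(cop_xy: np.ndarray, fs: float, window_s: float = 2.5):
    """Split a COP trace into fixed-length windows (e.g., 2.5 s) for classification."""
    n = int(window_s * fs)
    return [cop_xy[i:i + n] for i in range(0, len(cop_xy) - n + 1, n)]

# toy usage: 30 s of synthetic COP data sampled at 100 Hz
cop = np.cumsum(np.random.randn(3000, 2), axis=0) * 0.1
print(cop_path_length(cop), len(segment_windows(cop, fs=100)))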
Contributors
Phan, Vu Nguyen (Author) / Lee, Hyunglae (Thesis advisor) / Peterson, Daniel (Committee member) / Marvi, Hamidreza (Committee member) / Arizona State University (Publisher)
Created
2021
Description

In the realm of medical diagnostics, achieving heightened accuracy is paramount. AI models are therefore meticulously refined through expert-guided tuning, which bolsters precision by ensuring adaptability to complex datasets and optimizing outcomes across healthcare sectors. By incorporating expert knowledge into the fine-tuning process, these advanced models become proficient at navigating the intricacies of medical data, yielding more precise and dependable diagnostic predictions. As healthcare practitioners grapple with conditions requiring heightened sensitivity, such as cardiovascular disease and conditions monitored through continuous blood glucose measurement, such nuanced refinement of Transformer models becomes indispensable. Temporal data, a common feature in medical diagnostics, is characterized by sequential observations over time and presents unique challenges for Transformer models, which must capture intricate temporal dependencies and complex patterns effectively. This study delves into two pivotal healthcare scenarios: the detection of Coronary Artery Disease (CAD) using stress ECGs and the identification of psychological stress using Continuous Glucose Monitoring (CGM) data. The CAD dataset was obtained from the Mayo Clinic Integrated Stress Center (MISC) database, which encompassed 100,000 exercise stress ECG signals (n=1200) sourced from multiple Mayo Clinic facilities. For the CGM scenario, expert knowledge was utilized to generate synthetic data using the Bergman minimal model, which was then fed to the Transformers for classification. Implementation in the CAD example yielded a remarkable 28% improvement in Positive Predictive Value (PPV) over the current state of the art, reaching 91.2%. This significant enhancement demonstrates the efficacy of the approach in improving diagnostic accuracy and underscores the transformative impact of expert-guided fine-tuning in medical diagnostics.
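The abstract mentions generating synthetic CGM data with the Bergman minimal model; a sketch of the textbook two-equation form of that model with simple Euler integration is given below. The parameter values, initial condition, and constant insulin input are illustrative assumptions, not the thesis's expert-derived settings.

import numpy as np

def bergman_minimal_model(t_end=180.0, dt=1.0, Gb=90.0, Ib=15.0,
                          p1=0.028, p2=0.025, p3=1.3e-5, insulin=None):
    """Simulate plasma glucose G(t) with the textbook Bergman minimal model.

    dG/dt = -(p1 + X) * G + p1 * Gb
    dX/dt = -p2 * X + p3 * (I - Ib)

    Standard form of the model, used only to illustrate synthetic glucose
    generation; parameters here are illustrative, not the thesis's settings.
    """
    n = int(t_end / dt)
    G, X = np.empty(n), np.empty(n)
    G[0], X[0] = Gb + 60.0, 0.0                  # e.g., start after a glucose bolus
    I = insulin if insulin is not None else np.full(n, Ib + 10.0)  # assumed insulin input
    for k in range(n - 1):                        # simple Euler integration
        dG = -(p1 + X[k]) * G[k] + p1 * Gb
        dX = -p2 * X[k] + p3 * (I[k] - Ib)
        G[k + 1] = G[k] + dt * dG
        X[k + 1] = X[k] + dt * dX
    return np.arange(n) * dt, G

t, glucose = bergman_minimal_model()
print(glucose[:5])  # synthetic CGM-like trace (mg/dL), minute resolution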
Contributors
Salian, Riya (Author) / Gupta, Sandeep (Thesis advisor) / Banerjee, Ayan (Committee member) / Komandoor, Srivathsan (Committee member) / Arizona State University (Publisher)
Created
2024
Description

SLAM (Simultaneous Localization and Mapping) is a long-standing problem in robotics and autonomous navigation. The objective of SLAM is for a robot to simultaneously determine its position in space and map its environment, which makes it essential for robots that must navigate autonomously. The description might make it seem like a chicken-and-egg problem, yet numerous methods have been proposed to tackle SLAM. Before the rise in popularity of deep learning and AI (Artificial Intelligence), most existing approaches relied on traditional hand-coded algorithms that received and processed sensor information and converted it into a solvable, sensor-agnostic problem. The challenge for these methods is handling dynamic environments: the more variety in the environment, the poorer the results. With the increase in computational power and the capability of deep-learning-based image processing, visual SLAM has become highly viable and perhaps even preferable to traditional SLAM algorithms. In this research, a deep-learning-based solution to the SLAM problem is proposed, specifically monocular visual SLAM, which solves the problem using only a single camera as input, and the model is tested on the KITTI (Karlsruhe Institute of Technology & Toyota Technological Institute) odometry dataset.
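For context on the evaluation data, KITTI odometry ground-truth poses are stored as lines of 12 floats forming a 3x4 [R|t] matrix; a minimal loader and trajectory-length helper are sketched below, with the file path as an assumption (this is generic dataset handling, not the thesis's SLAM model).

import numpy as np

def load_kitti_poses(pose_file: str) -> np.ndarray:
    """Load KITTI odometry ground-truth poses.

    Each line holds 12 floats: the 3x4 matrix [R | t] mapping the current
    camera frame to the first frame's coordinate system.
    Returns an array of shape (n_frames, 3, 4).
    """
    poses = []
    with open(pose_file) as f:
        for line in f:
            if not line.strip():
                continue
            vals = np.array(line.split(), dtype=float)
            poses.append(vals.reshape(3, 4))
    return np.stack(poses)

def trajectory_length(poses: np.ndarray) -> float:
    """Total distance traveled by the camera, from the translation columns."""
    t = poses[:, :, 3]                              # (n_frames, 3) positions
    return float(np.sum(np.linalg.norm(np.diff(t, axis=0), axis=1)))

# poses = load_kitti_poses("dataset/poses/00.txt")  # path is an assumption
# print(trajectory_length(poses))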
Contributors
Rupaakula, Krishna Sandeep (Author) / Bansal, Ajay (Thesis advisor) / Baron, Tyler (Committee member) / Acuna, Ruben (Committee member) / Arizona State University (Publisher)
Created
2023