Matching Items (16)
Description
The volume of scientific research has grown at an exponential rate over the past 100 years. With the advent of the internet and ubiquitous access to the web, academic search engines such as Google Scholar and Microsoft Academic have become the go-to platforms for systematic reviews and literature search. Although many academic search engines host vast amounts of content, they provide minimal context about where the search terms matched. Many of these search engines also fail to provide additional tools that could enhance a researcher's understanding of research content outside their respective websites. An example of such a tool is a browser extension/plugin that surfaces context-relevant information about a research article while the user is reading it. This dissertation discusses a solution developed to bring intrinsic characteristics of research documents, such as the document's structure, its tables, and its associated keywords, into search, improving search capabilities and augmenting the information a researcher reads. The prototype solution, named Sci-Genie (https://sci-genie.com/), is a search engine over scientific articles from Computer Science ArXiv. Sci-Genie parses research papers and indexes each document's structure to provide context-relevant information about the matched search fragments. The same search engine also powers a browser extension that augments the information about the research article the user is reading. The browser extension enriches the user's interface with tables from the cited papers, other papers by the same authors, and the citations to and from the current article. The browser extension is further powered by access endpoints that leverage a machine learning model to filter tables comparing various entities. The dissertation further discusses these machine learning models and some baselines that help classify whether or not a table compares various entities. The dissertation concludes by discussing the current shortcomings of Sci-Genie and possible directions for future research based on lessons learned while building it.
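A minimal sketch of the structure-aware indexing idea described above: a toy inverted index that records which section of a paper each term came from, so a match can report its context. The class, paper IDs, and section names are illustrative assumptions only; Sci-Genie's actual indexing stack is not specified here.

```python
# Toy inverted index over (paper_id, section) pairs; illustrative only.
from collections import defaultdict

class SectionIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> {(paper_id, section)}

    def add(self, paper_id, section, text):
        for term in text.lower().split():
            self.postings[term].add((paper_id, section))

    def search(self, query):
        # Group matches by paper, keeping the section where each term hit,
        # which supplies the "where did it match" context described above.
        hits = defaultdict(set)
        for term in query.lower().split():
            for paper_id, section in self.postings.get(term, ()):
                hits[paper_id].add(section)
        return dict(hits)

idx = SectionIndex()
idx.add("2104.00001", "abstract", "table extraction from scientific papers")
idx.add("2104.00001", "section:method", "we parse LaTeX tables into cells")
idx.add("2105.00002", "table-caption", "comparison of extraction baselines")
print(idx.search("table extraction"))
```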
Contributors: Dave, Valay (Author) / Zou, Jia (Thesis advisor) / Ben Amor, Heni (Thesis advisor) / Candan, Kasim Selcuk (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
In this work, I propose a novel unsupervised framework titled SATLAB to label satellite images, given a classification task at hand. Existing models for satellite image classification, such as DeepSAT and DeepSAT-V2, rely on deep learning models that are label-hungry and require a significant amount of training data. Since manual curation of labels is expensive, I ensure that SATLAB requires zero training labels. SATLAB can work in conjunction with several generative and unsupervised machine learning models by allowing them to be seamlessly plugged into its architecture. I devise three operating modes for SATLAB: manual, semi-automatic, and automatic, which require varying levels of human intervention in creating the domain-specific labeling functions for each image that can be utilized by candidate generative models such as Snorkel, as well as by other unsupervised learners in SATLAB. Unlike existing supervised learning baselines, which only extract textural features from satellite images, SATLAB supports the extraction of both textural and geospatial features, and I empirically show that geospatial features enhance the classification F1-score by 33%. I build SATLAB on top of Apache Sedona in order to leverage its rich set of spatial query processing operators for extracting geospatial features from satellite raster images. I evaluate SATLAB on a target binary classification task that distinguishes slum from non-slum areas, over a repository of 100K satellite images captured by the Sentinel satellite program. My 5-fold cross-validation (CV) experiments show that SATLAB achieves competitive F1-scores (0.6) using 0% labeled data, while the best supervised learning baseline achieves a 0.74 F1-score using 80% labeled data. I also show that Snorkel outperforms the alternative generative and unsupervised candidate models that can be plugged into SATLAB by 33% to 71% w.r.t. F1-score and by 3 to 73 times w.r.t. latency. I also show that downstream classifiers trained using the labels generated by SATLAB are comparable in quality (0.63 F1) to their counterpart classifiers (0.74 F1) trained on manually curated labels.
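To illustrate the labeling-function idea in something like SATLAB's manual mode, here is a hedged sketch: the feature names and thresholds are invented for the example, and the simple majority vote stands in for a generative label model such as Snorkel's.

```python
# Illustrative labeling functions over one satellite image tile; in a real
# run these weak votes would feed a generative model (e.g., Snorkel).
SLUM, NON_SLUM, ABSTAIN = 1, 0, -1

def lf_texture_entropy(img):        # textural cue (assumed feature)
    return SLUM if img["texture_entropy"] > 4.5 else ABSTAIN

def lf_road_density(img):           # geospatial cue (e.g., via Apache Sedona)
    return NON_SLUM if img["road_density_km"] > 12.0 else ABSTAIN

def lf_building_irregularity(img):  # geospatial cue (assumed feature)
    return SLUM if img["footprint_irregularity"] > 0.8 else ABSTAIN

def weak_label(img, lfs):
    """Majority vote over non-abstaining LFs; a crude stand-in for a
    generative label model."""
    votes = [v for lf in lfs if (v := lf(img)) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

tile = {"texture_entropy": 5.1, "road_density_km": 3.2,
        "footprint_irregularity": 0.9}
print(weak_label(tile, [lf_texture_entropy, lf_road_density,
                        lf_building_irregularity]))  # -> 1 (SLUM)
```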
Contributors: Aggarwal, Shantanu (Author) / Sarwat, Mohamed (Thesis advisor) / Zou, Jia (Committee member) / Boscovic, Dragan (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
Artificial intelligence, one of the most active research areas today, is largely driven by data. There is no doubt that data is king in the age of AI. However, natural high-quality data is precious and rare. To obtain enough eligible data to support AI tasks, data preprocessing is almost always required. Worse still, data preprocessing tasks are often dull and heavy, requiring substantial human labor. Studies show that 70%-80% of data scientists' time is spent on the data integration process. Among various causes, schema changes, which commonly occur in data warehouses, are one significant obstacle impeding the automation of the end-to-end data integration process. Traditional data integration applications rely on data processing operators such as join, union, and aggregation. These operators are fragile and can easily be broken by schema changes. Whenever schema changes happen, data integration applications require human labor to resolve the resulting interruptions and downtime. Industry practitioners and data scientists alike need a new mechanism for handling schema changes in data integration tasks. This work proposes a new direction for data integration applications based on deep learning models. The data integration problem is defined in the scenario of integrating tabular-format data under natural schema changes, using a cell-based data abstraction. In addition, data augmentation and adversarial learning are investigated to boost the model's robustness to schema changes. The approach is tested on two real-world data integration scenarios, and the results demonstrate its effectiveness.
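A small sketch of the cell-based abstraction under stated assumptions: a table is flattened into (row, header, value) cells, so a schema change such as column reordering or renaming perturbs features instead of breaking a rigid operator. The table layout and column names are hypothetical.

```python
# Flatten a table into a bag of cells; reordering or renaming columns now
# changes feature values rather than breaking a join/union operator.
def to_cells(table):
    """table: dict with 'header' (list[str]) and 'rows' (list[list])."""
    cells = []
    for r, row in enumerate(table["rows"]):
        for header, value in zip(table["header"], row):
            cells.append((r, header.lower(), str(value)))
    return cells

v1 = {"header": ["Name", "Dept"], "rows": [["Ada", "CS"]]}
v2 = {"header": ["Department", "Name"], "rows": [["CS", "Ada"]]}  # schema change

# A join keyed on exact column names fails across v1/v2, but a model learned
# over cell sequences can still align "Dept" with "Department".
print(to_cells(v1))
print(to_cells(v2))
```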
Contributors: Wang, Zijie (Author) / Zou, Jia (Thesis advisor) / Baral, Chitta (Committee member) / Candan, K. Selcuk (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
Demand for processing machine learning workloads has grown tremendously over the past few years. Kubernetes, an open-source container orchestrator, has been widely used by public and private cloud providers to build scalable systems that meet this demand. The data used to train machine learning workloads can be sensitive in nature, and organizations may prefer to remain responsible for their data security and governance by housing it in on-premises systems. Hybrid cloud gives organizations the flexibility to use on-premises and cloud infrastructure together, leveraging the advantages of both. While it offers many benefits, Kubernetes has limitations by design that restrict a user's abilities in a hybrid cloud environment: the Kubernetes control plane does not allow worker nodes to be managed across cloud providers. This boundary puts new responsibilities on end-users deploying hybrid cloud workloads. End-users must create their clusters and specify ahead of time which cluster each workload will be scheduled to, and the Kubernetes scheduler will not take the capacity of another cluster into account. To address these limitations, this thesis presents a new hybrid cloud Kubernetes scheduler that can create new clusters on demand and burst machine learning workloads to a public cloud when on-premises resources are insufficient. Workloads begin scheduling on an on-premises Kubernetes cluster. When the on-premises cluster's capacity is exhausted, a new Kubernetes cluster is created on demand in a public cloud provider, and machine learning tasks waiting in the Kubernetes scheduling queue are dynamically migrated to the public cloud provider's Kubernetes cluster. The public Kubernetes cluster is dynamically sized and autoscaled based on the pending tasks' demand. When migrating tasks, the data dependencies among tasks are considered, and a region is dynamically chosen to reduce migration time and cost. The scheduler is experimentally evaluated with real-world machine learning workloads, including predicting whether a subscriber will stay with a subscription service, predicting the discount needed to retain a subscription customer, and predicting whether a credit card transaction is fraudulent, together with simulated real-world job arrival behavior in a real hybrid cloud environment. Results show that the scheduler can substantially reduce workload execution time by dynamically migrating tasks from on-premises to the public cloud, and minimize cost by dynamically sizing and scaling the public cluster.
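The burst-scheduling policy can be sketched schematically as below; the cluster sizes, pricing figures, and region-selection rule are invented placeholders, not the thesis implementation.

```python
# Schematic burst scheduler; capacities, prices, and names are placeholders.
from dataclasses import dataclass

@dataclass
class Cluster:
    name: str
    capacity_cpus: int
    used_cpus: int = 0

    def fits(self, cpus):
        return self.used_cpus + cpus <= self.capacity_cpus

    def schedule(self, cpus):
        self.used_cpus += cpus

def pick_region(task_data_region, regions):
    # Prefer the region holding the task's data dependencies to cut
    # migration time and cost; otherwise fall back to the cheapest region.
    by_name = {r["name"]: r for r in regions}
    return by_name.get(task_data_region,
                       min(regions, key=lambda r: r["usd_per_cpu_hour"]))

def schedule(task, on_prem, cloud_clusters, regions):
    if on_prem.fits(task["cpus"]):                 # stay on-premises
        on_prem.schedule(task["cpus"])
        return on_prem.name
    region = pick_region(task["data_region"], regions)
    cloud = cloud_clusters.setdefault(             # create cluster on demand
        region["name"], Cluster(f"cloud-{region['name']}", capacity_cpus=64))
    cloud.schedule(task["cpus"])                   # burst the pending task
    return cloud.name

on_prem = Cluster("on-prem", capacity_cpus=8, used_cpus=7)
regions = [{"name": "us-east", "usd_per_cpu_hour": 0.05},
           {"name": "us-west", "usd_per_cpu_hour": 0.04}]
print(schedule({"cpus": 4, "data_region": "us-east"}, on_prem, {}, regions))
```

Data locality drives the region choice here because, as the abstract notes, migration time and cost dominate when a task's dependencies live elsewhere.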
Contributors: Kieley, James (Author) / Zhao, Ming (Thesis advisor) / Huang, Dijiang (Committee member) / Zou, Jia (Committee member) / Arizona State University (Publisher)
Created: 2021
Description
Stress is one of the critical factors in daily life, as it has a profound impact on performance at work and on decision-making processes. With the development of IoT technology, smart wearables can handle diverse operations, including networking and recording biometric signals. It has also become easier for individual users to self-detect stress with recorded data, since these wearables, as well as their accompanying smartphones, now have data processing capability. Edge computing on such devices enables real-time feedback and, in turn, preemptive identification of reactions to stress. This provides an opportunity to prevent the more severe consequences that might result if stress goes unaddressed. From a system perspective, leveraging edge computing saves resources such as network bandwidth and reduces latency, since data is processed in proximity to its source. It can also strengthen privacy by performing stress prediction on local devices without transferring personal information to the public cloud. This thesis presents a framework for real-time stress prediction using Fitbit and machine learning, with support from cloud computing. Fitbit is a wearable tracker that records biometric measurements using optical sensors on the wrist; it also provides developers with platforms to design custom applications. I developed an application for the Fitbit and the user's accompanying mobile device to collect heart rate fluctuations and corresponding stress levels entered by users. I also established a dataset collected from police cadets during their academy training program. Machine learning classifiers for stress prediction are built in the cloud using classic models and TensorFlow. Lastly, the classifiers are optimized using model compression techniques for deployment on smartphones, and I analyze how efficiently stress prediction can be performed on the edge.
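A toy sketch of the on-device pipeline: windowed heart-rate features feeding a tiny stand-in classifier. The feature set, weights, and thresholds are assumptions for illustration; the thesis trains real classifiers (classic models and TensorFlow) in the cloud and compresses them for the phone.

```python
# Windowed heart-rate features plus a placeholder rule standing in for the
# trained, compressed classifier that would run on the edge device.
from statistics import mean, pstdev

def hr_features(window_bpm):
    """Summarize one window of heart-rate samples."""
    diffs = [abs(b - a) for a, b in zip(window_bpm, window_bpm[1:])]
    return {"hr_mean": mean(window_bpm),
            "hr_std": pstdev(window_bpm),
            "hr_jitter": mean(diffs) if diffs else 0.0}

def predict_stress(feats):
    # Invented linear score; a real deployment would load a compressed model.
    score = 0.02 * (feats["hr_mean"] - 70) + 0.5 * feats["hr_jitter"]
    return "stressed" if score > 1.0 else "calm"

window = [88, 92, 95, 101, 99, 104, 108]   # one window of samples (bpm)
print(predict_stress(hr_features(window)))
```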
Contributors: Sim, Sang-Hun (Author) / Zhao, Ming (Thesis advisor) / Roberts, Nicole (Committee member) / Zou, Jia (Committee member) / Arizona State University (Publisher)
Created: 2022
Description
Machine learning models are increasingly employed by smart devices on the edge to support important applications such as real-time virtual assistants and privacy-preserving healthcare. However, deploying state-of-the-art (SOTA) deep learning models on devices faces multiple serious challenges. First, it is infeasible to deploy large models on resource-constrained edge devices, whereas small models cannot achieve SOTA accuracy. Second, it is difficult to customize models according to diverse application requirements in accuracy and speed and the diverse capabilities of edge devices. This study proposes several novel solutions to comprehensively address these challenges through automated and improved model compression. First, it introduces Automatic Attention Pruning (AAP), an adaptive, attention-based pruning approach that automatically reduces model parameters while meeting diverse user objectives in model size, speed, and accuracy. AAP achieves an impressive 92.72% parameter reduction in ResNet-101 on Tiny-ImageNet without causing any accuracy loss. Second, it presents Self-Supervised Quantization-Aware Knowledge Distillation (SQAKD), a framework for reducing model precision without supervision from labeled training data. For example, SQAKD quantizes VGG-8 to 2 bits on CIFAR-10 without any accuracy loss. Finally, the study explores two further works, the Contrastive Knowledge Distillation Framework (CKDF) and Log-Curriculum based Module Replacing (LCMR), for further improving the performance of small models. All the works proposed in this study are designed to address real-world challenges and have been successfully deployed on diverse hardware platforms, including cloud instances and edge devices, catalyzing AI for the edge.
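As a rough illustration of pruning toward a user-specified budget, the sketch below zeroes the smallest-magnitude weights until a target keep-ratio is met. Plain magnitude scores are a stand-in assumption here; AAP itself ranks structures by attention-based importance and adapts to user objectives.

```python
# Prune the smallest-magnitude entries until only `keep_ratio` of the
# parameters remain; a simplified stand-in for importance-based pruning.
import numpy as np

def prune_to_budget(weights, keep_ratio):
    flat = np.abs(weights).ravel()
    k = int(len(flat) * (1.0 - keep_ratio))  # number of entries to drop
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    mask = np.abs(weights) > threshold            # keep strictly above it
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8))
pruned = prune_to_budget(w, keep_ratio=0.25)      # aim for 75% sparsity
print(f"kept {np.count_nonzero(pruned)} / {w.size} parameters")
```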
Contributors: Zhao, Kaiqi (Author) / Zhao, Ming (Thesis advisor) / Li, Baoxin (Committee member) / Zou, Jia (Committee member) / Yang, Yingzhen (Committee member) / Arizona State University (Publisher)
Created: 2024
Description
Vision Transformers (ViTs) achieve state-of-the-art performance on image classification tasks. However, their massive size makes them unsuitable for edge devices. Unlike for CNNs, limited research has been conducted on the compression of ViTs. This thesis proposes the "adjoined training technique" to compress any transformer-based architecture. The resulting architecture, Adjoined Vision Transformer (AN-ViT), achieves state-of-the-art performance on the ImageNet classification task. With Swin Transformer as the base network, AN-ViT achieves similar accuracy (within 0.15%) with 4.1x fewer parameters and 5.5x fewer floating point operations (FLOPs). This work further proposes Differentiable Adjoined ViT (DAN-ViT), which uses neural architecture search to find the hyper-parameters of the model. DAN-ViT outperforms current state-of-the-art methods, including Swin Transformers, by about 0.07% and achieves 85.27% top-1 accuracy on the ImageNet dataset while using 2.2x fewer parameters and 2.2x fewer FLOPs.
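A hedged toy analogue of joint ("adjoined") training: one forward pass supervises both a full path and a thin path that reuses a slice of a shared trunk, so a small model can be extracted afterwards. The architecture, dimensions, and loss weighting here are assumptions for illustration, not the thesis's exact formulation.

```python
# Joint training of a full head and a thin head over a shared trunk.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyAdjoinedNet(nn.Module):
    """Full and thin heads share a trunk; both are trained jointly."""
    def __init__(self, d_in=32, d_hidden=64, n_classes=10):
        super().__init__()
        self.trunk = nn.Linear(d_in, d_hidden)
        self.full_head = nn.Linear(d_hidden, n_classes)
        self.thin_head = nn.Linear(d_hidden // 4, n_classes)  # compressed path

    def forward(self, x):
        h = F.relu(self.trunk(x))
        full_logits = self.full_head(h)
        thin_logits = self.thin_head(h[:, : h.shape[1] // 4])  # trunk slice
        return full_logits, thin_logits

model = TinyAdjoinedNet()
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
full_logits, thin_logits = model(x)
loss = F.cross_entropy(full_logits, y) + F.cross_entropy(thin_logits, y)
loss.backward()   # one backward pass updates both paths jointly
print(float(loss))
```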
Contributors: Goel, Rajeev (Author) / Yang, Yingzhen (Thesis advisor) / Yang, Yezhou (Committee member) / Zou, Jia (Committee member) / Arizona State University (Publisher)
Created: 2023
Description
In contemporary society, the proliferation of fake identity documents presents a profound menace that permeates various facets of the social fabric. The advent of artificial intelligence, coupled with sophisticated printing techniques, has significantly exacerbated this issue. The ramifications of counterfeit identity documents extend far beyond the legal infractions and financial losses incurred by victims of identity theft, because they pose a severe threat to public safety, national security, and societal trust. Given these multifaceted threats, the imperative to detect and thwart fraudulent identity documents has become paramount. The efficacy of fraud detection tools is contingent upon the availability of extensive identity document datasets for training purposes. However, existing benchmark datasets such as MIDV-500, MIDV-2020, and FMIDV exhibit notable deficiencies: a limited number of samples, insufficient coverage of various fraud patterns, and occasional alterations in critical personal identifier fields, particularly portrait images. These limitations constrain their effectiveness in training models capable of detecting realistic fraud instances while also safeguarding privacy. This thesis addresses this gap by proposing a streamlined pipeline for generating synthetic identity documents and introducing the resultant benchmark dataset, named IDNet. IDNet is meticulously crafted to propel advancements in privacy-preserving fraud detection and comprises 597,900 images of synthetically generated identity documents, amounting to approximately 350 gigabytes of data. The documents are categorized into 20 types, encompassing identity documents from 10 U.S. states and 10 European countries. Additionally, the dataset includes identity documents containing either a single fraud pattern or multiple fraud patterns, to cater to various model training requirements.
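A schematic of one generation step such a pipeline might take: fabricate identity fields for a document template, then optionally inject fraud patterns and record them as labels. The field names, document type, and patterns are illustrative, not IDNet's actual schema.

```python
# Generate a synthetic identity record and inject zero or more fraud
# patterns, keeping the applied patterns as ground-truth labels.
import random

FRAUD_PATTERNS = {
    "dob_tamper": lambda d: {**d, "dob": "1990-01-01"},
    "name_swap":  lambda d: {**d, "first": d["last"], "last": d["first"]},
}

def synth_identity(doc_type, rng):
    return {"doc_type": doc_type,
            "first": rng.choice(["Ava", "Liam", "Noah", "Mia"]),
            "last": rng.choice(["Hart", "Kim", "Ruiz", "Chen"]),
            "dob": f"19{rng.randint(60, 99)}-0{rng.randint(1, 9)}"
                   f"-1{rng.randint(0, 9)}"}

def make_sample(doc_type, patterns, rng):
    doc = synth_identity(doc_type, rng)
    for p in patterns:                     # multiple patterns can stack
        doc = FRAUD_PATTERNS[p](doc)
    return {"fields": doc, "labels": patterns or ["genuine"]}

rng = random.Random(7)
print(make_sample("AZ-driver-license", [], rng))
print(make_sample("AZ-driver-license", ["dob_tamper", "name_swap"], rng))
```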
Contributors: Nag, Soham (Author) / Zou, Jia (Thesis advisor) / Yang, Yingzhen (Committee member) / Zhang, Yu (Committee member) / Arizona State University (Publisher)
Created: 2024
Description
Deep learning has become a potent method for drawing conclusions and making forecasts from massive amounts of data. But when used in practical applications, conventional deep learning frameworks frequently run into problems, especially when data is stored in relational database systems. Thus, recent years have seen a stream of research on integrating machine learning model inference with relational databases, which offers benefits such as avoiding privacy issues and data transfer overheads. The logic for performing inference with the DNN model can be encapsulated in a user-defined function (UDF). These UDFs can then be integrated with the query interface of the DBMS and executed by the query execution engine. While it is relatively straightforward to leverage UDFs to implement machine learning algorithms using parallelism, such implementations are not always optimal and may run into issues balancing the database's threading against the threading of the libraries that the UDFs invoke. Since relational databases provide native support for relational operators, it is possible to leverage a cost model to decide when to selectively transform UDF-based inference logic into a model-parallel implementation for optimal performance. This thesis therefore focuses on: (1) designing a domain-specific language for implementing the UDFs using the Velox library, which can be lowered to a graph-based intermediate representation (IR); and (2) providing a cost model that aids the decision of converting a UDF-centric implementation to a relation-centric one.
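A toy version of the cost-model decision described above: estimate the cost of per-row UDF invocation (including a threading-contention overhead) versus a one-time rewrite into batched relational operators, then pick the cheaper plan. All cost constants are invented for illustration.

```python
# Toy plan chooser; every constant below is an assumed placeholder.
def udf_plan_cost(n_rows, per_row_ms=0.1, thread_contention_ms=0.02):
    # UDF path: per-row invocation plus contention between database threads
    # and the threads of the libraries the UDF invokes.
    return n_rows * (per_row_ms + thread_contention_ms)

def relational_plan_cost(n_rows, batch_size=1024, per_batch_ms=8.0,
                         rewrite_ms=500.0):
    # Relation-centric path: one-time rewrite cost, then batched operators.
    n_batches = -(-n_rows // batch_size)  # ceiling division
    return rewrite_ms + n_batches * per_batch_ms

def choose_plan(n_rows):
    udf, rel = udf_plan_cost(n_rows), relational_plan_cost(n_rows)
    return "relation-centric" if rel < udf else "udf-centric"

for n in (1_000, 1_000_000):
    print(n, choose_plan(n))  # small inputs keep the UDF; large ones convert
```

The point of the toy: the rewrite only pays off past a break-even input size, which is exactly the kind of selectivity a cost model provides over always-rewrite or never-rewrite policies.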
Contributors: Masood, Saif (Author) / Zou, Jia (Thesis advisor) / Xiao, Xusheng (Committee member) / Yang, Yingzhen (Committee member) / Arizona State University (Publisher)
Created: 2024
Description
Multivariate time series data are highly common in the healthcare domain, especially in neuroscience, for tasks ranging from detecting and predicting seizures to monitoring intracranial hypertension (ICH). Unfortunately, conventional techniques for leveraging the available time series data do not provide high degrees of accuracy. To address this challenge, this dissertation focuses on onset prediction models for children with brain trauma, in collaboration with neurologists at Phoenix Children's Hospital. The dissertation builds on the key hypothesis that leveraging the spatial information underlying electroencephalogram (EEG) sensor graphs can significantly boost accuracy in a multi-modal environment integrating EEG with intracranial pressure (ICP), arterial blood pressure (ABP), and electrocardiogram (ECG) modalities. Based on this key hypothesis, the dissertation develops novel metadata-supported multivariate time series analysis algorithms for onset detection and prediction. In particular, it investigates a model architecture with a dual attention mechanism that draws global dependencies between inputs and outputs, leveraging self-attention over EEG data through transformers' multi-head attention together with long short-term memory (LSTM) networks. However, recognizing that the positional encoding traditionally used in transformers does not capture the spatial/neighborhood context of EEG sensors, the dissertation investigates novel attention techniques that perform explicit spatial learning using a coupled model network. The dissertation answers the question of how to leverage transformers and LSTMs for implicit and explicit learning through two metadata-supported coupled model networks: (a) a Robust Multi-variate Temporal Features (RMT) model with LSTM, and (b) a convolutional neural network with scale-space attention (CNN-SSA) and LSTM, mapped together using multi-head attention with explicit spatial metadata from the EEG sensor graph, for seizure and ICH onset prediction, respectively. In addition, the dissertation addresses transfer learning between groups where target patients have fewer EEG channels than the source patients; this incomplete data poses problems during pre-processing. Two approaches are explored: an all-predictors approach, which uses spatial context to select the variates that serve as predictors for the missing EEG channels, and an approach based on a common core subset of EEG channels. For data imputation, K-Nearest Neighbors (KNN) regression and a multivariate multi-scale neural network (M2NN) are implemented to address the problem for target patients.
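A small sketch of the spatially guided imputation idea: reconstruct a missing EEG channel from its nearest observed sensors on the scalp. The channel coordinates and k are illustrative assumptions; the dissertation's actual imputation methods are KNN regression and M2NN.

```python
# Impute a missing EEG channel by averaging its k spatially nearest sensors.
import numpy as np

def impute_channel(signals, coords, missing, k=3):
    """signals: {channel: 1-D array}; coords: {channel: (x, y)} scalp
    positions. Average the k spatially nearest observed channels."""
    observed = sorted(
        (c for c in signals if c != missing),
        key=lambda c: np.hypot(coords[c][0] - coords[missing][0],
                               coords[c][1] - coords[missing][1]))
    return np.mean([signals[c] for c in observed[:k]], axis=0)

rng = np.random.default_rng(1)
coords = {"Fp1": (-1, 4), "Fp2": (1, 4), "Cz": (0, 0), "O1": (-1, -4)}
signals = {c: rng.normal(size=256) for c in coords if c != "Fp2"}
signals["Fp2"] = impute_channel(signals, coords, missing="Fp2")  # reconstruct
print(signals["Fp2"][:4])
```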
Contributors: Ravindranath, Manjusha (Author) / Candan, K. Selcuk (Thesis advisor) / Davulcu, Hasan (Committee member) / Zou, Jia (Committee member) / Luisa Sapino, Maria (Committee member) / Arizona State University (Publisher)
Created: 2024