Search Content

Sparse learning package with stability selection and application to alzheimer's disease

Description

Sparse learning is a technique in machine learning for feature selection and dimensionality reduction, to find a sparse set of the most relevant features. In any machine learning problem, there is a considerable amount of irrelevant information, and separating relevant information from the irrelevant information has been a topic of…

Sparse learning is a technique in machine learning for feature selection and dimensionality reduction, to find a sparse set of the most relevant features. In any machine learning problem, there is a considerable amount of irrelevant information, and separating relevant information from the irrelevant information has been a topic of focus. In supervised learning like regression, the data consists of many features and only a subset of the features may be responsible for the result. Also, the features might require special structural requirements, which introduces additional complexity for feature selection. The sparse learning package, provides a set of algorithms for learning a sparse set of the most relevant features for both regression and classification problems. Structural dependencies among features which introduce additional requirements are also provided as part of the package. The features may be grouped together, and there may exist hierarchies and over- lapping groups among these, and there may be requirements for selecting the most relevant groups among them. In spite of getting sparse solutions, the solutions are not guaranteed to be robust. For the selection to be robust, there are certain techniques which provide theoretical justification of why certain features are selected. The stability selection, is a method for feature selection which allows the use of existing sparse learning methods to select the stable set of features for a given training sample. This is done by assigning probabilities for the features: by sub-sampling the training data and using a specific sparse learning technique to learn the relevant features, and repeating this a large number of times, and counting the probability as the number of times a feature is selected. Cross-validation which is used to determine the best parameter value over a range of values, further allows to select the best parameter value. This is done by selecting the parameter value which gives the maximum accuracy score. With such a combination of algorithms, with good convergence guarantees, stable feature selection properties and the inclusion of various structural dependencies among features, the sparse learning package will be a powerful tool for machine learning research. Modular structure, C implementation, ATLAS integration for fast linear algebraic subroutines, make it one of the best tool for a large sparse setting. The varied collection of algorithms, support for group sparsity, batch algorithms, are a few of the notable functionality of the SLEP package, and these features can be used in a variety of fields to infer relevant elements. The Alzheimer Disease(AD) is a neurodegenerative disease, which gradually leads to dementia. The SLEP package is used for feature selection for getting the most relevant biomarkers from the available AD dataset, and the results show that, indeed, only a subset of the features are required to gain valuable insights.

ContributorsThulasiram, Ramesh (Author) / Ye, Jieping (Thesis advisor) / Xue, Guoliang (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2011

Expanding data mining theory for industrial applications

Description

The field of Data Mining is widely recognized and accepted for its applications in many business problems to guide decision-making processes based on data. However, in recent times, the scope of these problems has swollen and the methods are under scrutiny for applicability and relevance to real-world circumstances. At the…

The field of Data Mining is widely recognized and accepted for its applications in many business problems to guide decision-making processes based on data. However, in recent times, the scope of these problems has swollen and the methods are under scrutiny for applicability and relevance to real-world circumstances. At the crossroads of innovation and standards, it is important to examine and understand whether the current theoretical methods for industrial applications (which include KDD, SEMMA and CRISP-DM) encompass all possible scenarios that could arise in practical situations. Do the methods require changes or enhancements? As part of the thesis I study the current methods and delineate the ideas of these methods and illuminate their shortcomings which posed challenges during practical implementation. Based on the experiments conducted and the research carried out, I propose an approach which illustrates the business problems with higher accuracy and provides a broader view of the process. It is then applied to different case studies highlighting the different aspects to this approach.

ContributorsAnand, Aneeth (Author) / Liu, Huan (Thesis advisor) / Kempf, Karl G. (Thesis advisor) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2012

Characterization of cost excess in cloud applications

Description

The pay-as-you-go economic model of cloud computing increases the visibility, traceability, and verifiability of software costs. Application developers must understand how their software uses resources when running in the cloud in order to stay within budgeted costs and/or produce expected profits. Cloud computing's unique economic model also leads naturally to…

The pay-as-you-go economic model of cloud computing increases the visibility, traceability, and verifiability of software costs. Application developers must understand how their software uses resources when running in the cloud in order to stay within budgeted costs and/or produce expected profits. Cloud computing's unique economic model also leads naturally to an earn-as-you-go profit model for many cloud based applications. These applications can benefit from low level analyses for cost optimization and verification. Testing cloud applications to ensure they meet monetary cost objectives has not been well explored in the current literature. When considering revenues and costs for cloud applications, the resource economic model can be scaled down to the transaction level in order to associate source code with costs incurred while running in the cloud. Both static and dynamic analysis techniques can be developed and applied to understand how and where cloud applications incur costs. Such analyses can help optimize (i.e. minimize) costs and verify that they stay within expected tolerances. An adaptation of Worst Case Execution Time (WCET) analysis is presented here to statically determine worst case monetary costs of cloud applications. This analysis is used to produce an algorithm for determining control flow paths within an application that can exceed a given cost threshold. The corresponding results are used to identify path sections that contribute most to cost excess. A hybrid approach for determining cost excesses is also presented that is comprised mostly of dynamic measurements but that also incorporates calculations that are based on the static analysis approach. This approach uses operational profiles to increase the precision and usefulness of the calculations.

ContributorsBuell, Kevin, Ph.D (Author) / Collofello, James (Thesis advisor) / Davulcu, Hasan (Committee member) / Lindquist, Timothy (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2012

Listing combinatorial objects

Description

Gray codes are perhaps the best known structures for listing sequences of combinatorial objects, such as binary strings. Simply defined as a minimal change listing, Gray codes vary greatly both in structure and in the types of objects that they list. More specific types of Gray codes are universal cycles…

Gray codes are perhaps the best known structures for listing sequences of combinatorial objects, such as binary strings. Simply defined as a minimal change listing, Gray codes vary greatly both in structure and in the types of objects that they list. More specific types of Gray codes are universal cycles and overlap sequences. Universal cycles are Gray codes on a set of strings of length n in which the first n-1 letters of one object are the same as the last n-1 letters of its predecessor in the listing. Overlap sequences allow this overlap to vary between 1 and n-1. Some of our main contributions to the areas of Gray codes and universal cycles include a new Gray code algorithm for fixed weight m-ary words, and results on the existence of universal cycles for weak orders on [n]. Overlap cycles are a relatively new structure with very few published results. We prove the existence of s-overlap cycles for k-permutations of [n], which has been an open research problem for several years, as well as constructing 1- overlap cycles for Steiner triple and quadruple systems of every order. Also included are various other results of a similar nature covering other structures such as binary strings, m-ary strings, subsets, permutations, weak orders, partitions, and designs. These listing structures lend themselves readily to some classes of combinatorial objects, such as binary n-tuples and m-ary n-tuples. Others require more work to find an appropriate structure, such as k-subsets of an n-set, weak orders, and designs. Still more require a modification in the representation of the objects to fit these structures, such as partitions. Determining when and how we can fit these sets of objects into our three listing structures is the focus of this dissertation.

ContributorsHoran, Victoria E (Author) / Hurlbert, Glenn H. (Thesis advisor) / Czygrinow, Andrzej (Committee member) / Fishel, Susanna (Committee member) / Colbourn, Charles (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2012

Collaborative Computation in Self-Organizing Particle Systems

Description

Many forms of programmable matter have been proposed for various tasks. We use an abstract model of self-organizing particle systems for programmable matter which could be used for a variety of applications, including smart paint and coating materials for engineering or programmable cells for medical uses. Previous research using this…

Many forms of programmable matter have been proposed for various tasks. We use an abstract model of self-organizing particle systems for programmable matter which could be used for a variety of applications, including smart paint and coating materials for engineering or programmable cells for medical uses. Previous research using this model has focused on shape formation and other spatial configuration problems, including line formation, compression, and coating. In this work we study foundational computational tasks that exceed the capabilities of the individual constant memory particles described by the model. These tasks represent new ways to use these self-organizing systems, which, in conjunction with previous shape and configuration work, make the systems useful for a wider variety of tasks. We present an implementation of a counter using a line of particles, which makes it possible for the line of particles to count to and store values much larger than their individual capacities. We then present an algorithm that takes a matrix and a vector as input and then sets up and uses a rectangular block of particles to compute the matrix-vector multiplication. This setup also utilizes the counter implementation to store the resulting vector from the matrix-vector multiplication. Operations such as counting and matrix multiplication can leverage the distributed and dynamic nature of the self-organizing system to be more efficient and adaptable than on traditional linear computing hardware. Such computational tools also give the systems more power to make complex decisions when adapting to new situations or to analyze the data they collect, reducing reliance on a central controller for setup and output processing. Finally, we demonstrate an application of similar types of computations with self-organizing systems to image processing, with an implementation of an image edge detection algorithm.

ContributorsPorter, Alexandra Marie (Author) / Richa, Andrea (Thesis director) / Xue, Guoliang (Committee member) / School of Music (Contributor) / Computer Science and Engineering Program (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)

Created2016-12

Learning Causality with Networked Observational Data

Description

This dissertation considers the question of how convenient access to copious networked observational data impacts our ability to learn causal knowledge. It investigates in what ways learning causality from such data is different from -- or the same as -- the traditional causal inference which often deals with small scale…

This dissertation considers the question of how convenient access to copious networked observational data impacts our ability to learn causal knowledge. It investigates in what ways learning causality from such data is different from -- or the same as -- the traditional causal inference which often deals with small scale i.i.d. data collected from randomized controlled trials? For example, how can we exploit network information for a series of tasks in the area of learning causality? To answer this question, the dissertation is written toward developing a suite of novel causal learning algorithms that offer actionable insights for a series of causal inference tasks with networked observational data. The work aims to benefit real-world decision-making across a variety of highly influential applications. In the first part of this dissertation, it investigates the task of inferring individual-level causal effects from networked observational data. First, it presents a representation balancing-based framework for handling the influence of hidden confounders to achieve accurate estimates of causal effects. Second, it extends the framework with an adversarial learning approach to properly combine two types of existing heuristics: representation balancing and treatment prediction. The second part of the dissertation describes a framework for counterfactual evaluation of treatment assignment policies with networked observational data. A novel framework that captures patterns of hidden confounders is developed to provide more informative input for downstream counterfactual evaluation methods. The third part presents a framework for debiasing two-dimensional grid-based e-commerce search with observational search log data where there is an implicit network connecting neighboring products in a search result page. A novel inverse propensity scoring framework that models user behavior patterns for two-dimensional display in e-commerce websites is developed, which aims to optimize online performance of ranking algorithms with offline log data.

ContributorsGuo, Ruocheng (Author) / Liu, Huan (Thesis advisor) / Candan, K. Selcuk (Committee member) / Xue, Guoliang (Committee member) / Kiciman, Emre (Committee member) / Arizona State University (Publisher)

Created2021

Heuristics for Arc Routing Problems and Their Applications

Description

Arc Routing Problems (ARPs) are a type of routing problem that finds routes of minimum total cost covering the edges or arcs in a graph representing street or road networks. They find application in many essential services such as residential waste collection, winter gritting, and others. Being NP-hard, solutions are…

Arc Routing Problems (ARPs) are a type of routing problem that finds routes of minimum total cost covering the edges or arcs in a graph representing street or road networks. They find application in many essential services such as residential waste collection, winter gritting, and others. Being NP-hard, solutions are usually found using heuristic methods. This dissertation contributes to heuristics for ARP, with a focus on the Capacitated Arc Routing Problem (CARP) with additional constraints. In operations such as residential waste collection, vehicle breakdown disruptions occur frequently. A new variant Capacitated Arc Re-routing Problem for Vehicle Break-down (CARP-VB) is introduced to address the need to re-route using only remaining vehicles to avoid missing services. A new heuristic Probe is developed to solve CARP-VB. Experiments on benchmark instances show that Probe is better in reducing the makespan and hence effective in reducing delays and avoiding missing services. In addition to total cost, operators are also interested in solutions that are attractive, that is, routes that are contiguous, compact, and non-overlapping to manage the work. Operators may not adopt a solution that is not attractive even if it is optimum. They are also interested in solutions that are balanced in workload to meet equity requirements. A new multi-objective memetic algorithm, MA-ABC is developed, that optimizes three objectives: Attractiveness, makespan, and total cost. On testing with benchmark instances, MA-ABC was found to be effective in providing attractive and balanced route solutions without affecting the total cost. Changes in the problem specification such as demand and topology occurs frequently in business operations. Machine learning be applied to learn the distribution behind these changes and generate solutions quickly at time of inference. Splice is a machine learning framework for CARP that generates closer to optimum solutions quickly using a graph neural network and deep Q-learning. Splice can solve several variants of node and arc routing problems using the same architecture without any modification. Splice was trained and tested using randomly generated instances. Splice generated solutions faster that are also better in comparison to popular metaheuristics.

ContributorsRamamoorthy, Muhilan (Author) / Syrotiuk, Violet R. (Thesis advisor) / Forrest, Stephanie (Committee member) / Mirchandani, Pitu (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created2022

Investigating the Role of Silent Users on Social Media

Description

Social media platforms provide a rich environment for analyzing user behavior. Recently, deep learning-based methods have been a mainstream approach for social media analysis models involving complex patterns. However, these methods are susceptible to biases in the training data, such as participation inequality. Basically, a mere 1% of users generate…

Social media platforms provide a rich environment for analyzing user behavior. Recently, deep learning-based methods have been a mainstream approach for social media analysis models involving complex patterns. However, these methods are susceptible to biases in the training data, such as participation inequality. Basically, a mere 1% of users generate the majority of the content on social networking sites, while the remaining users, though engaged to varying degrees, tend to be less active in content creation and largely silent. These silent users consume and listen to information that is propagated on the platform.However, their voice, attitude, and interests are not reflected in the online content, making the decision of the current methods predisposed towards the opinion of the active users. So models can mistake the loudest users for the majority. To make the silent majority heard is to reveal the true landscape of the platform. In this dissertation, to compensate for this bias in the data, which is related to user-level data scarcity, I introduce three pieces of research work. Two of these proposed solutions deal with the data on hand while the other tries to augment the current data. Specifically, the first proposed approach modifies the weight of users' activity/interaction in the input space, while the second approach involves re-weighting the loss based on the users' activity levels during the downstream task training. Lastly, the third approach uses large language models (LLMs) and learns the user's writing behavior to expand the current data. In other words, by utilizing LLMs as a sophisticated knowledge base, this method aims to augment the silent user's data.

ContributorsKarami, Mansooreh (Author) / Liu, Huan (Thesis advisor) / Sen, Arunabha (Committee member) / Davulcu, Hasan (Committee member) / Mancenido, Michelle V. (Committee member) / Arizona State University (Publisher)

Created2023

Trustworthy IoT Sensing-as-a-Service

Description

The Internet-of-Things (IoT) paradigm is reshaping the ways to interact with the physical space. Many emerging IoT applications need to acquire, process, gain insights from, and act upon the massive amount of data continuously produced by ubiquitous IoT sensors. It is nevertheless technically challenging and economically prohibitive for each IoT…

The Internet-of-Things (IoT) paradigm is reshaping the ways to interact with the physical space. Many emerging IoT applications need to acquire, process, gain insights from, and act upon the massive amount of data continuously produced by ubiquitous IoT sensors. It is nevertheless technically challenging and economically prohibitive for each IoT application to deploy and maintain a dedicated large-scale sensor network over distributed wide geographic areas. Built upon the Sensing-as-a-Service paradigm, cloud-sensing service providers are emerging to provide heterogeneous sensing data to various IoT applications with a shared sensing substrate. Cyber threats are among the biggest obstacles against the faster development of cloud-sensing services. This dissertation presents novel solutions to achieve trustworthy IoT sensing-as-a-service. Chapter 1 introduces the cloud-sensing system architecture and the outline of this dissertation. Chapter 2 presents MagAuth, a secure and usable two-factor authentication scheme that explores commercial off-the-shelf wrist wearables with magnetic strap bands to enhance the security and usability of password-based authentication for touchscreen IoT devices. Chapter 3 presents SmartMagnet, a novel scheme that combines smartphones and cheap magnets to achieve proximity-based access control for IoT devices. Chapter 4 proposes SpecKriging, a new spatial-interpolation technique based on graphic neural networks for secure cooperative spectrum sensing which is an important application of cloud-sensing systems. Chapter 5 proposes a trustworthy multi-transmitter localization scheme based on SpecKriging. Chapter 6 discusses the future work.

ContributorsZhang, Yan (Author) / Zhang, Yanchao YZ (Thesis advisor) / Fan, Deliang (Committee member) / Xue, Guoliang (Committee member) / Reisslein, Martin (Committee member) / Arizona State University (Publisher)

Created2022

Learning from the Data Heterogeneity for Data Imputation

Description

Data mining, also known as big data analysis, has been identified as a critical and challenging process for a variety of applications in real-world problems. Numerous datasets are collected and generated every day to store the information. The rise in the number of data volumes and data modality has resulted…

Data mining, also known as big data analysis, has been identified as a critical and challenging process for a variety of applications in real-world problems. Numerous datasets are collected and generated every day to store the information. The rise in the number of data volumes and data modality has resulted in the increased demand for data mining methods and strategies of finding anomalies, patterns, and correlations within large data sets to predict outcomes. Effective machine learning methods are widely adapted to build the data mining pipeline for various purposes like business understanding, data understanding, data preparation, modeling, evaluation, and deployment. The major challenges for effectively and efficiently mining big data include (1) data heterogeneity and (2) missing data. Heterogeneity is the natural characteristic of big data, as the data is typically collected from different sources with diverse formats. The missing value is the most common issue faced by the heterogeneous data analysis, which resulted from variety of factors including the data collecting processing, user initiatives, erroneous data entries, and so on. In response to these challenges, in this thesis, three main research directions with application scenarios have been investigated: (1) Mining and Formulating Heterogeneous Data, (2) missing value imputation strategy in various application scenarios in both offline and online manner, and (3) missing value imputation for multi-modality data. Multiple strategies with theoretical analysis are presented, and the evaluation of the effectiveness of the proposed algorithms compared with state-of-the-art methods is discussed.

Contributorsliu, Xu (Author) / He, Jingrui (Thesis advisor) / Xue, Guoliang (Thesis advisor) / Li, Baoxin (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)

Created2021

Filtering by