Sparse learning package with stability selection and application to Alzheimer's disease

Description

Sparse learning is a machine learning technique for feature selection and dimensionality reduction that finds a sparse set of the most relevant features. Any machine learning problem contains a considerable amount of irrelevant information, and separating the relevant information from the irrelevant has long been a topic of focus. In supervised learning such as regression, the data consist of many features, and only a subset of them may be responsible for the result. The features may also be subject to structural requirements, which add complexity to feature selection. The sparse learning package provides a set of algorithms for learning a sparse set of the most relevant features for both regression and classification problems. The package also supports structural dependencies among features: features may be grouped, the groups may be hierarchical or overlapping, and there may be a requirement to select the most relevant groups among them. Although these methods yield sparse solutions, the solutions are not guaranteed to be robust. For robust selection, certain techniques provide theoretical justification for why particular features are selected. Stability selection is a feature selection method that allows existing sparse learning methods to be used to select a stable set of features for a given training sample. It assigns a probability to each feature by sub-sampling the training data, applying a specific sparse learning technique to learn the relevant features, repeating this a large number of times, and estimating the probability as the fraction of times the feature is selected. Cross-validation is used to determine the best parameter value over a range of values by selecting the value that gives the maximum accuracy score. With this combination of algorithms, good convergence guarantees, stable feature selection properties, and support for various structural dependencies among features, the sparse learning package will be a powerful tool for machine learning research. Its modular structure, C implementation, and ATLAS integration for fast linear algebra subroutines make it one of the best tools for large sparse settings. The varied collection of algorithms, support for group sparsity, and batch algorithms are a few of the notable capabilities of the SLEP package, and they can be used in a variety of fields to infer relevant elements. Alzheimer's disease (AD) is a neurodegenerative disease that gradually leads to dementia. The SLEP package is used for feature selection to obtain the most relevant biomarkers from the available AD dataset, and the results show that, indeed, only a subset of the features is required to gain valuable insights.
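
The stability selection procedure described in this abstract (sub-sample, fit a sparse model, repeat, count how often each feature is selected) can be sketched roughly as follows. This is an illustrative sketch, not the SLEP implementation: the function names `lasso_ista`, `stability_selection`, and `make_demo_data`, the choice of an l1-regularized least-squares solver, the sub-sampling fraction, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def lasso_ista(X, y, lam, n_iter=300):
    """Minimize (1/2n)||Xw - y||^2 + lam*||w||_1 by proximal gradient (ISTA).

    Hypothetical stand-in for "a specific sparse learning technique".
    """
    n, d = X.shape
    w = np.zeros(d)
    # Step size = 1 / Lipschitz constant of the smooth part's gradient.
    step = n / (np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        z = w - step * grad
        # Soft-thresholding sets small coefficients exactly to zero.
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return w

def stability_selection(X, y, lam, n_rounds=100, frac=0.5, seed=0):
    """Return, per feature, the fraction of sub-samples in which it was selected."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = max(1, int(frac * n))          # sub-sample size (assumed: half the data)
    counts = np.zeros(d)
    for _ in range(n_rounds):
        idx = rng.choice(n, size=m, replace=False)   # sub-sample without replacement
        w = lasso_ista(X[idx], y[idx], lam)
        counts += (w != 0)             # a feature is "selected" if its weight is nonzero
    return counts / n_rounds

def make_demo_data(seed=0):
    """Synthetic data (assumption): only features 0 and 1 influence y."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((200, 20))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.standard_normal(200)
    return X, y

if __name__ == "__main__":
    X, y = make_demo_data()
    freq = stability_selection(X, y, lam=0.3)
    print(np.round(freq, 2))           # selection frequency for each of the 20 features
```

Features whose selection frequency exceeds a chosen cutoff would be kept as the stable set; the cutoff and the regularization value `lam` are tuning choices, and the abstract suggests cross-validation for the latter.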

Contributors: Thulasiram, Ramesh (Author) / Ye, Jieping (Thesis advisor) / Xue, Guoliang (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)

Created: 2011

An investigation of the cost and accuracy tradeoffs of supplanting AFDs with Bayes network in query processing in the presence of incompleteness in autonomous databases

Description

As the information available to lay users through autonomous data sources continues to increase, mediators become important to ensure that the wealth of information available is tapped effectively. A key challenge for these information mediators is the varying level of incompleteness in the underlying databases in terms of missing attribute values. Existing approaches such as Query Processing over Incomplete Autonomous Databases (QPIAD) mine and use Approximate Functional Dependencies (AFDs) to predict and retrieve relevant incomplete tuples. These approaches make independence assumptions about missing values, which critically hobbles their performance when tuples contain missing values for multiple correlated attributes. In this thesis, I present a principled probabilistic alternative that views an incomplete tuple as defining a distribution over the complete tuples that it stands for. I learn this distribution in terms of Bayes networks. My approach involves mining ("learning") Bayes networks from a sample of the database and using them for both imputation (predicting a missing value) and query rewriting (retrieving relevant results with incompleteness on the query-constrained attributes when the data sources are autonomous). I present empirical studies demonstrating that (i) at higher levels of incompleteness, when multiple attribute values are missing, Bayes networks provide significantly higher classification accuracy and (ii) the relevant possible answers retrieved by queries reformulated using Bayes networks have higher precision and recall than those retrieved using AFDs, while keeping query processing costs manageable.

Contributors: Raghunathan, Rohit (Author) / Kambhampati, Subbarao (Thesis advisor) / Liu, Huan (Committee member) / Lee, Joohyung (Committee member) / Arizona State University (Publisher)

Created: 2011

Representing the language of the causal calculator in answer set programming

Description

Action language C+ is a formalism for describing properties of actions, which is based on nonmonotonic causal logic. The definite fragment of C+ is implemented in the Causal Calculator (CCalc), which is based on the reduction of nonmonotonic causal logic to propositional logic. This thesis describes the language of CCalc in terms of answer set programming (ASP), based on the translation of nonmonotonic causal logic to formulas under the stable model semantics. I designed a standard library which describes the constructs of the input language of CCalc in terms of ASP, allowing a simple modular method to represent CCalc input programs in the language of ASP. Using the combination of system F2LP and answer set solvers, this method achieves functionality close to that of CCalc while taking advantage of answer set solvers to yield efficient computation that is orders of magnitude faster than CCalc for many benchmark examples. In support of this, I created an automated translation system Cplus2ASP that implements the translation and encoding method and automatically invokes the necessary software to solve the translated input programs.

Contributors: Casolary, Michael (Author) / Lee, Joohyung (Thesis advisor) / Ahn, Gail-Joon (Committee member) / Baral, Chitta (Committee member) / Arizona State University (Publisher)

Created: 2011

Multi-task learning via structured regularization: formulations, algorithms, and applications

Description

Multi-task learning (MTL) aims to improve the generalization performance of the resulting classifiers by learning multiple related tasks simultaneously. Specifically, MTL exploits the intrinsic task relatedness, through which informative domain knowledge from each task can be shared across multiple tasks and thus facilitate the learning of each individual task. Sharing domain knowledge among tasks is particularly desirable when there are a number of related tasks but only limited training data is available for each. Modeling the relationship among multiple tasks is critical to the generalization performance of MTL algorithms. In this dissertation, I propose a series of MTL approaches that assume multiple tasks are intrinsically related via a shared low-dimensional feature space. The proposed approaches are developed for different scenarios and settings; each is formulated as a mathematical optimization problem that minimizes the empirical loss regularized by a different structure. For all proposed MTL formulations, I develop the associated optimization algorithms to find their globally optimal solutions efficiently. I also conduct theoretical analysis for certain MTL approaches by deriving the globally optimal solution recovery condition and the performance bound. To demonstrate practical performance, I apply the proposed MTL approaches to two real-world applications: (1) automated annotation of Drosophila gene expression pattern images and (2) categorization of Yahoo web pages. The experimental results demonstrate the efficiency and effectiveness of the proposed algorithms.
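
The phrase "minimizing the empirical loss regularized by different structures" can be read as a generic template of the following form. The notation is an assumption introduced here for illustration (T tasks, n_t training pairs (x_{t,i}, y_{t,i}) for task t, weight matrix W with columns w_t, loss \ell, regularizer \Omega, trade-off parameter \lambda); it is not taken from the dissertation.

```latex
\min_{W = [w_1, \dots, w_T]} \;
  \sum_{t=1}^{T} \frac{1}{n_t} \sum_{i=1}^{n_t}
    \ell\!\left( w_t^{\top} x_{t,i},\; y_{t,i} \right)
  \;+\; \lambda \, \Omega(W)
```

One commonly used choice for encouraging a shared low-dimensional feature space is the trace norm, \Omega(W) = \|W\|_{*}, which is a convex surrogate for the rank of W. It is mentioned only as an example of a structural regularizer and may differ from the formulations proposed in the dissertation.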

Contributors: Chen, Jianhui (Author) / Ye, Jieping (Thesis advisor) / Kumar, Sudhir (Committee member) / Liu, Huan (Committee member) / Xue, Guoliang (Committee member) / Arizona State University (Publisher)

Created: 2011

Structured sparse learning and its applications to biomedical and biological data

Description

Sparsity has become an important modeling tool in areas such as genetics, signal and audio processing, and medical image processing. Through l1-norm-based regularization, structured sparse learning algorithms can produce highly accurate models while imposing various predefined structures on the data, such as feature groups or graphs. In this thesis, I first propose to solve a sparse learning model with a general group structure, where the predefined groups may overlap with each other. Then, I present three real-world applications that can benefit from the group-structured sparse learning technique. In the first application, I study the Alzheimer's disease diagnosis problem using multi-modality neuroimaging data. In this dataset, not every subject has all data sources available, giving rise to a unique and challenging block-wise missing pattern. In the second application, I study the automatic annotation and retrieval of fruit-fly gene expression pattern images. Combined with spatial information, sparse learning techniques can be used to construct effective representations of the expression images. In the third application, I present a new computational approach to annotating the developmental stage of Drosophila embryos in gene expression images. In addition, the approach provides a stage score that enables finer annotation of each embryo, so that embryos are divided into early and late periods of development within standard stage demarcations. Stage scores help illuminate global gene activities and changes much better, and more refined stage annotations improve the ability to interpret results when expression pattern matches are discovered between genes.

Contributors: Yuan, Lei (Author) / Ye, Jieping (Thesis advisor) / Wang, Yalin (Committee member) / Xue, Guoliang (Committee member) / Kumar, Sudhir (Committee member) / Arizona State University (Publisher)

Created: 2013

Answer set programming and other computing paradigms

Description

Answer Set Programming (ASP) is one of the most prominent and successful knowledge representation paradigms. The success of ASP is due to its expressive non-monotonic modeling language and its efficient computational methods, which originate from building propositional satisfiability solvers. The wide adoption of ASP has motivated several extensions to its modeling language to enhance expressivity, such as incorporating aggregates and interfaces with ontologies. Also, to overcome the grounding bottleneck of computation in ASP, there is increasing interest in integrating ASP with other computing paradigms, such as Constraint Programming (CP) and Satisfiability Modulo Theories (SMT). Due to the non-monotonic nature of the ASP semantics, such enhancements have turned out to be non-trivial, and the existing extensions are not fully satisfactory. We observe that one main reason for the difficulties is rooted in the propositional semantics of ASP, which is limited in handling first-order constructs (such as aggregates and ontologies) and functions (such as constraint variables in CP and SMT) in natural ways. This dissertation presents a unifying view of these extensions by treating them as instances of formulas with generalized quantifiers and intensional functions. We extend the first-order stable model semantics by Ferraris, Lee, and Lifschitz to allow generalized quantifiers, which cover aggregates, DL-atoms, constraints, and SMT theory atoms as special cases. Using this unifying framework, we study and relate different extensions of ASP. We also present a tight integration of ASP with SMT, based on which we enhance action language C+ to handle reasoning about continuous changes. Our framework yields a systematic approach to studying and extending non-monotonic languages.

Contributors: Meng, Yunsong (Author) / Lee, Joohyung (Thesis advisor) / Ahn, Gail-Joon (Committee member) / Baral, Chitta (Committee member) / Fainekos, Georgios (Committee member) / Lifschitz, Vladimir (Committee member) / Arizona State University (Publisher)

Created: 2013

Optimization for resource-constrained wireless networks

Description

Wireless communications and networks are now widely used in our daily lives. One of the most important topics in networking research is the use of optimization tools to improve the utilization of network resources. In this dissertation, we concentrate on optimization for resource-constrained wireless networks and study two fundamental resource-allocation problems: 1) distributed routing optimization and 2) anypath routing optimization. The study of the distributed routing optimization problem is composed of two main thrusts, targeted at understanding distributed routing and resource optimization for multihop wireless networks. The first thrust is dedicated to understanding the impact of full-duplex transmission on wireless network resource optimization. We propose two provably good distributed algorithms to optimize the resources in a full-duplex wireless network. We prove their optimality and also provide network status analysis using dual space information. The second thrust is dedicated to understanding the influence of network entity load constraints on network resource allocation and routing computation. We propose a provably good distributed algorithm to allocate wireless resources. In addition, we propose a new subgradient optimization framework, which can provide fine-grained convergence, optimality, and dual space information at each iteration. This framework can provide a useful theoretical foundation for many networking optimization problems. The study of the anypath routing optimization problem is also composed of two main thrusts. The first thrust is dedicated to understanding the computational complexity of multi-constrained anypath routing and designing approximate solutions. We prove that this problem is NP-hard when the number of constraints is larger than one. We present two polynomial-time K-approximation algorithms: one is centralized and the other is distributed. For the second thrust, we study directional anypath routing and present a cross-layer design of MAC and routing. For the MAC layer, we present a directional anycast MAC. For the routing layer, we propose two polynomial-time routing algorithms to compute directional anypaths based on two antenna models, and prove their optimality based on the packet delivery ratio metric.

Contributors: Fang, Xi (Author) / Xue, Guoliang (Thesis advisor) / Yau, Sik-Sang (Committee member) / Ye, Jieping (Committee member) / Zhang, Junshan (Committee member) / Arizona State University (Publisher)

Created: 2013

Design, analysis and resource allocations in networks in presence of region-based faults

Description

Communication networks, both wired and wireless, are expected to have a certain level of fault-tolerance capability. These networks are also expected to ensure a graceful degradation in performance when some of the network components fail. Traditional studies on fault tolerance in communication networks, for the most part, make no assumptions regarding the location of node/link faults, i.e., the faulty nodes and links may be close to each other or far from each other. However, in many real-life scenarios, there exists a strong spatial correlation among the faulty nodes and links. Such failures are often encountered in disaster situations, e.g., natural calamities or enemy attacks. In the presence of such region-based faults, many traditional network analysis and fault-tolerance metrics that are valid under non-spatially correlated faults are no longer applicable. To this effect, the main thrust of this research is the design and analysis of robust networks in the presence of such region-based faults. One important finding of this research is that if some prior knowledge is available on the maximum size of the region that might be affected by a region-based fault, this knowledge can be effectively utilized for resource-efficient design of networks. This dissertation shows that, in some scenarios, effective utilization of this knowledge may result in substantial savings in transmission power in wireless networks. In this dissertation, the impact of region-based faults on the connectivity of wireless networks is studied, and a new metric, region-based connectivity, is proposed to measure the fault-tolerance capability of a network. In addition, novel metrics, such as the region-based component decomposition number (RBCDN) and region-based largest component size (RBLCS), are proposed to capture the network state when a region-based fault disconnects the network. Finally, this dissertation presents efficient resource allocation techniques that ensure tolerance against region-based faults in distributed file storage networks and data center networks.

Contributors: Banerjee, Sujogya (Author) / Sen, Arunabha (Thesis advisor) / Xue, Guoliang (Committee member) / Richa, Andrea (Committee member) / Hurlbert, Glenn (Committee member) / Arizona State University (Publisher)

Created: 2013

Coping with selfish behavior in networks using game theory

Description

While network problems have traditionally been addressed assuming a central administrative domain with a single objective, the devices in most networks are actually owned not by a single entity but by many individual entities. These entities make their decisions independently and selfishly, and may cooperate with a small group of other entities only when such a coalition yields a better return. The interaction among multiple independent decision-makers necessitates the use of game theory, including economic notions related to markets and incentives. In this dissertation, we are interested in modeling, analyzing, and addressing network problems caused by the selfish behavior of network entities. First, we study how the selfish behavior of network entities affects system performance when users compete for limited resources. For this resource allocation domain, we study the selfish routing problem in networks with fair queuing on links, the relay assignment problem in cooperative networks, and the channel allocation problem in wireless networks. Another important aspect of this dissertation is the design of efficient mechanisms that incentivize network entities to achieve a certain system objective. For this incentive mechanism domain, we aim to motivate wireless devices to serve as relays for cooperative communication and to recruit smartphones for crowdsourcing. In addition, we apply different game-theoretic approaches to problems in the security and privacy domain. For this domain, we analyze how a user could defend against a smart jammer, who can quickly learn about the user's transmission power. We also design mechanisms to encourage mobile phone users to participate in location privacy protection in order to achieve k-anonymity.

Contributors: Yang, Dejun (Author) / Xue, Guoliang (Thesis advisor) / Richa, Andrea (Committee member) / Sen, Arunabha (Committee member) / Zhang, Junshan (Committee member) / Arizona State University (Publisher)

Created: 2013

Detection of advanced bots in smartphones through user profiling

Description

This thesis addresses the ever-increasing threat of botnets in the smartphone domain, focusing on the Android platform and on botnets that use Online Social Networks (OSNs) as their Command and Control (C&C) medium. In any botnet, the C&C channel is one of the components on which the botnet's survival depends: individual bots use it to receive commands and send data. This thesis develops an active host-based approach that identifies the presence of a bot from anomalies in the user's usage patterns before and after the bot is installed on the user's smartphone, and alerts the user to the presence of the bot. A profile is constructed for each user from regular web usage patterns (obtained by intercepting HTTP(S) traffic), and machine learning techniques are applied to continuously learn the user's behavior and changes in that behavior, while looking for anomalies above a threshold, which cause the user to be notified of the anomalous traffic. A prototype bot that uses OSNs as its C&C channel is constructed and used for testing. Users are given smartphones (Nexus 4 and Galaxy Nexus) running an application proxy that intercepts HTTP(S) traffic and relays it to a server, which uses the traffic to construct a model for each user and looks for signs of anomalies. This approach lays the groundwork for future host-based countermeasures against smartphone botnets that use OSNs as their C&C channel.

Contributors: Kilari, Vishnu Teja (Author) / Xue, Guoliang (Thesis advisor) / Ahn, Gail-Joon (Committee member) / Dasgupta, Partha (Committee member) / Arizona State University (Publisher)

Created: 2013
