Description
Advances in data collection technologies have made it cost-effective to obtain heterogeneous data from multiple data sources. Very often, the data are of very high dimension, and feature selection is preferred in order to reduce noise, save computational cost, and learn interpretable models. Due to the multi-modal nature of heterogeneous data, it is interesting to design efficient machine learning models that are capable of performing variable selection and feature group (data source) selection simultaneously (a.k.a. bi-level selection). In this thesis, I carry out research along this direction with a particular focus on designing efficient optimization algorithms. I start with a unified bi-level learning model that contains several existing feature selection models as special cases. The proposed model is then extended to tackle block-wise missing data, one of the major challenges in the diagnosis of Alzheimer's Disease (AD). Moreover, I propose a novel interpretable sparse group feature selection model that greatly facilitates parameter tuning and model selection. Last but not least, I show that by solving the sparse group hard thresholding problem directly, the sparse group feature selection model can be further improved in terms of both algorithmic complexity and efficiency. Promising results are demonstrated in extensive evaluations on multiple real-world data sets.
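The bi-level selection idea can be illustrated with a greedy sparse group hard-thresholding step: keep the g highest-energy groups, then the s largest-magnitude features within them. This is a minimal sketch under assumed names and a greedy strategy, not the thesis's exact algorithm (which solves the thresholding problem directly):

```python
import numpy as np

def sparse_group_hard_threshold(w, groups, g, s):
    """Greedy bi-level hard thresholding: at most g groups, s features.

    w      : 1-D coefficient vector
    groups : list of index lists, one per feature group
    g, s   : group and feature sparsity budgets
    """
    # Keep the g groups with the largest l2 energy.
    energy = [np.sum(w[idx] ** 2) for idx in groups]
    kept = np.argsort(energy)[::-1][:g]
    mask = np.zeros(w.shape, dtype=bool)
    for gi in kept:
        mask[groups[gi]] = True
    # Within the kept groups, retain only the s largest-magnitude features.
    cand = np.flatnonzero(mask)
    top = cand[np.argsort(np.abs(w[cand]))[::-1][:s]]
    out = np.zeros_like(w)
    out[top] = w[top]
    return out
```

The result is simultaneously group-sparse and feature-sparse, which is the bi-level structure the abstract describes.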
ContributorsXiang, Shuo (Author) / Ye, Jieping (Thesis advisor) / Mittelmann, Hans D (Committee member) / Davulcu, Hasan (Committee member) / He, Jingrui (Committee member) / Arizona State University (Publisher)
Created2014
Description
This thesis studies recommendation systems and considers joint sampling and learning. Sampling in recommendation systems is to obtain users' ratings on specific items chosen by the recommendation platform, and learning is to infer the unknown ratings of users to items given the existing data. In this thesis, the problem is formulated as an adaptive matrix completion problem in which sampling is to reveal the unknown entries of a $U\times M$ matrix where $U$ is the number of users, $M$ is the number of items, and each entry of the $U\times M$ matrix represents the rating of a user to an item. In the literature, this matrix completion problem has been studied under a static setting, i.e., recovering the matrix based on a set of partial ratings. This thesis considers both sampling and learning, and proposes an adaptive algorithm. The algorithm adapts its sampling and learning based on the existing data. The idea is to sample items that reveal more information based on the previous sampling results and then learn based on clustering. Performance of the proposed algorithm has been evaluated using simulations.
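The cluster-based learning step can be sketched as filling each user's unknown ratings with the per-item mean of that user's cluster. This is a minimal illustration assuming cluster labels are already available; the function name and the mean-fill rule are assumptions, not the thesis's algorithm:

```python
import numpy as np

def complete_by_clusters(R, observed, labels):
    """Fill unobserved ratings with the per-item mean of each user cluster.

    R        : U x M rating matrix (values meaningful only where observed)
    observed : boolean U x M mask of sampled entries
    labels   : length-U array of user cluster assignments
    """
    R_hat = R.astype(float).copy()
    U, M = R.shape
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        for m in range(M):
            seen = members[observed[members, m]]
            if seen.size == 0:
                continue  # no samples for this item in this cluster
            mean = R[seen, m].mean()
            for u in members:
                if not observed[u, m]:
                    R_hat[u, m] = mean
    return R_hat
```

An adaptive sampler would then choose which entries of `observed` to reveal next based on the current estimate, closing the sample-then-learn loop the abstract describes.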
ContributorsZhu, Lingfang (Author) / Xue, Guoliang (Thesis advisor) / He, Jingrui (Committee member) / Tong, Hanghang (Committee member) / Arizona State University (Publisher)
Created2015
Description
Diffusion processes in networks can be used to model many real-world processes, such as the propagation of a rumor on social networks and cascading failures on power networks. Analysis of diffusion processes in networks can help us answer important questions, such as the role and importance of each node in spreading the diffusion and how to stop or contain a cascading failure in the network. This dissertation consists of three parts.

In the first part, we study the problem of locating multiple diffusion sources in networks under the Susceptible-Infected-Recovered (SIR) model. Given a complete snapshot of the network, we developed a sample-path-based algorithm, named clustering and localization, and proved that for regular trees, the estimators produced by the proposed algorithm are within a constant distance from the real sources with high probability. Then, we considered the case in which only a partial snapshot is observed and proposed a new algorithm, named Optimal-Jordan-Cover (OJC). The algorithm first extracts a subgraph using a candidate selection algorithm that selects source candidates based on the number of observed infected nodes in their neighborhoods. Then, in the extracted subgraph, OJC finds a set of nodes that "cover" all observed infected nodes with the minimum radius. This set of nodes is called the Jordan cover and is regarded as the set of diffusion sources. We proved that OJC can locate all sources with probability one asymptotically with partial observations in the Erdos-Renyi (ER) random graph. Experiments on multiple networks show that our algorithms outperform existing ones.
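A brute-force version of the Jordan cover idea (find k nodes covering all observed infected nodes with minimum radius) can be sketched with BFS distances. This exhaustive search is only feasible on tiny graphs and omits OJC's candidate-selection step; the names are illustrative:

```python
from collections import deque
from itertools import combinations

def bfs_dist(adj, src):
    """Hop distances from src in an adjacency-list graph."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def jordan_cover(adj, infected, k):
    """Exhaustively find k nodes covering all infected nodes with minimum
    radius (max distance from an infected node to its nearest cover node)."""
    dists = {u: bfs_dist(adj, u) for u in adj}
    best_r, best_cover = float("inf"), None
    for combo in combinations(adj, k):
        r = max(min(dists[c].get(i, float("inf")) for c in combo)
                for i in infected)
        if r < best_r:
            best_r, best_cover = r, combo
    return best_r, best_cover
```

OJC makes this tractable by restricting the search to a candidate subgraph rather than enumerating all node subsets.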

In the second part, we tackle the problem of reconstructing the diffusion history from partial observations. We formulated the diffusion history reconstruction problem as a maximum a posteriori (MAP) problem and proved that the problem is NP-hard. We then proposed a step-by-step reconstruction algorithm, which always produces a diffusion history that is consistent with the partial observations. Our experimental results on synthetic and real networks show that the algorithm significantly outperforms some existing methods.

In the third part, we consider the problem of improving the robustness of an interdependent network by rewiring a small number of links during a cascading attack. We formulated the problem as a Markov decision process (MDP) problem. While the problem is NP-hard, we developed an effective and efficient algorithm, RealWire, to robustify the network and to mitigate the damage during the attack. Extensive experimental results show that our algorithm outperforms other algorithms on most of the robustness metrics.
ContributorsChen, Zhen (Author) / Ying, Lei (Thesis advisor) / Tong, Hanghang (Thesis advisor) / Zhang, Junshan (Committee member) / He, Jingrui (Committee member) / Arizona State University (Publisher)
Created2018
Description
Network mining has been attracting a lot of research attention because of the prevalence of networks. As the world is becoming increasingly connected and correlated, networks arising from inter-dependent application domains are often collected from different sources, forming the so-called multi-sourced networks. Examples of such multi-sourced networks include critical infrastructure networks, multi-platform social networks, cross-domain collaboration networks, and many more. Compared with single-sourced networks, multi-sourced networks bear more complex structures and therefore could potentially contain more valuable information.

This thesis proposes a multi-layered HITS (Hyperlink-Induced Topic Search) algorithm to perform the ranking task on multi-sourced networks. Specifically, each node in the network receives an authority score and a hub score for evaluating the value of the node itself and the value of its outgoing links, respectively. Based on a recent multi-layered network model, which allows a more flexible dependency structure across different sources (i.e., layers), the proposed algorithm leverages both within-layer smoothness and cross-layer consistency. This essentially allows nodes from different layers to be ranked accordingly. The multi-layered HITS is formulated as a regularized optimization problem with non-negativity constraints and solved by an iterative update process. Extensive experimental evaluations demonstrate the effectiveness and explainability of the proposed algorithm.
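As a baseline for the multi-layered extension, the classical single-layer HITS update alternates a = A^T h and h = A a with renormalization. The sketch below shows that baseline only; the thesis's algorithm adds within-layer and cross-layer regularization terms to these updates:

```python
import numpy as np

def hits(A, iters=50):
    """Single-layer HITS: authority a = A^T h, hub h = A a, renormalized.

    A[i, j] = 1 if node i links to node j.
    """
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h
        a /= np.linalg.norm(a)  # keep scores bounded
        h = A @ a
        h /= np.linalg.norm(h)
    return a, h
```

Nodes pointed to by strong hubs accumulate authority, and nodes pointing to strong authorities accumulate hub score, which matches the two scores described above.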
ContributorsYu, Haichao (Author) / Tong, Hanghang (Thesis advisor) / He, Jingrui (Committee member) / Yang, Yezhou (Committee member) / Arizona State University (Publisher)
Created2018
Description
Unsupervised learning of time series data, also known as temporal clustering, is a challenging problem in machine learning. This thesis presents a novel algorithm, Deep Temporal Clustering (DTC), to naturally integrate dimensionality reduction and temporal clustering into a single end-to-end learning framework, fully unsupervised. The algorithm utilizes an autoencoder for temporal dimensionality reduction and a novel temporal clustering layer for cluster assignment. It then jointly optimizes the clustering objective and the dimensionality reduction objective. Depending on the requirements of the application, the temporal clustering layer can be customized with any temporal similarity metric. Several similarity metrics and state-of-the-art algorithms are considered and compared. To gain insight into the temporal features that the network has learned for its clustering, a visualization method is applied that generates a region-of-interest heatmap for the time series. The viability of the algorithm is demonstrated using time series data from diverse domains, ranging from earthquakes to spacecraft sensor data. In each case, the proposed algorithm outperforms traditional methods. The superior performance is attributed to the fully integrated temporal dimensionality reduction and clustering criterion.
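One common way to implement a clustering layer of this kind is a Student's-t soft assignment between latent codes and cluster centroids, as popularized by deep embedded clustering. The sketch below assumes that design and is illustrative, not necessarily the exact layer used in DTC:

```python
import numpy as np

def soft_assign(z, centroids, alpha=1.0):
    """Student's-t soft cluster assignment.

    q[i, j] is proportional to (1 + ||z_i - mu_j||^2 / alpha)^(-(alpha+1)/2),
    normalized so each row sums to one. z holds latent codes from the
    encoder; centroids holds one vector per cluster.
    """
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)
```

Because the assignment is differentiable in both the codes and the centroids, the clustering objective can be optimized jointly with the autoencoder's reconstruction loss, as the abstract describes.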
ContributorsMadiraju, NaveenSai (Author) / Liang, Jianming (Thesis advisor) / Wang, Yalin (Thesis advisor) / He, Jingrui (Committee member) / Arizona State University (Publisher)
Created2018
Description
This thesis presents a family of adaptive curvature methods for gradient-based stochastic optimization. In particular, a general algorithmic framework is introduced along with a practical implementation that yields an efficient, adaptive curvature gradient descent algorithm. To this end, a theoretical and practical link between curvature matrix estimation and shrinkage methods for covariance matrices is established. The use of shrinkage improves the estimation accuracy of the curvature matrix when data samples are scarce. This thesis also introduces several insights that result in data- and computation-efficient update equations. Empirical results suggest that the proposed method compares favorably with existing second-order techniques based on the Fisher or Gauss-Newton matrices and with adaptive stochastic gradient descent methods on both supervised and reinforcement learning tasks.
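The shrinkage idea can be illustrated by blending a sample matrix (curvature or covariance) with a scaled-identity target, in the spirit of Ledoit-Wolf estimators. The blending weight and the trace-based target below are common choices assumed for illustration, not the thesis's specific estimator:

```python
import numpy as np

def shrink_covariance(S, lam):
    """Blend a sample matrix S with a scaled-identity target.

    lam in [0, 1] controls shrinkage intensity; the target trace(S)/d * I
    preserves the average eigenvalue while damping extreme ones, which
    stabilizes the estimate when samples are scarce relative to dimension d.
    """
    d = S.shape[0]
    target = (np.trace(S) / d) * np.eye(d)
    return (1.0 - lam) * S + lam * target
```

In a second-order optimizer, the shrunk matrix (rather than the raw sample estimate) would precondition the gradient step, trading a little bias for much lower variance.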
ContributorsBarron, Trevor (Author) / Ben Amor, Heni (Thesis advisor) / He, Jingrui (Committee member) / Levihn, Martin (Committee member) / Arizona State University (Publisher)
Created2019
Description
Spike sorting is a critical step in single-unit analysis of neural activity recorded extracellularly and simultaneously using multi-channel electrodes. When dealing with recordings from very large numbers of neurons, existing methods, which are mostly semiautomatic in nature, become inadequate.

This dissertation aims at automating the spike sorting process. A high performance, automatic and computationally efficient spike detection and clustering system, namely, the M-Sorter2 is presented. The M-Sorter2 employs the modified multiscale correlation of wavelet coefficients (MCWC) for neural spike detection. At the center of the proposed M-Sorter2 are two automatic spike clustering methods. They share a common hierarchical agglomerative modeling (HAM) model search procedure to strategically form a sequence of mixture models, and a new model selection criterion called difference of model evidence (DoME) to automatically determine the number of clusters. The M-Sorter2 employs two methods differing by how they perform clustering to infer model parameters: one uses robust variational Bayes (RVB) and the other uses robust Expectation-Maximization (REM) for Student’s 𝑡-mixture modeling. The M-Sorter2 is thus a significantly improved approach to sorting as an automatic procedure.
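For context, a simple amplitude-threshold spike detector with a robust (median-based) noise estimate is sketched below. This is a common baseline, not the wavelet-based MCWC detector that M-Sorter2 employs; the parameter choices are assumptions:

```python
import numpy as np

def detect_spikes(signal, k=4.0, refractory=30):
    """Amplitude-threshold spike detection with a robust noise estimate.

    The noise standard deviation is estimated as median(|signal|) / 0.6745
    (the median absolute deviation of Gaussian noise); samples exceeding
    k * sigma are flagged, with a refractory window (in samples) to
    suppress duplicate detections of the same spike.
    """
    sigma = np.median(np.abs(signal)) / 0.6745
    threshold = k * sigma
    spikes, last = [], -refractory
    for t, x in enumerate(np.abs(signal)):
        if x > threshold and t - last >= refractory:
            spikes.append(t)
            last = t
    return spikes
```

Detected spike times would then be windowed into waveforms and passed to a clustering stage, which is where M-Sorter2's RVB and REM mixture-model methods operate.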

M-Sorter2 was evaluated and benchmarked against popular algorithms using simulated, artificial, and real datasets with ground truth that are openly available to researchers. Simulated datasets with known statistical distributions were first used to illustrate how the clustering algorithms, REMHAM and RVBHAM, provide robust clustering results under commonly encountered performance-degrading conditions, such as random initialization of parameters, high dimensionality of data, low signal-to-noise ratio (SNR), ambiguous clusters, and asymmetry in cluster sizes. On the artificial dataset from single-channel recordings, the proposed sorter outperformed Wave_Clus, Plexon's Offline Sorter, and Klusta in most comparison cases. On the real datasets from multi-channel electrodes (tetrodes and polytrodes), the proposed sorter outperformed all comparison algorithms in terms of false positive and false negative rates. The software package presented in this dissertation is available for open access.
ContributorsMa, Weichao (Author) / Si, Jennie (Thesis advisor) / Papandreou-Suppappola, Antonia (Committee member) / He, Jingrui (Committee member) / Helms Tillery, Stephen (Committee member) / Arizona State University (Publisher)
Created2019
Description
This paper considers what factors influence student interest, motivation, and continued engagement. Studies have shown that anticipated extrinsic rewards for participating in an activity reduce the intrinsic value of that activity. This suggests that grade point average (GPA) may have a similar effect on academic interests. Further, when incentives such as scholarships, internships, and careers are GPA-oriented, students must adopt performance goals in courses to guarantee success. However, performance goals have not been shown to correlate with continued interest in a topic. Current literature proposes that student involvement in extracurricular activities, focused study groups, and mentored research is crucial to student success. Further, students may express either a fixed or growth mindset, which influences their approach to challenges and opportunities for growth. The purpose of this study was to collect individual cases of students' experiences in college. The interview method was chosen to collect complex information that could not be gathered from standard surveys. To accomplish this, questions were developed based on content areas related to education and motivation theory. The content areas included activities and meaning, motivation, vision, and personal development. The interview method relied on broad questions followed by specific "probing" questions, which we hypothesized would result in participant-led discussions and unique narratives from each participant. Initial findings suggest that some of the questions were effective in eliciting detailed responses, though results depended on the interviewer. From the interviews we find that students value their group involvements, leadership opportunities, and relationships with mentors, which parallels results found in other studies.
ContributorsAbrams, Sara (Author) / Hartwell, Lee (Thesis director) / Correa, Kevin (Committee member) / Department of Psychology (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created2018-05
Description
Many factors are at play within the genome of an organism, contributing to much of the diversity and variation across the tree of life. While the genome is generally encoded by four nucleotides, A, C, T, and G, this code can be expanded. One particular mechanism that we examine in this thesis is modification of bases—more specifically, methylation of Adenine (m6A) within the GATC motif of Escherichia coli. These methylated adenines are especially important in a process called methyl-directed mismatch repair (MMR), a pathway responsible for repairing errors in the DNA sequence produced by replication. In this pathway, methylated adenines identify the parent strand and direct the repair proteins to correct the erroneous base in the daughter strand. While the primary role of methylated adenines at GATC sites is to direct the MMR pathway, this methylation has also been found to affect other processes, such as gene expression, the activity of transposable elements, and the timing of DNA replication. However, in the absence of MMR, the ability of these other processes to maintain adenine methylation and its targets is unknown.
To determine whether disruption of the MMR pathway results in reduced conservation of methylated adenines, as well as an increased tolerance for mutations that cause the loss or gain of GATC sites, we surveyed individual clones isolated from experimentally evolving wild-type and MMR-deficient (mutL-, conferring a 150-fold increase in mutation rate) populations of E. coli with whole-genome sequencing. Initial analysis revealed a lack of mutations affecting methylation sites (GATC tetranucleotides) in wild-type clones. However, the inherently low mutation rates conferred by the wild-type background render this result inconclusive, due to a lack of statistical power, and reveal the need for a more direct measure of changes in methylation status. Thus, as a first step toward comparative methylomics, we benchmarked four different methylation-calling pipelines on three biological replicates of the wild-type progenitor strain for our evolved populations.
While it is understood that these methylated sites play a role in the MMR pathway, the full extent of their effect on the genome is not fully understood. Thus, the goal of this thesis was to better understand the forces that maintain the genome, specifically concerning m6A within the GATC motif.
ContributorsBoyer, Gwyneth (Author) / Lynch, Michael (Thesis director) / Behringer, Megan (Committee member) / Geiler-Samerotte, Kerry (Committee member) / School of Life Sciences (Contributor) / Department of Psychology (Contributor) / Barrett, The Honors College (Contributor)
Created2020-05
Description
This paper emphasizes how vital prosthetic devices are as tools for both congenital and acquired amputees to maximize this population's level of societal productivity. However, several issues with the prosthetic industry's current technological focus create unnecessary hurdles that amputees must overcome before they can truly benefit from these tools. The first major issue is that these devices are not readily available to all amputees. The astronomical cost of most prosthetic devices prevents low-income amputees from obtaining these vital tools regardless of their level of need, highlighting the fact that amputees who are not financially stable are not supported in a way that is conducive to their success. Cost also disproportionately affects children with a missing appendage, who constantly need replacement prostheses as they grow and develop. Another issue is that the industry focuses on acquired amputees, because this population is much larger than that of congenital amputees and thus more lucrative; congenital amputees' particular needs are often entirely ignored in prosthetic innovation. Finally, low daily utilization is a major issue among the amputee population: several aspects of prosthetic devices, such as difficulty of use and lack of comfort, lead many amputees to decide against using these tools.
This paper proposes solutions to the problems of cost, discrimination, development priorities, and daily utilization: lowering cost through alternative designs and materials, shifting technological development toward the entire amputee population rather than only the most lucrative group, and advancing designs in ways that promote daily use. These changes would provide the greatest level of societal support, so that the amputee population as a whole can maximize its productivity and overcome the hardships introduced into their lives by a missing appendage.
ContributorsO'Connor, Casey Charles (Author) / Popova, Laura (Thesis director) / Graff, Sarah (Committee member) / Department of Psychology (Contributor) / School of Social Work (Contributor) / Barrett, The Honors College (Contributor)
Created2018-05