Matching Items (16)
Description
With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus, knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also study variable selection for matched data sets and propose a solution for when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exist in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables.
Contributors: Koh, Derek (Author) / Runger, George C. (Thesis advisor) / Wu, Tong (Committee member) / Pan, Rong (Committee member) / Cesta, John (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Technological advances have enabled the generation and collection of various data from complex systems, thus creating ample opportunity to integrate knowledge in many decision-making applications. This dissertation introduces holistic learning as the integration of a comprehensive set of relationships that are used towards the learning objective. The holistic view of the problem allows for richer learning from data and thereby improves decision making.

The first topic of this dissertation is the prediction of several target attributes using a common set of predictor attributes. In a holistic learning approach, the relationships between target attributes are embedded into the learning algorithm created in this dissertation. Specifically, a novel tree-based ensemble is proposed that leverages the relationships between target attributes to construct a diverse, yet strong, model. The method is justified through its connection to existing methods and experimental evaluations on synthetic and real data.

The second topic pertains to monitoring complex systems that are modeled as networks. Such systems present a rich set of attributes and relationships for which holistic learning is important. In social networks, for example, in addition to friendship ties, various attributes concerning the users' gender, age, topic of messages, time of messages, etc. are collected. A restricted form of monitoring fails to take the relationships of multiple attributes into account, whereas the holistic view embeds such relationships in the monitoring methods. The focus is on the difficult task of detecting a change that might only impact a small subset of the network and only occur in a sub-region of the high-dimensional space of the network attributes. One contribution is a monitoring algorithm based on a network statistical model. Another contribution is a transactional model that transforms the task into an expedient structure for machine learning, along with a generalizable algorithm to monitor the attributed network. A learning step in this algorithm adapts to changes that may only be local to sub-regions (with a broader potential for other learning tasks). Diagnostic tools to interpret the change are provided. This robust, generalizable, holistic monitoring method is demonstrated on synthetic and real networks.
Contributors: Azarnoush, Bahareh (Author) / Runger, George C. (Thesis advisor) / Bekki, Jennifer (Thesis advisor) / Pan, Rong (Committee member) / Saghafian, Soroush (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
Today's competitive markets force companies to constantly engage in the complex task of managing their demand. In make-to-order manufacturing or service systems, the demand for a product is shaped by price and lead times, where high price and lead time quotes ensure profitability for the supplier but discourage customers from placing orders. Low price and lead times, on the other hand, generally result in high demand but do not necessarily ensure profitability. The price and lead time quotation problem considers the trade-off between offering high and low prices and lead times. Recent practices in make-to-order manufacturing companies reveal the importance of dynamic quotation strategies, under which the price and lead time quotes flexibly change depending on the status of the system. In this dissertation, the objective is to model a make-to-order manufacturing system and explore various aspects of dynamic quotation strategies such as the behavior of optimal price and lead time decisions, the impact of customer preferences on optimal decisions, the benefits of employing dynamic quotation in comparison to simpler quotation strategies, and the benefits of coordinating price and lead time decisions. I first consider a manufacturer that receives demand from spot purchasers (who are quoted dynamic price and lead times), as well as from contract customers who have agreements with the manufacturer with fixed price and lead time terms. I analyze how customer preferences affect the optimal price and lead time decisions, the benefits of dynamic quotation, and the optimal mix of spot purchasers and contract customers. These analyses necessitate the computation of the expected tardiness of customer orders at the moment a customer enters the system. Hence, in the second part of the dissertation, I develop methodologies to compute the expected tardiness in multi-class priority queues. For the trivial single-class case, a closed-form expression is obtained. For the more complex multi-class case, numerical inverse Laplace transformation algorithms are developed. In the last part of the dissertation, I model a decentralized system with two components. The marketing department determines the price quotes with the objective of maximizing revenues, and the manufacturing department determines the lead time quotes to minimize lateness costs. I discuss the benefits of coordinating price and lead time decisions and develop an incentivization scheme to reduce the negative impacts of a lack of coordination.
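As an illustration of what a closed-form expected tardiness can look like in the single-class case, the sketch below assumes a first-come-first-served queue with one server and exponential service times; these assumptions, and the names `erlang_tail` and `expected_tardiness`, are not taken from the abstract. Under them, an order that finds n jobs ahead of it has an Erlang(n + 1) sojourn time W, and E[max(W - d, 0)] for a quoted lead time d follows from two Erlang tail probabilities.

```python
import math


def erlang_tail(stages: int, rate: float, t: float) -> float:
    """P(X > t) for X ~ Erlang(stages, rate)."""
    return math.exp(-rate * t) * sum(
        (rate * t) ** i / math.factorial(i) for i in range(stages)
    )


def expected_tardiness(jobs_ahead: int, service_rate: float, lead_time: float) -> float:
    """E[max(W - d, 0)] when W ~ Erlang(jobs_ahead + 1, service_rate).

    Assumed model (not from the abstract): single server, FCFS,
    exponential service, so the arriving order waits for jobs_ahead
    services plus its own.
    """
    if jobs_ahead < 0 or service_rate <= 0 or lead_time < 0:
        raise ValueError("need jobs_ahead >= 0, service_rate > 0, lead_time >= 0")
    k = jobs_ahead + 1
    # Identity: x * f_k(x) = (k / rate) * f_{k+1}(x) for Erlang densities.
    return (k / service_rate) * erlang_tail(k + 1, service_rate, lead_time) \
        - lead_time * erlang_tail(k, service_rate, lead_time)
```

With a quoted lead time of zero the function returns the mean sojourn time, (n + 1) divided by the service rate, and the value shrinks as the quote grows. The multi-class priority case described in the abstract has no comparably simple form, which is consistent with the numerical transform inversion mentioned there.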
Contributors: Hafizoglu, Ahmet Baykal (Author) / Gel, Esma S (Thesis advisor) / Villalobos, Jesus R (Committee member) / Mirchandani, Pitu (Committee member) / Keskinocak, Pinar (Committee member) / Runger, George C. (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
Temporal data are increasingly prevalent and important in analytics. Time series (TS) data are chronological sequences of observations and an important class of temporal data. Fields such as medicine, finance, learning science and multimedia naturally generate TS data. Each series provides a high-dimensional data vector that challenges the learning of the relevant patterns. This dissertation proposes TS representations and methods for supervised TS analysis. The approaches combine new representations that handle translations and dilations of patterns with bag-of-features strategies and tree-based ensemble learning. This provides flexibility in handling time-warped patterns in a computationally efficient way. The ensemble learners provide a classification framework that can handle high-dimensional feature spaces, multiple classes and interaction between features. The proposed representations are useful for classification and interpretation of TS data of varying complexity. The first contribution handles the problem of time warping with a feature-based approach. An interval selection and local feature extraction strategy is proposed to learn a bag-of-features representation. This is distinctly different from common similarity-based time warping and allows additional features (such as pattern location) to be easily integrated into the models. The learners have the capability to account for the temporal information through the recursive partitioning method. The second contribution focuses on the comprehensibility of the models. A new representation is integrated with local feature importance measures from tree-based ensembles to diagnose and interpret time intervals that are important to the model. Multivariate time series (MTS) are especially challenging because the input consists of a collection of TS, and both features within TS and interactions between TS can be important to models. Another contribution uses a different representation to produce computationally efficient strategies that learn a symbolic representation for MTS. Relationships between the multiple TS, nominal and missing values are handled with tree-based learners. Applications such as speech recognition, medical diagnosis and gesture recognition are used to illustrate the methods. Experimental results show that the TS representations and methods provide better results than competing methods on a comprehensive collection of benchmark datasets. Moreover, the proposed approaches naturally provide solutions to similarity analysis, predictive pattern discovery and feature selection.
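The interval selection and local feature extraction step can be pictured with a small sketch. The specific features (mean, standard deviation, least-squares slope), the fixed non-overlapping intervals, and the function names below are illustrative assumptions; the abstract does not state which features or which interval scheme the dissertation uses.

```python
import statistics


def fixed_intervals(length: int, width: int) -> list[tuple[int, int]]:
    """Non-overlapping [start, end) intervals of equal width (a simplification)."""
    return [(s, s + width) for s in range(0, length - width + 1, width)]


def interval_features(series: list[float], start: int, end: int) -> dict:
    """Summarize series[start:end] by location, mean, spread and trend."""
    window = series[start:end]
    n = len(window)
    if n < 2:
        raise ValueError("an interval needs at least two observations")
    mean = statistics.fmean(window)
    t_mean = (n - 1) / 2
    # Ordinary least-squares slope of value against time index.
    slope = sum((t - t_mean) * (v - mean) for t, v in enumerate(window)) / sum(
        (t - t_mean) ** 2 for t in range(n)
    )
    return {
        "start": start,  # pattern location kept as an ordinary feature
        "mean": mean,
        "std": statistics.pstdev(window),
        "slope": slope,
    }


series = [0, 1, 2, 3, 10, 10, 10, 10]
bag = [interval_features(series, s, e) for s, e in fixed_intervals(len(series), 4)]
```

Each series becomes a bag of such feature rows, which a tree-based learner can consume directly; keeping `start` as a feature is one way to let the learner use pattern location, as the abstract notes.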
Contributors: Baydogan, Mustafa Gokce (Author) / Runger, George C. (Thesis advisor) / Atkinson, Robert (Committee member) / Gel, Esma (Committee member) / Pan, Rong (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
Functional or dynamic responses are prevalent in experiments in the fields of engineering, medicine, and the sciences, but proposals for optimal designs are still sparse for this type of response. Experiments with dynamic responses result in multiple responses taken over a spectrum variable, so the design matrix for a dynamic response has a more complicated structure. In the literature, the optimal design problem for some functional responses has been solved using genetic algorithms (GA) and approximate design methods. The goal of this dissertation is to develop fast computer algorithms for calculating exact D-optimal designs.

First, we demonstrated how the traditional exchange methods could be improved to generate a computationally efficient algorithm for finding G-optimal designs. The proposed two-stage algorithm, which is called the cCEA, uses a clustering-based approach to restrict the set of possible candidates for the point-exchange algorithm (PEA), and then improves the G-efficiency using the coordinate-exchange algorithm (CEA).

The second major contribution of this dissertation is the development of fast algorithms for constructing D-optimal designs that determine the optimal sequence of stimuli in fMRI studies. The update formula for the determinant of the information matrix was improved by exploiting the sparseness of the information matrix, leading to faster computation times. The proposed algorithm outperforms the genetic algorithm with respect to computational efficiency and D-efficiency.

The third contribution is a study of optimal experimental designs for more general functional response models. First, the B-spline system is proposed as the non-parametric smoother of the response function, and an algorithm is developed to determine D-optimal sampling points of a spectrum variable. Second, we proposed a two-step algorithm for finding the optimal design for both sampling points and experimental settings. In the first step, the matrix of experimental settings is held fixed while the algorithm optimizes the determinant of the information matrix for a mixed effects model to find the optimal sampling times. In the second step, the optimal sampling times obtained from the first step are held fixed while the algorithm iterates on the information matrix to find the optimal experimental settings. The designs constructed by this approach yield superior performance over other designs found in the literature.
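For context on the determinant update mentioned under the second contribution: exchange-type searches for D-optimal designs typically avoid recomputing det(M) from scratch by using the standard rank-one identities det(M + xxᵀ) = det(M)(1 + xᵀM⁻¹x) and the Sherman-Morrison inverse update. The sketch below shows only these textbook identities for a symmetric information matrix; the sparsity-exploiting formula developed in the dissertation is not given in the abstract, and the function name and the identity starting matrix are assumptions made for illustration.

```python
def add_design_point(det, inv, x):
    """Return (det, inverse) of M + x x^T from det(M) and M^{-1}.

    Assumes M is symmetric positive definite, so M^{-1} x x^T M^{-1}
    equals u u^T with u = M^{-1} x.
    """
    n = len(x)
    u = [sum(inv[i][j] * x[j] for j in range(n)) for i in range(n)]
    gain = 1.0 + sum(x[i] * u[i] for i in range(n))  # 1 + x^T M^{-1} x
    new_inv = [[inv[i][j] - u[i] * u[j] / gain for j in range(n)] for i in range(n)]
    return det * gain, new_inv


# Start from the 2x2 identity (an illustrative choice) and add two design points.
det, inv = 1.0, [[1.0, 0.0], [0.0, 1.0]]
for point in ([1.0, 2.0], [1.0, 0.0]):
    det, inv = add_design_point(det, inv, point)
```

After both updates the information matrix is [[3, 2], [2, 5]], whose determinant is 11, and each update costs on the order of n² operations rather than the n³ of a fresh factorization. The `gain` factor is also the quantity an exchange step would compare across candidate points.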
Contributors: Saleh, Moein (Author) / Pan, Rong (Thesis advisor) / Montgomery, Douglas C. (Committee member) / Runger, George C. (Committee member) / Kao, Ming-Hung (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
Recent technological advances enable the collection of various complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of high-dimensional biomedical data creates a need for new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover patterns and improve decision making. All the proposed methods can generalize to other industrial fields.

The first topic of this dissertation focuses on data clustering. Data clustering is often the first step in analyzing a dataset without label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging, yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.

The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.

The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both accuracy and interpretability. GCRNN contains a convolutional network component to extract high-level features, and a recurrent network component to enhance the modeling of the temporal characteristics. A feed-forward fully connected network with sparse group lasso regularization is used to generate the final classification and provide good interpretability.

The last topic centers on dimensionality reduction methods for time series data. A good dimensionality reduction method is important for storage, decision making and pattern visualization of time series data. The CRNN autoencoder is proposed to not only achieve low reconstruction error, but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.
Contributors: Lin, Sangdi (Author) / Runger, George C. (Thesis advisor) / Kocher, Jean-Pierre A (Committee member) / Pan, Rong (Committee member) / Escobedo, Adolfo R. (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
The complexity of supply chains (SC) has grown rapidly in recent years, resulting in increased difficulty in evaluating and visualizing performance. Consequently, analytical approaches that evaluate SC performance in near real time relative to targets and plans are important for detecting and reacting to deviations in order to prevent major disruptions.

Manufacturing anomalies, inaccurate forecasts, and other problems can lead to SC disruptions. Traditional monitoring methods are not sufficient in this respect, because complex SCs feature changes in manufacturing tasks (dynamic complexity) and carry a large number of stock keeping units (detail complexity). Problems are easily confounded with normal system variations.

Motivated by these real challenges faced by modern SCs, new surveillance solutions are proposed to detect system deviations that could lead to disruptions in a complex SC. To address supply-side deviations, the fitness of different statistics that can be extracted from the enterprise resource planning system is evaluated. A monitoring strategy is first proposed for SCs featuring high levels of dynamic complexity. This presents an opportunity for monitoring methods to be applied in a new, rich domain of SC management. Then a monitoring strategy called Heat Map Contrasts (HMC), which converts monitoring into a series of classification problems, is used to monitor SCs with high levels of both dynamic and detail complexity. Data from a semiconductor SC simulator are used to compare the methods with other alternatives under various failure cases, and the results illustrate the viability of our methods.

To address demand-side deviations, a new method of quantifying forecast uncertainties using the progression of forecast updates is presented. It is illustrated that a rich amount of information is available in rolling horizon forecasts. Two proactive indicators of future forecast errors are extracted from the forecast stream. This quantitative method requires no knowledge of the forecasting model itself and has shown promising results when applied to two datasets consisting of real forecast updates.
Contributors: Liu, Lei (Author) / Runger, George C. (Thesis advisor) / Gel, Esma (Committee member) / Pan, Rong (Committee member) / Janakiram, Mani (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
An anomaly is a deviation from the normal behavior of a system, and anomaly detection techniques try to identify unusual instances based on their deviation from the normal data. In this work, I propose a machine-learning algorithm, referred to as Artificial Contrasts, for anomaly detection in categorical data in which neither the dimension, the specific attributes involved, nor the form of the pattern is known a priori. I use the Random Forest (RF) technique as an effective learner for Artificial Contrasts. RF is a powerful algorithm that can handle relations among attributes in high-dimensional data and detect anomalies while providing probability estimates for risk decisions.

I apply the model to two simulated data sets and one real data set. The model was able to detect anomalies with very high accuracy. Finally, by comparing the proposed model with other models in the literature, I demonstrate the superior performance of the proposed model.
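One common way to set up an artificial-contrast problem is sketched below, as an assumption rather than a description of this thesis: the observed categorical records are labeled class 0, an equally sized artificial set is drawn by sampling each attribute independently from its observed values and labeled class 1, and a classifier such as a random forest is then trained to separate the two. The abstract does not say how the artificial data are generated, so the independent-marginals scheme and the name `make_contrast_training_set` are hypothetical.

```python
import random


def make_contrast_training_set(real_rows, seed=0):
    """Build (rows, labels) for a real-versus-artificial classification task.

    Artificial rows sample every attribute independently from its
    empirical marginal, which breaks any dependence between attributes
    (an assumed reference distribution, not taken from the abstract).
    """
    rng = random.Random(seed)
    columns = list(zip(*real_rows))  # one tuple of observed values per attribute
    artificial = [tuple(rng.choice(col) for col in columns) for _ in real_rows]
    rows = list(real_rows) + artificial
    labels = [0] * len(real_rows) + [1] * len(artificial)
    return rows, labels


real = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
rows, labels = make_contrast_training_set(real)
```

A learner fitted to `rows` and `labels` then yields, for any new record, an estimated probability of belonging to the artificial class; records that look more like the structureless contrast than like the real data are candidates for anomalies. The learner itself is omitted here to keep the sketch dependency-free.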
Contributors: Mousavi, Seyyedehnasim (Author) / Runger, George C. (Thesis advisor) / Wu, Teresa (Committee member) / Kim, Sunghoon (Committee member) / Arizona State University (Publisher)
Created: 2016
Description
This dissertation proposes a new set of analytical methods for high dimensional physiological sensors. The methodologies developed in this work were motivated by problems in learning science, but also apply to numerous disciplines where high dimensional signals are present. In the education field, more data is now available from traditional sources and there is an important need for analytical methods to translate this data into improved learning. Affective computing, the study of techniques for developing systems that recognize and model human emotions, integrates different physiological signals such as electroencephalogram (EEG) and electromyogram (EMG) to detect and model emotions, which can later be used to improve these learning systems.

The first contribution proposes an event-crossover (ECO) methodology to analyze performance in learning environments. The methodology is relevant to studies where it is desired to evaluate the relationships between sentinel events in a learning environment and a physiological measurement which is provided in real time.

The second contribution introduces analytical methods to study relationships between multi-dimensional physiological signals and sentinel events in a learning environment. The proposed methodology learns physiological patterns in the form of node activations near the time of events using different statistical techniques.

The third contribution addresses the challenge of performance prediction from physiological signals. Features from the sensors which could be computed early in the learning activity were developed for input to a machine learning model. The objective is to predict success or failure of the student in the learning environment early in the activity. EEG was used as the physiological signal to train a pattern recognition algorithm in order to derive meta affective states.

The last contribution introduces a methodology to predict a learner's performance using Bayes Belief Networks (BBNs). Posterior probabilities of latent nodes were used as inputs to a predictive model in real time as evidence was accumulated in the BBN.

The methodology was applied to data streams from a video game and from a Damage Control Simulator, which were used to predict and quantify performance. The proposed methods provide cognitive scientists with new tools to analyze subjects in learning environments.
Contributors: Lujan Moreno, Gustavo A. (Author) / Runger, George C. (Thesis advisor) / Atkinson, Robert K (Thesis advisor) / Montgomery, Douglas C. (Committee member) / Villalobos, Rene (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Public health surveillance is a special case of the general problem where counts (or rates) of events are monitored for changes. Modern data complements event counts with many additional measurements (such as geographic, demographic, and others) that comprise high-dimensional covariates. This leads to an important challenge: detecting a change that only occurs within a region, initially unspecified, defined by these covariates. Current methods are typically limited to spatial and/or temporal covariate information and often fail to use all the information available in modern data that can be paramount in unveiling these subtle changes. Additional complexities associated with modern health data that are often not accounted for by traditional methods include covariates of mixed type, missing values, and high-order interactions among covariates. This work proposes a transformation of public health surveillance into a supervised learning problem, so that an appropriate learner can inherently address all the complexities described previously. At the same time, quantitative measures from the learner can be used to define signal criteria to detect changes in rates of events. A Feature Selection (FS) method is used to identify covariates that contribute to a model and to generate a signal. A measure of statistical significance is included to control false alarms. An alternative Percentile method identifies the specific cases that lead to changes using class probability estimates from tree-based ensembles. This second method is intended to be less computationally intensive and significantly simpler to implement. Finally, a third method labeled Rule-Based Feature Value Selection (RBFVS) is proposed for identifying the specific regions in high-dimensional space where the changes are occurring. Results on simulated examples are used to compare the FS method and the Percentile method. Note that this work emphasizes the application of the proposed methods to public health surveillance. Nonetheless, these methods can easily be extended to a variety of applications where counts (or rates) of events are monitored for changes. Such problems commonly occur in domains such as manufacturing, economics, environmental systems, engineering, as well as in public health.
Contributors: Davila, Saylisse (Author) / Runger, George C. (Thesis advisor) / Montgomery, Douglas C. (Committee member) / Young, Dennis (Committee member) / Gel, Esma (Committee member) / Arizona State University (Publisher)
Created: 2010