Matching Items (6)
Description
With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus, knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also study variable selection for matched data sets and propose a solution for when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exist in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables.
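The abstract does not define the aSVM objective, so the following is only a generic illustration of what an asymmetric loss can look like: a cost-weighted hinge loss in which margin violations on one class are penalized more heavily than on the other. The function name `asymmetric_hinge` and the cost values are hypothetical and are not taken from the dissertation.

```python
def asymmetric_hinge(y, score, cost_pos=1.0, cost_neg=5.0):
    """Cost-weighted hinge loss for one example (illustrative only).

    y is the true label in {-1, +1}; score is the classifier's raw output.
    cost_neg > cost_pos makes margin violations on negative examples
    (potential false positives) more expensive, which pushes a learner
    toward higher precision on the positive class.
    """
    margin_loss = max(0.0, 1.0 - y * score)
    weight = cost_pos if y == 1 else cost_neg
    return weight * margin_loss


# An uncertain score of 0.0 costs five times more on a true negative.
loss_on_positive = asymmetric_hinge(+1, 0.0)
loss_on_negative = asymmetric_hinge(-1, 0.0)
```

With equal costs this reduces to the ordinary hinge loss used by a regular SVM.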
Contributors: Koh, Derek (Author) / Runger, George C. (Thesis advisor) / Wu, Tong (Committee member) / Pan, Rong (Committee member) / Cesta, John (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
This thesis develops a low-investment marketing strategy that allows low-to-mid level farmers to extend their commercialization reach by strategically sending containers of fresh produce items to secondary markets that present temporary arbitrage opportunities. The methodology aims at identifying time windows of opportunity in which the price differential between two markets creates an arbitrage opportunity for a transaction; a transaction involves buying a fresh produce item at a base market, and then shipping and selling it at the secondary market price. A decision-making tool is developed that gauges the individual arbitrage opportunities and determines the specific price differential (or threshold level) that is most beneficial to the farmer under particular market conditions. For this purpose, two approaches are developed: a pragmatic approach that uses historical price information of the products to find the optimal price differential that maximizes earnings, and a theoretical one, which optimizes an expected profit model of the shipments to identify this optimal threshold. This thesis also develops risk management strategies that further reduce profit variability during a particular two-market transaction. In this case, financial engineering concepts are used to determine a shipment configuration strategy that minimizes the overall variability of the profits. For this, a Markowitz model is developed to determine the weight assigned to each component of a particular shipment. Based on the results of the analysis, it is deemed possible to formulate a shipment policy that not only increases the farmer's commercialization reach, but also produces profitable operations. In general, the observed rates of return under the pragmatic and theoretical approaches ranged between 0.072 and 0.616 within important two-market structures.
Secondly, it is demonstrated that the level of return and risk can be manipulated by varying the strictness of the shipping policy to meet the overall objectives of the decision-maker. Finally, it was found that one can minimize the risk of a particular two-market transaction by strategically grouping the product shipments.
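The pragmatic approach described above (searching historical prices for the price differential that maximizes earnings) might be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's tool: the function name `best_threshold`, the fixed per-unit shipping cost, and the fixed transit lag between buying and selling are all hypothetical.

```python
def best_threshold(base, secondary, ship_cost, lag, candidates):
    """Pick the price-differential threshold with the highest historical profit.

    base, secondary: per-period prices at the base and secondary markets.
    ship_cost: assumed fixed per-unit shipping cost.
    lag: assumed number of periods between buying and selling.
    candidates: threshold levels to evaluate.
    Returns (threshold, total_profit).
    """
    best = None
    for threshold in candidates:
        profit = 0.0
        for t in range(len(base) - lag):
            # Ship only when today's observed differential clears the threshold.
            if secondary[t] - base[t] >= threshold:
                # Profit is realized at the price in force on arrival.
                profit += secondary[t + lag] - base[t] - ship_cost
        if best is None or profit > best[1]:
            best = (threshold, profit)
    return best


# Toy price histories (made up for illustration).
threshold, profit = best_threshold(
    base=[10, 10, 10, 10],
    secondary=[12, 15, 11, 14],
    ship_cost=2,
    lag=1,
    candidates=[1, 3, 5],
)
```

Raising the threshold makes the policy stricter: fewer shipments are sent, which is one way to trade expected return against risk.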
Contributors: Flores, Hector M (Author) / Villalobos, Rene (Thesis advisor) / Runger, George C. (Committee member) / Maltz, Arnold (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
This dissertation transforms a set of system complexity reduction problems to feature selection problems. Three systems are considered: classification based on association rules, network structure learning, and time series classification. Furthermore, two variable importance measures are proposed to reduce the feature selection bias in tree models. Associative classifiers can achieve high accuracy, but the combination of many rules is difficult to interpret. Rule condition subset selection (RCSS) methods for associative classification are considered. RCSS aims to prune the rule conditions into a subset via feature selection. The subset then can be summarized into rule-based classifiers. Experiments show that classifiers after RCSS can substantially improve the classification interpretability without loss of accuracy. An ensemble feature selection method is proposed to learn Markov blankets for either discrete or continuous networks (without linear, Gaussian assumptions). The method is compared to a Bayesian local structure learning algorithm and to alternative feature selection methods in the causal structure learning problem. Feature selection is also used to enhance the interpretability of time series classification. Existing time series classification algorithms (such as nearest-neighbor with dynamic time warping measures) are accurate but difficult to interpret. This research leverages the time-ordering of the data to extract features, and generates an effective and efficient classifier referred to as a time series forest (TSF). The computational complexity of TSF is only linear in the length of time series, and interpretable features can be extracted. These features can be further reduced, and summarized for even better interpretability. Lastly, two variable importance measures are proposed to reduce the feature selection bias in tree-based ensemble models. It is well known that bias can occur when predictor attributes have different numbers of values. 
Two methods are proposed to solve the bias problem. One uses an out-of-bag sampling method called OOBForest, and the other, based on the new concept of a partial permutation test, is called a pForest. Experimental results show the existing methods are not always reliable for multi-valued predictors, while the proposed methods have advantages.
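The abstract says only that the time series forest extracts features by exploiting the time-ordering of the data. As a hedged illustration of that idea, the sketch below computes simple summary features (mean, standard deviation, and least-squares slope) over one interval of a series. The choice of these three features and the name `interval_features` are assumptions, not details confirmed by the abstract.

```python
from statistics import mean, pstdev


def interval_features(series, start, end):
    """Summarize series[start:end] with mean, std, and slope (illustrative).

    The slope is the least-squares slope of the values against their
    position in the window, so it captures the local trend. The cost is
    linear in the window length.
    """
    window = series[start:end]
    n = len(window)
    x_bar = (n - 1) / 2          # mean of positions 0..n-1
    y_bar = mean(window)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(window))
    slope = sxy / sxx if sxx else 0.0
    return {"mean": y_bar, "std": pstdev(window), "slope": slope}


features = interval_features([1, 2, 3, 4, 10], 0, 4)
```

Features of this kind stay tied to a named interval of the series, which is what makes them easier to interpret than a distance to a nearest neighbor.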
Contributors: Deng, Houtao (Author) / Runger, George C. (Thesis advisor) / Lohr, Sharon L (Committee member) / Pan, Rong (Committee member) / Zhang, Muhong (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
One necessary condition for the two-pass risk premium estimator to be consistent and asymptotically normal is that the beta matrix in a proposed linear asset pricing model has full column rank. I first investigate the asymptotic properties of the risk premium estimators and the related t-test and Wald test statistics when the full rank condition fails. I show that the beta risks of useless factors, or of multiple proxy factors for a true factor, are priced more often than they should be at the nominal size in asset pricing models that omit some true factors. Under the null hypothesis that the risk premiums of the true factors are equal to zero, by contrast, the beta risks of the true factors are priced less often than the nominal size. The simulation results are consistent with the theoretical findings. Hence, factor selection in a proposed factor model should not be based solely on estimated risk premiums. In response to this problem, I propose an alternative estimation of the underlying factor structure. Specifically, I propose to use the linear combination of factors weighted by the eigenvectors of the inner product of the estimated beta matrix. I further propose a new method to estimate the rank of the beta matrix in a factor model. For this method, the idiosyncratic components of asset returns are allowed to be correlated both over different cross-sectional units and over different time periods. The estimator I propose is easy to use because it is computed with the eigenvalues of the inner product of an estimated beta matrix. Simulation results show that the proposed method works well even in small samples. The analysis of US individual stock returns suggests that there are six common risk factors in US individual stock returns among the thirteen factor candidates used. The analysis of portfolio returns reveals that the estimated number of common factors changes depending on how the portfolios are constructed.
The number of risk sources found from the analysis of portfolio returns is generally smaller than the number found in individual stock returns.
Contributors: Wang, Na (Author) / Ahn, Seung C. (Thesis advisor) / Kallberg, Jarl G. (Committee member) / Liu, Crocker H. (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
Random forest (RF) is a popular and powerful technique. It can be used for classification, regression, and unsupervised clustering. In its original form introduced by Leo Breiman, RF is used as a predictive model to generate predictions for new observations. Recent research has proposed several methods based on RF for feature selection and for generating prediction intervals. However, they are limited in their applicability and accuracy. In this dissertation, RF is applied to build a predictive model for a complex dataset and is used as the basis for two novel methods for biomarker discovery and for generating prediction intervals.

First, a biodosimetry model is developed using RF to determine absorbed radiation dose from gene expression measured in blood samples of potentially exposed individuals. To improve the prediction accuracy of the biodosimetry model, day-specific models were built to handle the day interaction effect, and a nested modeling technique was proposed. The nested models can fit these complex data, which show large variability and non-linear relationships.

Second, a panel of biomarkers was selected using a data-driven feature selection method as well as hand-picking, considering prior knowledge and other constraints. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. This method can incorporate domain knowledge as a penalty term to regulate the selection of candidate features in RF. It adds more flexibility to data-driven feature selection and can improve the interpretability of models. Know-GRRF showed significant improvement in cross-species prediction when cross-species correlation was used to guide the selection of biomarkers. On simulated datasets, the method can also compete with existing methods when intrinsic data characteristics are used as an alternative to domain knowledge.

Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. This method is applicable to any predictive model and was shown to have better coverage and precision than existing methods on the real-world radiation dataset, as well as on benchmark and simulated datasets.
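The abstract does not describe how RFerr builds its intervals. As a generic illustration of a non-parametric prediction interval that works with any predictive model, the sketch below shifts a point prediction by empirical quantiles of held-out residuals (observed minus predicted). The function name `residual_interval` and the simple order-statistic quantile rule are assumptions and should not be read as the RFerr algorithm.

```python
import math


def residual_interval(prediction, residuals, alpha=0.1):
    """Non-parametric interval from empirical residual quantiles (illustrative).

    residuals: held-out errors (observed - predicted) from any model.
    alpha: target miscoverage, so the interval aims for 1 - alpha coverage.
    Uses conservative order statistics: floor for the lower bound and
    ceil for the upper bound.
    """
    errs = sorted(residuals)
    n = len(errs)
    lo_idx = max(0, math.floor((alpha / 2) * (n - 1)))
    hi_idx = min(n - 1, math.ceil((1 - alpha / 2) * (n - 1)))
    return prediction + errs[lo_idx], prediction + errs[hi_idx]


# Toy residuals (made up); a 50% interval around a prediction of 10.
lower, upper = residual_interval(10, [-2, -1, 0, 1, 2], alpha=0.5)
```

Because the bounds come from observed errors rather than a distributional assumption, the interval can be asymmetric around the point prediction.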
Contributors: Guan, Xin (Author) / Liu, Li (Thesis advisor) / Runger, George C. (Thesis advisor) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
Public health surveillance is a special case of the general problem where counts (or rates) of events are monitored for changes. Modern data complements event counts with many additional measurements (such as geographic, demographic, and others) that comprise high-dimensional covariates. This leads to an important challenge: detecting a change that only occurs within a region, initially unspecified, defined by these covariates. Current methods are typically limited to spatial and/or temporal covariate information and often fail to use all the information available in modern data that can be paramount in unveiling these subtle changes. Additional complexities associated with modern health data that are often not accounted for by traditional methods include covariates of mixed type, missing values, and high-order interactions among covariates. This work proposes a transform of public health surveillance to supervised learning, so that an appropriate learner can inherently address all the complexities described previously. At the same time, quantitative measures from the learner can be used to define signal criteria to detect changes in rates of events. A Feature Selection (FS) method is used to identify covariates that contribute to a model and to generate a signal. A measure of statistical significance is included to control false alarms. An alternative Percentile method identifies the specific cases that lead to changes using class probability estimates from tree-based ensembles. This second method is intended to be less computationally intensive and significantly simpler to implement. Finally, a third method labeled Rule-Based Feature Value Selection (RBFVS) is proposed for identifying the specific regions in high-dimensional space where the changes are occurring. Results on simulated examples are used to compare the FS method and the Percentile method. Note that this work emphasizes the application of the proposed methods to public health surveillance.
Nonetheless, these methods can easily be extended to a variety of applications where counts (or rates) of events are monitored for changes. Such problems commonly occur in domains such as manufacturing, economics, environmental systems, engineering, as well as in public health.
Contributors: Davila, Saylisse (Author) / Runger, George C. (Thesis advisor) / Montgomery, Douglas C. (Committee member) / Young, Dennis (Committee member) / Gel, Esma (Committee member) / Arizona State University (Publisher)
Created: 2010