Matching Items (3)
Filtering by

Clear all filters

152189-Thumbnail Image.png
Description
This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’

This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’ test scores as outcome variables and teachers’ contributions as random effects to ascribe changes in student performance to the teachers who have taught them. The VAMs teacher score is the empirical best linear unbiased predictor (EBLUP). This approach is limited by the adequacy of the assumed model specification with respect to the unknown underlying model. In that regard, this study proposes alternative ways to rank teacher effects that are not dependent on a given model by introducing two variable importance measures (VIMs), the node-proportion and the covariate-proportion. These VIMs are novel because they take into account the final configuration of the terminal nodes in the constitutive trees in a random forest. In a simulation study, under a variety of conditions, true rankings of teacher effects are compared with estimated rankings obtained using three sources: the newly proposed VIMs, existing VIMs, and EBLUPs from the assumed linear model specification. The newly proposed VIMs outperform all others in various scenarios where the model was misspecified. The second study develops two novel interaction measures. These measures could be used within but are not restricted to the VAM framework. The distribution-based measure is constructed to identify interactions in a general setting where a model specification is not assumed in advance. In turn, the mean-based measure is built to estimate interactions when the model specification is assumed to be linear. Both measures are unique in their construction; they take into account not only the outcome values, but also the internal structure of the trees in a random forest. In a separate simulation study, under a variety of conditions, the proposed measures are found to identify and estimate second-order interactions.
ContributorsValdivia, Arturo (Author) / Eubank, Randall (Thesis advisor) / Young, Dennis (Committee member) / Reiser, Mark R. (Committee member) / Kao, Ming-Hung (Committee member) / Broatch, Jennifer (Committee member) / Arizona State University (Publisher)
Created2013
151511-Thumbnail Image.png
Description
With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus

With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also studied variable selection of matched data sets and proposed a solution when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exists in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables.
ContributorsKoh, Derek (Author) / Runger, George C. (Thesis advisor) / Wu, Tong (Committee member) / Pan, Rong (Committee member) / Cesta, John (Committee member) / Arizona State University (Publisher)
Created2013
133732-Thumbnail Image.png
Description
As threats to Earth's biodiversity continue to evolve, an effective methodology to predict such threats is crucial to ensure the survival of living species. Organizations like the International Union for Conservation of Nature (IUCN) monitor the Earth's environmental networks to preserve the sanctity of terrestrial and marine life. The IUCN

As threats to Earth's biodiversity continue to evolve, an effective methodology to predict such threats is crucial to ensure the survival of living species. Organizations like the International Union for Conservation of Nature (IUCN) monitor the Earth's environmental networks to preserve the sanctity of terrestrial and marine life. The IUCN Red List of Threatened Species informs the conservation activities of governments as a world standard of species' risks of extinction. However, the IUCN's current methodology is, in some ways, inefficient given the immense volume of Earth's species and the laboriousness of its species' risk classification process. IUCN assessors can take years to classify a species' extinction risk, even as that species continues to decline. Therefore, to supplement the IUCN's classification process and thus bolster conservationist efforts for threatened species, a Random Forest model was constructed, trained on a group of fish species previously classified by the IUCN Red List. This Random Forest model both validates the IUCN Red List's classification method and offers a highly efficient, supplemental classification method for species' extinction risk. In addition, this Random Forest model is applicable to species with deficient data, which the IUCN Red List is otherwise unable to classify, thus engendering conservationist efforts for previously obscure species. Although this Random Forest model is built specifically for the trained fish species (Sparidae), the methodology can and should be extended to additional species.
ContributorsWoodyard, Megan (Author) / Broatch, Jennifer (Thesis director) / Polidoro, Beth (Committee member) / Mancenido, Michelle (Committee member) / School of Humanities, Arts, and Cultural Studies (Contributor) / School of Mathematical and Natural Sciences (Contributor) / College of Integrative Sciences and Arts (Contributor) / Barrett, The Honors College (Contributor)
Created2018-05