Matching Items (15)

Description
With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data sets or the complexity of the information hidden within them. Thus, knowledge discovery through machine learning is necessary to better understand the information in data. This dissertation explores the topics of asymmetric loss and asymmetric data in machine learning and proposes new algorithms for some of the problems in these areas. It also studies variable selection for matched data sets and proposes a solution when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy; the aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets in which variables are predictive for only a subset of the classes. An Asymmetric Random Forest (ARF) is proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. A Matched Random Forest (MRF) is proposed to find variables that distinguish case from control without the restrictions that exist in linear models; MRF detects such variables even in the presence of interactions and qualitative variables.
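
The aSVM itself is not reproduced here. As a minimal sketch of the underlying idea, penalizing errors on one class more heavily than the other so that the class of interest is predicted with higher precision, the code below uses scikit-learn's class_weight option on a standard SVM; the dataset and weight values are illustrative assumptions, not the dissertation's method.

# A minimal sketch of asymmetric misclassification costs with a standard SVM.
# This is not the aSVM proposed in the dissertation; it only illustrates the
# general idea of weighting errors on one class more heavily than the other.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import precision_score

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Regular SVM: both error types are penalized equally.
symmetric = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)

# Asymmetric costs: errors on class 0 (false positives for class 1) cost more.
asymmetric = SVC(kernel="rbf", C=1.0, class_weight={0: 5.0, 1: 1.0}).fit(X_tr, y_tr)

for name, model in [("symmetric", symmetric), ("asymmetric", asymmetric)]:
    print(name, "precision on class 1:",
          round(precision_score(y_te, model.predict(X_te)), 3))

Raising the weight on the negative class makes the classifier more conservative about predicting the positive class, which usually trades recall for precision.
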
Contributors: Koh, Derek (Author) / Runger, George C. (Thesis advisor) / Wu, Tong (Committee member) / Pan, Rong (Committee member) / Cesta, John (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Technological advances have enabled the generation and collection of various data from complex systems, thus creating ample opportunity to integrate knowledge into many decision-making applications. This dissertation introduces holistic learning as the integration of a comprehensive set of relationships that are used towards the learning objective. The holistic view of the problem allows for richer learning from data and thereby improves decision making.

The first topic of this dissertation is the prediction of several target attributes using a common set of predictor attributes. In the holistic learning approach developed here, the relationships between target attributes are embedded into the learning algorithm. Specifically, a novel tree-based ensemble that leverages the relationships between target attributes to construct a diverse, yet strong, model is proposed. The method is justified through its connection to existing methods and through experimental evaluations on synthetic and real data.
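
The novel ensemble is not reproduced here. As a hedged illustration of the multi-target setting, several responses predicted from one common set of predictors, the sketch below fits a single scikit-learn random forest to a multi-output target; the synthetic data and settings are assumptions for illustration only.

# Sketch of multi-target prediction with a tree ensemble (not the novel
# ensemble proposed in the dissertation): one forest, several responses.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 predictors shared by 3 correlated target attributes.
X, Y = make_regression(n_samples=500, n_features=10, n_targets=3,
                       noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, Y)                      # scikit-learn accepts 2-D targets natively
print(forest.predict(X[:2]).shape)    # (2, 3): one prediction per target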

The second topic pertains to monitoring complex systems that are modeled as networks. Such systems present a rich set of attributes and relationships for which holistic learning is important. In social networks, for example, in addition to friendship ties, various attributes concerning the users' gender, age, topics of messages, times of messages, etc. are collected. A restricted form of monitoring fails to take the relationships among multiple attributes into account, whereas the holistic view embeds such relationships in the monitoring methods. The focus is on the difficult task of detecting a change that might impact only a small subset of the network and occur only in a sub-region of the high-dimensional space of the network attributes. One contribution is a monitoring algorithm based on a network statistical model. Another contribution is a transactional model that transforms the task into an expedient structure for machine learning, along with a generalizable algorithm to monitor the attributed network. A learning step in this algorithm adapts to changes that may be local to sub-regions (with broader potential for other learning tasks). Diagnostic tools to interpret the change are provided. This robust, generalizable, holistic monitoring method is illustrated on synthetic and real networks.
Contributors: Azarnoush, Bahareh (Author) / Runger, George C. (Thesis advisor) / Bekki, Jennifer (Thesis advisor) / Pan, Rong (Committee member) / Saghafian, Soroush (Committee member) / Arizona State University (Publisher)
Created: 2014
Description
Temporal data are increasingly prevalent and important in analytics. Time series (TS) data are chronological sequences of observations and an important class of temporal data. Fields such as medicine, finance, learning science, and multimedia naturally generate TS data. Each series provides a high-dimensional data vector that challenges the learning of the relevant patterns. This dissertation proposes TS representations and methods for supervised TS analysis. The approaches combine new representations that handle translations and dilations of patterns with bag-of-features strategies and tree-based ensemble learning. This provides flexibility in handling time-warped patterns in a computationally efficient way. The ensemble learners provide a classification framework that can handle high-dimensional feature spaces, multiple classes, and interactions between features. The proposed representations are useful for classification and interpretation of TS data of varying complexity.

The first contribution handles the problem of time warping with a feature-based approach. An interval selection and local feature extraction strategy is proposed to learn a bag-of-features representation. This is distinctly different from common similarity-based time warping and allows additional features (such as pattern location) to be easily integrated into the models. The learners can account for temporal information through the recursive partitioning method.

The second contribution focuses on the comprehensibility of the models. A new representation is integrated with local feature importance measures from tree-based ensembles to diagnose and interpret the time intervals that are important to the model.

Multivariate time series (MTS) are especially challenging because the input consists of a collection of TS, and both features within a TS and interactions between TS can be important to models. Another contribution uses a different representation to produce computationally efficient strategies that learn a symbolic representation for MTS. Relationships between the multiple TS, nominal values, and missing values are handled with tree-based learners.

Applications such as speech recognition, medical diagnosis, and gesture recognition are used to illustrate the methods. Experimental results show that the TS representations and methods provide better results than competitive methods on a comprehensive collection of benchmark datasets. Moreover, the proposed approaches naturally provide solutions to similarity analysis, predictive pattern discovery, and feature selection.
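
As a rough sketch of the bag-of-features idea described above (not the dissertation's algorithm), the code below extracts simple local features, mean, standard deviation, and slope, over randomly chosen intervals of each series and classifies the resulting fixed-length vectors with a random forest; the interval scheme and synthetic data are illustrative assumptions.

# Rough sketch of an interval-based bag-of-features representation for time
# series classification. Features: mean, standard deviation, and slope over
# randomly chosen intervals, classified with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def interval_features(series, intervals):
    feats = []
    for start, end in intervals:
        seg = series[start:end]
        t = np.arange(len(seg))
        slope = np.polyfit(t, seg, 1)[0] if len(seg) > 1 else 0.0
        feats.extend([seg.mean(), seg.std(), slope])
    return feats

# Synthetic example: class 1 contains a bump whose location varies (a translation).
n, length = 200, 100
X_raw = rng.normal(size=(n, length))
y = rng.integers(0, 2, size=n)
for i in np.where(y == 1)[0]:
    loc = rng.integers(10, 80)
    X_raw[i, loc:loc + 10] += 3.0

# Random intervals shared by all series.
starts = rng.integers(0, length - 5, size=20)
intervals = [(s, s + rng.integers(5, length - s + 1)) for s in starts]

X_feat = np.array([interval_features(x, intervals) for x in X_raw])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_feat, y)
print("training accuracy:", clf.score(X_feat, y))
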
Contributors: Baydogan, Mustafa Gokce (Author) / Runger, George C. (Thesis advisor) / Atkinson, Robert (Committee member) / Gel, Esma (Committee member) / Pan, Rong (Committee member) / Arizona State University (Publisher)
Created: 2012
Description
Functional or dynamic responses are prevalent in experiments in the fields of engineering, medicine, and the sciences, but proposals for optimal designs are still sparse for this type of response. Experiments with dynamic responses yield multiple responses taken over a spectrum variable, so the design matrix for a dynamic response has a more complicated structure. In the literature, the optimal design problem for some functional responses has been solved using genetic algorithms (GA) and approximate design methods. The goal of this dissertation is to develop fast computer algorithms for calculating exact D-optimal designs.

First, we demonstrate how traditional exchange methods can be improved to yield a computationally efficient algorithm for finding G-optimal designs. The proposed two-stage algorithm, called the cCEA, uses a clustering-based approach to restrict the set of candidate points considered by the point exchange algorithm (PEA) and then improves G-efficiency using a coordinate exchange algorithm (CEA).

The second major contribution of this dissertation is the development of fast algorithms for constructing D-optimal designs that determine the optimal sequence of stimuli in fMRI studies. The update formula for the determinant of the information matrix is improved by exploiting the sparsity of the information matrix, leading to faster computation. The proposed algorithm outperforms the genetic algorithm with respect to both computational efficiency and D-efficiency.
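
The improved update formula itself is not given in the abstract. As a hedged illustration of why determinant update formulas make design search fast, the sketch below applies the standard matrix determinant lemma to a rank-one change of an information matrix; this is a general identity, not necessarily the formula developed in the dissertation.

# Rank-one determinant update of an information matrix: the kind of update
# formula that makes exchange algorithms fast (matrix determinant lemma).
import numpy as np

rng = np.random.default_rng(0)
F = rng.normal(size=(30, 6))          # model matrix of a current design
M = F.T @ F                           # information matrix (X'X form)
M_inv = np.linalg.inv(M)
det_M = np.linalg.det(M)

f_new = rng.normal(size=6)            # candidate design point (row of F)

# Matrix determinant lemma: det(M + f f') = det(M) * (1 + f' M^{-1} f).
det_updated_fast = det_M * (1.0 + f_new @ M_inv @ f_new)

# Direct recomputation for comparison.
det_updated_direct = np.linalg.det(M + np.outer(f_new, f_new))

print(np.isclose(det_updated_fast, det_updated_direct))   # True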

The third contribution is a study of optimal experimental designs for more general functional response models. First, a B-spline basis is proposed as the non-parametric smoother of the response function, and an algorithm is developed to determine D-optimal sampling points of a spectrum variable. Second, a two-step algorithm is proposed for finding the optimal design over both sampling points and experimental settings. In the first step, the matrix of experimental settings is held fixed while the algorithm optimizes the determinant of the information matrix of a mixed-effects model to find the optimal sampling times. In the second step, the optimal sampling times obtained from the first step are held fixed while the algorithm iterates on the information matrix to find the optimal experimental settings. The designs constructed by this approach yield superior performance over other designs found in the literature.
Contributors: Saleh, Moein (Author) / Pan, Rong (Thesis advisor) / Montgomery, Douglas C. (Committee member) / Runger, George C. (Committee member) / Kao, Ming-Hung (Committee member) / Arizona State University (Publisher)
Created: 2015
Description
Recent technological advances enable the collection of complex, heterogeneous, and high-dimensional data in biomedical domains. The increasing availability of high-dimensional biomedical data creates the need for new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods that help understand the data, discover patterns, and improve decision making. All of the proposed methods can generalize to other industrial fields.

The first topic of this dissertation focuses on data clustering. Data clustering is often the first step in analyzing a dataset without label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.
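
CRAFTER is not reproduced here. As a hedged sketch of the general tree-ensemble clustering idea, the code below trains a random forest to separate the real data from a column-permuted copy, uses leaf co-occurrence as a proximity between real samples, and clusters the proximities; every modeling choice shown is an illustrative assumption rather than the CRAFTER procedure.

# Hedged sketch of clustering with a tree ensemble (not the CRAFTER algorithm):
# train a forest to separate real data from a column-permuted copy, use leaf
# co-occurrence of real samples as a similarity, and cluster it.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=300, centers=3, n_features=8, random_state=0)

# Synthetic contrast data: permute each column independently to break structure.
X_fake = np.column_stack([rng.permutation(col) for col in X.T])
X_all = np.vstack([X, X_fake])
y_all = np.r_[np.ones(len(X)), np.zeros(len(X_fake))]

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_all, y_all)

# Proximity: fraction of trees in which two real samples share a leaf.
leaves = forest.apply(X)                       # (n_samples, n_trees) leaf indices
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

labels = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                 linkage="average").fit_predict(1.0 - prox)
print(np.bincount(labels))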

The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.

The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both accuracy and interpretability. GCRNN contains a convolutional component to extract high-level features and a recurrent component to model temporal characteristics. A feed-forward, fully connected network with sparse group lasso regularization generates the final classification and provides interpretability.
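
The published GCRNN architecture is not reproduced here. The sketch below only illustrates the general pattern named in the abstract, convolutional feature extraction followed by a recurrent layer and a fully connected output with a sparse-group-lasso-style penalty added to the loss; the layer sizes and the exact penalty form are assumptions.

# Hedged sketch of a convolutional + recurrent time series classifier with a
# sparse-group-lasso-style penalty (illustrative layer sizes; not the exact
# GCRNN architecture from the dissertation).
import torch
import torch.nn as nn

class ConvRecurrentClassifier(nn.Module):
    def __init__(self, n_channels=1, n_classes=3, hidden=32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU())
        self.gru = nn.GRU(input_size=32, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                      # x: (batch, channels, time)
        h = self.conv(x)                       # (batch, 32, time)
        out, _ = self.gru(h.transpose(1, 2))   # (batch, time, hidden)
        return self.fc(out[:, -1, :])          # logits from the last time step

def sparse_group_penalty(fc_weight, lam=1e-3, alpha=0.5):
    # Group = all output weights of one hidden unit; L2 over groups plus L1.
    group_l2 = fc_weight.norm(dim=0).sum()
    l1 = fc_weight.abs().sum()
    return lam * (alpha * group_l2 + (1 - alpha) * l1)

model = ConvRecurrentClassifier()
x = torch.randn(8, 1, 100)                     # batch of 8 univariate series
y = torch.randint(0, 3, (8,))
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, y) + sparse_group_penalty(model.fc.weight)
loss.backward()
print(float(loss))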

The last topic centers on dimensionality reduction methods for time series data. A good dimensionality reduction method is important for storage, decision making, and pattern visualization. The CRNN autoencoder is proposed to not only achieve low reconstruction error but also generate discriminative features. A variational version of this autoencoder shows great potential for applications such as anomaly detection and process control.
Contributors: Lin, Sangdi (Author) / Runger, George C. (Thesis advisor) / Kocher, Jean-Pierre A (Committee member) / Pan, Rong (Committee member) / Escobedo, Adolfo R. (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
The NFL is one of the largest and most influential industries in the world. Few companies in America have a stronger hold on American culture or create such a phenomenon from year to year. This project aimed to develop a strategy that helps an NFL team be as successful as possible by defining which positions are most important to a team's success. Data from fifteen years of NFL games were collected, and information on every player in the league was analyzed. First, a benchmark describing an average team was established, and every player in the NFL was compared to that average. Using the properties of linear regression with ordinary least squares, this project defines a model that shows each position's importance. Finally, once the model was established, the focus turned to the NFL draft, where the goal was to determine where each position should be drafted so that the pick is most likely to give the best payoff based on the results of the regression in part one.
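
The thesis's data and final model are not included here. As a hedged sketch of the ordinary least squares setup described above, the code regresses team wins on hypothetical position-strength measures; all column names and values are made up for illustration.

# Hedged sketch of the OLS setup: regress team wins on per-position strength
# measures relative to the league average. Column names and data are
# hypothetical; they only illustrate the modeling approach.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_team_seasons = 480   # e.g., 32 teams x 15 seasons

# Hypothetical position-strength features, expressed relative to league average.
X = pd.DataFrame({
    "qb_above_avg": rng.normal(size=n_team_seasons),
    "ol_above_avg": rng.normal(size=n_team_seasons),
    "wr_above_avg": rng.normal(size=n_team_seasons),
    "de_above_avg": rng.normal(size=n_team_seasons),
})
wins = (8 + 2.0 * X["qb_above_avg"] + 0.8 * X["ol_above_avg"]
        + 0.5 * X["wr_above_avg"] + 0.6 * X["de_above_avg"]
        + rng.normal(0, 1.5, n_team_seasons))

model = sm.OLS(wins, sm.add_constant(X)).fit()
print(model.params.round(2))   # coefficient sizes suggest positional importance
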
Contributors: Balzer, Kevin Ryan (Author) / Goegan, Brian (Thesis director) / Dassanayake, Maduranga (Committee member) / Barrett, The Honors College (Contributor) / Economics Program in CLAS (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2015-05
Description
Beginning with the publication of Moneyball by Michael Lewis in 2003, the use of sabermetrics, the application of statistical analysis to baseball records, has exploded in major league front offices. Executives Billy Beane, Paul DePodesta, and Theo Epstein are notable figures who have successfully incorporated sabermetrics into their teams' philosophies, resulting in playoff appearances and championship success. The competitive market of baseball, once dominated by the collusion of owners, now promotes innovative thought to analytically develop competitive advantages. The tiered economic payrolls of Major League Baseball (MLB) have created an environment in which large-market teams are capable of "buying" championships through the acquisition of the best available talent in free agency, while small-market teams are pushed to "build" championships through the drafting and systematic farming of high-school and college-level players. The use of sabermetrics promotes both models of success, buying and building, by determining a player's productivity without bias. The objective of this paper is to develop a regression-based predictive model that Major League Baseball teams can use to forecast the MLB career average offensive performance of college baseball players from specific conferences. The development of this model required multiple tasks:
I. Data were obtained from The Baseball Cube, a baseball records database providing both college and MLB data.
II. The data were modified to adjust for year-to-year formatting, a missing variable for seasons played, and the presence of missing values, and to correct league identifiers.
III. Multiple offensive productivity models capable of handling the obtained dataset and the regression forecasting technique were evaluated.
IV. SAS software was used to create the regression models and analyze the residuals for any irregularities or normality violations.
The results of this paper find that there is a relationship between Division 1 collegiate baseball conferences and average career offensive productivity in Major League Baseball, with the SEC having the most accurate reflection of performance.
Contributors: Badger, Mathew Bernard (Author) / Goegan, Brian (Thesis director) / Eaton, John (Committee member) / Department of Economics (Contributor) / Department of Marketing (Contributor) / Barrett, The Honors College (Contributor)
Created: 2017-05
Description
This study examines the economic impact of the opioid crisis in the United States. Primarily covering the years 2007-2018, I gathered data from the Census Bureau, the Centers for Disease Control, and the Kaiser Family Foundation in order to examine the relative impact of a one-dollar increase in GDP per capita on death rates caused by opioids. Using a fixed-effects panel data design, I regressed deaths on GDP per capita while holding the following constant: population, U.S. retail opioid prescriptions per 100 people, annual average unemployment rate, percent of the population that is Caucasian, and percent of the population that is male. I found that GDP per capita and opioid-related deaths are negatively correlated, meaning that with every additional person dying from opioids, GDP per capita decreases. This finding is important because opioid overdose is harmful to society, as U.S. life expectancy is consistently dropping while opioid death rates rise. Increasing awareness of this topic can help prevent misuse and contribute to an overall reduction in opioid-related deaths.
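
The study's data are not reproduced here. As a hedged sketch of a fixed-effects panel design like the one described, the code below regresses a state-year death rate on GDP per capita with state and year fixed effects and a couple of the listed controls, using statsmodels; the variable names and data are hypothetical.

# Hedged sketch of a fixed-effects panel regression (hypothetical data): opioid
# death rate regressed on GDP per capita with state and year fixed effects,
# plus controls, mirroring the design described in the abstract.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
states, years = [f"S{i:02d}" for i in range(50)], list(range(2007, 2019))
panel = pd.DataFrame([(s, t) for s in states for t in years],
                     columns=["state", "year"])
n = len(panel)
panel["gdp_per_capita"] = rng.normal(50_000, 8_000, n)
panel["rx_per_100"] = rng.normal(70, 15, n)
panel["unemployment"] = rng.normal(6, 2, n)
panel["death_rate"] = (20 - 0.0002 * panel["gdp_per_capita"]
                       + 0.05 * panel["rx_per_100"] + rng.normal(0, 2, n))

# C(state) and C(year) add the fixed effects as dummy variables.
fit = smf.ols("death_rate ~ gdp_per_capita + rx_per_100 + unemployment"
              " + C(state) + C(year)", data=panel).fit()
print(fit.params["gdp_per_capita"])
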
Contributors: Ravi, Ritika Lisa (Author) / Goegan, Brian (Thesis director) / Hill, John (Committee member) / Department of Economics (Contributor) / Department of Information Systems (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-05
Description
Our research examined the prospect draft in baseball and looked at what type of player teams should draft to maximize value. We wanted to know which position returned the best value to the team that drafted it and whether it is safer to draft players from college or from high school. We looked at draft data from 2006-2010 for the first ten rounds of players selected; because there is a monetary cap only on players drafted in the first ten rounds, we restricted our data to these players. Once the parameters were set, we compiled a spreadsheet of these players with both their signing bonuses and their wins above replacement (WAR), which allowed us to see how much a team was spending per win at the major league level. After the data were compiled, we made pivot tables and graphs to visually represent the data and better understand the numbers. We found that the worst position an MLB team could draft is high school second baseman: they returned the lowest WAR of any players we examined. In general, high school players were more costly to sign and had lower WARs than their college counterparts, making them, on average, a worse pick in terms of value. The best position to pick was college shortstop. They had the best signability of all players along with one of the highest WARs and lowest signing bonuses, ranking near the top in all three of the main factors you want from a draft pick. This research can help give guidelines to Major League teams as they select players in the draft. While there will always be exceptions to trends, by following the enclosed research teams can minimize risk in the draft.
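
The compiled draft spreadsheet is not included here. As a hedged sketch of the cost-per-win calculation described above, the code builds a pandas pivot table of signing bonus and WAR by position and level and divides the two; the rows and column names are hypothetical.

# Hedged sketch (hypothetical rows and column names): cost per WAR by draft
# position and level, computed with a pandas pivot table.
import pandas as pd

draft = pd.DataFrame({
    "position": ["SS", "SS", "2B", "2B", "RHP", "OF"],
    "level":    ["College", "HS", "College", "HS", "College", "HS"],
    "bonus":    [1_200_000, 2_000_000, 900_000, 1_500_000, 2_500_000, 1_800_000],
    "war":      [6.0, 3.5, 2.0, 0.4, 4.5, 2.1],
})

pivot = draft.pivot_table(index="position", columns="level",
                          values=["bonus", "war"], aggfunc="sum")

# Dollars spent per win above replacement, by position and level.
cost_per_war = (pivot["bonus"] / pivot["war"]).round(0)
print(cost_per_war)
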
Contributors: Valentine, Robert (Co-author) / Johnson, Ben (Co-author) / Eaton, John (Thesis director) / Goegan, Brian (Committee member) / Department of Finance (Contributor) / Department of Economics (Contributor) / Department of Information Systems (Contributor) / School of Accountancy (Contributor) / Barrett, The Honors College (Contributor)
Created: 2017-05
Description
More than 40% of all U.S. opioid overdose deaths in 2016 involved a prescription opioid, with more than 46 people dying every day from overdoses involving prescription opioids (CDC, 2017). Over the years, lawmakers have implemented policies and laws to address the opioid epidemic, and many of these vary from state to state. This study lays out the basic guidelines of common pieces of legislation. It also examines relationships between six state-specific prescribing or preventative laws and associated changes in opioid-related deaths using a longitudinal cross-state study design (2007-2015). Specifically, it uses a linear regression to examine changes in state-specific rates of opioid-related deaths after implementation of specific policies, and whether states implementing these policies saw smaller increases than states without these policies. Initial key findings show that three policies have a statistically significant association with opioid-related overdose deaths: Good Samaritan Laws, Standing Order Laws, and Naloxone Liability Laws. Paradoxically, all three policies correlated with an increase in opioid overdose deaths between 2007 and 2016. However, after correcting for the potentially spurious relationship between state-specific timing of policy implementation and death rates, two policies have a statistically significant association (alpha < 0.05) with opioid overdose death rates. First, the Naloxone Liability Laws were significantly associated with changes in opioid-related deaths and were correlated with a 0.33 log increase in opioid overdose death rates, or a 29% increase. This equates to about 1.39 more deaths per year per 100,000 people. Second, the legislation that allows for 3rd Party Naloxone prescriptions correlated with a 0.33 log decrease in opioid overdose death rates, or a 29% decrease. This equates to 1.39 fewer deaths per year per 100,000 people.
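
The study's data and exact specification are not reproduced here. As a hedged sketch of a longitudinal cross-state design like the one described, the code below regresses log opioid death rates on an indicator that switches on in the year a state implements a policy, with state and year effects; the data and adoption years are hypothetical, and exp(coefficient) minus one is the standard conversion of a log-point change to a percent change.

# Hedged sketch (hypothetical data): a longitudinal cross-state regression of
# log opioid death rates on a policy indicator that switches on in the year a
# state implements the law, with state and year effects.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
states = [f"S{i:02d}" for i in range(50)]
years = list(range(2007, 2016))
rows = pd.DataFrame([(s, t) for s in states for t in years], columns=["state", "year"])

# Hypothetical implementation year per state (9999 means the state never adopts).
impl_year = {s: int(rng.choice([2009, 2011, 2013, 9999])) for s in states}
rows["policy"] = [int(t >= impl_year[s]) for s, t in zip(rows["state"], rows["year"])]
rows["log_death_rate"] = (1.5 + 0.04 * (rows["year"] - 2007)
                          - 0.10 * rows["policy"] + rng.normal(0, 0.2, len(rows)))

fit = smf.ols("log_death_rate ~ policy + C(state) + C(year)", data=rows).fit()
b = fit.params["policy"]
print(f"policy coefficient: {b:.3f} (implied percent change: {np.exp(b) - 1:.1%})")
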
Contributors: Davis, Joshua Alan (Author) / Hruschka, Daniel (Thesis director) / Gaughan, Monica (Committee member) / School of Human Evolution & Social Change (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-05