Matching Items (46)
Description
This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview of common techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of novel “reduced form” models designed to address the particular challenges of different datasets. Chapter 3 explores the question of whether forecasts of bankruptcy cause bankruptcy. The question arises from the observation that companies issued going concern opinions were more likely to go bankrupt in the following year, leading to speculation that the opinions themselves caused the bankruptcies via a “self-fulfilling prophecy”. A Bayesian machine learning sensitivity analysis is developed to answer this question. In exchange for additional flexibility and fewer assumptions, this approach gives up point identification of causal effects; the sensitivity analysis therefore studies a wide range of plausible scenarios for the causal effect of going concern opinions on bankruptcy. Simulations report the model's performance in comparison with other popular methods and examine its robustness to misspecification. Results on empirical data indicate that forecasts of bankruptcy likely do have a small causal effect. Chapter 4 studies the effects of vaccination on COVID-19 mortality at the state level in the United States. The dynamic nature of the pandemic complicates more straightforward regression adjustments and invalidates many alternative models. The chapter comments on the limitations of both mechanistic approaches and traditional statistical methods when applied to epidemiological data. Instead, a state space model is developed that allows the study of the ever-changing dynamics of the pandemic's progression. In the first stage, the model decomposes the observed mortality data into component surges; this information is later used in a semi-parametric regression model for causal analysis. Results are investigated thoroughly for empirical justification and stress-tested in simulated settings.
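
A minimal R sketch of this style of sensitivity analysis: an unobserved confounder of assumed strength biases the naive estimate, and the analysis reports the adjusted effect across a grid of plausible strengths. The data-generating process, variable names, and additive bias adjustment are illustrative assumptions, not the dissertation's actual model.

set.seed(1)
n <- 5000
u <- rnorm(n)                                 # unobserved confounder
gc_opinion <- rbinom(n, 1, plogis(u))         # opinions issued more often when u is high
bankrupt <- rbinom(n, 1, plogis(-2 + 0.3 * gc_opinion + u))

# The naive estimate ignores u and is biased upward
naive <- coef(glm(bankrupt ~ gc_opinion, family = binomial))["gc_opinion"]

# Without point identification, report the effect under a range of
# assumed confounding strengths (log-odds scale)
for (gamma in seq(0, 1.5, by = 0.25))
  cat(sprintf("assumed confounding %.2f -> adjusted effect %.3f\n",
              gamma, naive - gamma))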
ContributorsPapakostas, Demetrios (Author) / Hahn, Paul (Thesis advisor) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Kao, Ming-Hung (Committee member) / Lan, Shiwei (Committee member) / Arizona State University (Publisher)
Created2023
Description
The number of extreme wildfires is on the rise globally, and predicting the size of a fire will help officials make appropriate decisions to mitigate the risk the fire poses to the environment and humans. This study attempts to predict the burned area of fires in the United States from attributes such as the time, weather, and location of the fire, using machine learning methods.
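
A hedged sketch of such a model using the randomForest package; the simulated columns stand in for the time, weather, and location attributes, and all names and effect sizes here are invented for illustration.

library(randomForest)

set.seed(42)
n <- 500
fires <- data.frame(
  month       = factor(sample(month.abb, n, replace = TRUE)),
  temperature = runif(n, 0, 40),
  wind        = runif(n, 0, 10),
  humidity    = runif(n, 10, 100),
  latitude    = runif(n, 25, 49),
  longitude   = runif(n, -125, -67)
)
fires$burned_area <- with(fires, exp(0.1 * temperature + 0.2 * wind -
                                     0.02 * humidity + rnorm(n)))

fit <- randomForest(burned_area ~ ., data = fires, ntree = 500)
predict(fit, newdata = fires[1:5, ])   # predicted burned area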
ContributorsPrabagaran, Padma (Author, Co-author) / Meuth, Ryan (Thesis director) / McCulloch, Robert (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created2022-12
Description
This dissertation centers on Bayesian Additive Regression Trees (BART) and Accelerated BART (XBART) and presents a series of models that tackle extrapolation, classification, and causal inference challenges. To improve extrapolation in tree-based models, I propose a method called local Gaussian Process (GP), which combines Gaussian process regression with trained BART trees. This allows for extrapolation based on the most relevant data points and covariates, as determined by the trees' structure. The local GP technique is extended to Bayesian causal forest (BCF) models to address the positivity violation issue in causal inference. Additionally, I introduce the LongBet model to estimate time-varying, heterogeneous treatment effects in panel data. Furthermore, I present a Poisson-based model with a modified likelihood for XBART to address the multi-class classification problem.
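
The local GP idea can be illustrated in a few lines of base R: fit a Gaussian process only to the training points nearest the query, then extrapolate from that neighborhood. This sketch selects neighbors by Euclidean distance under a squared-exponential kernel; the actual method instead uses the fitted trees' structure to choose the relevant points and covariates.

sq_exp <- function(a, b, ell = 1, sf = 1)          # squared-exponential kernel
  sf^2 * exp(-outer(a, b, function(x, y) (x - y)^2) / (2 * ell^2))

local_gp_predict <- function(x_train, y_train, x_star, k = 10, sigma = 0.1) {
  nn <- order(abs(x_train - x_star))[1:k]          # k nearest training points
  K  <- sq_exp(x_train[nn], x_train[nn]) + sigma^2 * diag(k)
  ks <- sq_exp(x_star, x_train[nn])
  drop(ks %*% solve(K, y_train[nn]))               # GP posterior mean at x_star
}

set.seed(1)
x <- runif(100, 0, 5)
y <- sin(x) + rnorm(100, sd = 0.1)
local_gp_predict(x, y, x_star = 6)                 # extrapolates beyond the data range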
ContributorsWang, Meijia (Author) / Hahn, Paul (Thesis advisor) / He, Jingyu (Committee member) / Lan, Shiwei (Committee member) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Arizona State University (Publisher)
Created2024
Description
Signaling cascades transduce signals received on the cell membrane to the nucleus. While noise filtering, ultra-sensitive switches, and signal amplification have all been shown to be features of such signaling cascades, it is not understood why cascades typically show three or four layers. Using singular perturbation theory, Michaelis-Menten type equations are derived for open enzymatic systems. When these equations are organized into a cascade, it is demonstrated that the output signal as a function of time becomes sigmoidal with the addition of more layers. Furthermore, it is shown that the activation time will speed up to a point, after which more layers become superfluous. It is shown that three layers create a reliable sigmoidal response progress curve from a wide variety of time-dependent signaling inputs arriving at the cell membrane, suggesting that natural selection may have favored signaling cascades as a parsimonious solution to the problem of generating switch-like behavior in a noisy environment.
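
A hedged sketch of such a cascade in R using the deSolve ODE solver: three layers with Michaelis-Menten type activation and deactivation, driven by a transient input. The rate constants and pulse input are illustrative choices, not values from the dissertation.

library(deSolve)

cascade <- function(t, a, p) {
  u <- if (t < 2) 1 else 0            # transient signal at the membrane
  inputs <- c(u, a[1], a[2])          # each layer is activated by the one above
  da <- with(as.list(p),
             k1 * inputs * (1 - a) / (Km + (1 - a)) - k2 * a / (Km + a))
  list(da)
}

p   <- c(k1 = 2, k2 = 1, Km = 0.1)
out <- ode(y = c(a1 = 0, a2 = 0, a3 = 0), times = seq(0, 10, by = 0.01),
           func = cascade, parms = p)

# The third layer's response is the most sigmoidal (switch-like)
matplot(out[, "time"], out[, c("a1", "a2", "a3")], type = "l",
        xlab = "time", ylab = "active fraction")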
ContributorsYoung, Jonathan Trinity (Author) / Armbruster, Dieter (Thesis advisor) / Platte, Rodrigo (Committee member) / Nagy, John (Committee member) / Baer, Steven (Committee member) / Taylor, Jesse (Committee member) / Arizona State University (Publisher)
Created2013
Description
High-order methods are known for their accuracy and computational performance when applied to solving partial differential equations and have widespread use in representing images compactly. Nonetheless, high-order methods have difficulty representing functions containing discontinuities or functions having slow spectral decay in the chosen basis. Certain sensing techniques such as MRI and SAR provide data in terms of Fourier coefficients, and thus prescribe a natural high-order basis. The field of compressed sensing has introduced a set of techniques based on $\ell^1$ regularization that promote sparsity and facilitate working with functions having discontinuities. In this dissertation, high-order methods and $\ell^1$ regularization are used to address three problems: reconstructing piecewise smooth functions from sparse and noisy Fourier data, recovering edge locations in piecewise smooth functions from sparse and noisy Fourier data, and reducing time-stepping constraints when numerically solving certain time-dependent hyperbolic partial differential equations.
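
One standard way to write the reconstruction problem described above, stated here as a sketch (with $\mathcal{F}$ the partial Fourier transform producing the measured coefficients $\hat{f}$, and $T$ a sparsifying transform such as total variation):

\[ g^* = \arg\min_g \|\mathcal{F}g - \hat{f}\|_2^2 + \lambda \|Tg\|_1. \]

The $\ell^1$ penalty promotes sparsity in $Tg$, which is what makes piecewise smooth functions, whose gradients are sparse, recoverable from sparse and noisy Fourier data.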
ContributorsDenker, Dennis (Author) / Gelb, Anne (Thesis advisor) / Archibald, Richard (Committee member) / Armbruster, Dieter (Committee member) / Boggess, Albert (Committee member) / Platte, Rodrigo (Committee member) / Saders, Toby (Committee member) / Arizona State University (Publisher)
Created2016
Description
The main objective of mathematical modeling is to connect mathematics with other scientific fields. Developing predictive models helps us understand the behavior of biological systems, and by testing models one can relate mathematics to real-world experiments. To validate predictions numerically, one has to compare them with experimental data sets. Mathematical models can be split into two groups: microscopic and macroscopic. Microscopic models describe the motion of so-called agents (e.g. cells, ants) that interact with their surrounding neighbors. At large scales, the interactions among these agents give rise to emergent structures such as flocking and swarming. One of the key questions is to relate the particular interactions among agents to the overall emerging structures. Macroscopic models are precisely designed to describe the evolution of such large structures. They are usually given as partial differential equations describing the time evolution of a density distribution (instead of tracking each individual agent). For instance, reaction-diffusion equations are used to model glioma cells and to predict tumor growth. This dissertation aims at developing such a framework to better understand the complex behavior of foraging ants and glioma cells.
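
As a concrete instance of such a macroscopic model, glioma growth is commonly described by a reaction-diffusion equation of Fisher-KPP type (a canonical textbook form, not necessarily the dissertation's exact model):

\[ \frac{\partial \rho}{\partial t} = \nabla \cdot (D \nabla \rho) + r\rho\left(1 - \frac{\rho}{K}\right), \]

where $\rho(x,t)$ is the tumor cell density, $D$ the diffusion coefficient, $r$ the proliferation rate, and $K$ the carrying capacity.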
ContributorsJamous, Sara Sami (Author) / Motsch, Sebastien (Thesis advisor) / Armbruster, Dieter (Committee member) / Camacho, Erika (Committee member) / Moustaoui, Mohamed (Committee member) / Platte, Rodrigo (Committee member) / Arizona State University (Publisher)
Created2019
Description
Spatial regression is one of the central topics in spatial statistics. Depending on the goal, interpretation or prediction, spatial regression models can be classified into two categories: linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real-world applications, and new methods and models were proposed to overcome challenges arising in practice. There are three major parts to the dissertation.

In the first part, nonlinear regression models were embedded into a multistage workflow to predict the spatial abundance of reef fish species in the Gulf of Mexico. There were two challenges: zero-inflated data and out-of-sample prediction. The methods and models in the workflow could effectively handle the zero-inflated sampling data without strong assumptions. Three strategies were proposed to solve the out-of-sample prediction problem. The results and discussion showed that the nonlinear prediction had the advantages of high accuracy, low bias, and good performance at multiple resolutions.
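
A minimal R sketch of one standard two-stage (hurdle) treatment of zero-inflated abundance data; the covariates and GLM forms below are placeholders for illustration, whereas the actual workflow used nonlinear models.

set.seed(7)
n <- 1000
survey <- data.frame(depth = runif(n, 5, 200), temp = runif(n, 15, 30))
lambda <- with(survey, exp(1 - 0.01 * depth + 0.05 * temp))
survey$count <- ifelse(runif(n) < 0.4, 0, rpois(n, lambda))   # excess zeros

# Stage 1: probability the species is present at all
presence  <- glm(I(count > 0) ~ depth + temp, data = survey, family = binomial)
# Stage 2: abundance given presence, fit on the positive records only
abundance <- glm(count ~ depth + temp, data = survey,
                 subset = count > 0, family = poisson)

# Expected abundance = P(present) * E[count | present]
pred <- predict(presence, type = "response") *
        predict(abundance, newdata = survey, type = "response")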

In the second part, a two-stage spatial regression model was proposed for analyzing soil carbon stock (SOC) data. The first stage was a spatial linear mixed model that captured the linear and stationary effects. In the second stage, a generalized additive model was used to explain the nonlinear and nonstationary effects. The results illustrated that the two-stage model had good interpretability for understanding the effects of covariates while keeping prediction accuracy competitive with popular machine learning models such as random forests, XGBoost, and support vector machines.
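
A minimal sketch of the two-stage idea in R, with an ordinary linear model standing in for the spatial linear mixed model of stage one and an mgcv smooth surface capturing residual spatial structure in stage two (variable names and the data-generating process are assumptions).

library(mgcv)

set.seed(2)
n <- 400
soil <- data.frame(x = runif(n), y = runif(n),
                   clay = runif(n), ph = runif(n, 4, 8))
soil$soc <- 2 + 3 * soil$clay + sin(4 * soil$x) * cos(4 * soil$y) +
            rnorm(n, sd = 0.2)

stage1 <- lm(soc ~ clay + ph, data = soil)           # linear, stationary effects
stage2 <- gam(resid(stage1) ~ s(x, y), data = soil)  # nonlinear, nonstationary surface
fitted_total <- fitted(stage1) + fitted(stage2)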

A new nonlinear regression model, Gaussian process BART (Bayesian additive regression tree), was proposed in the third part. Combining the advantages of both BART and Gaussian processes, the model can capture the nonlinear effects of both observed and latent covariates. To develop the model, the traditional BART was first generalized to accommodate correlated errors. Then, the failure of likelihood-based Markov chain Monte Carlo (MCMC) for parameter estimation was discussed. Based on the idea of analysis of variation, two remedies, back comparing and tuning range, were proposed to tackle this failure. Finally, the effectiveness of the new model was examined in experiments on both simulated and real data.
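
In schematic notation, the model combines a BART prior on the mean function with a Gaussian process error structure (a hedged sketch, not the dissertation's exact specification):

\[ y_i = f(x_i) + \varepsilon_i, \qquad f \sim \text{BART}, \qquad (\varepsilon_1, \dots, \varepsilon_n) \sim \mathcal{N}(0, \Sigma), \]

where $\Sigma$ is built from a Gaussian process kernel over latent (e.g. spatial) covariates, so the trees capture nonlinear effects of observed covariates while the correlated errors absorb structure driven by unobserved ones.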
ContributorsLu, Xuetao (Author) / McCulloch, Robert (Thesis advisor) / Hahn, Paul (Committee member) / Lan, Shiwei (Committee member) / Zhou, Shuang (Committee member) / Saul, Steven (Committee member) / Arizona State University (Publisher)
Created2020
Description
Breeding seeds to include desirable traits (increased yield, drought/temperature resistance, etc.) is a growing and important method of establishing food security. However, besides breeder intuition, few decision-making tools exist that can provide breeders with credible evidence for deciding which seeds to progress to further stages of development. This thesis creates a chance-constrained knapsack optimization model, which the breeder can use to make better decisions about seed progression and reduce the level of risk in their selections. The model's objective is to select seed varieties out of a larger pool of varieties and maximize the average yield of the “knapsack” while meeting a risk criterion. Two models are created for different cases. The first is a risk-reduction model, which seeks to reduce the risk of a bad yield while still maximizing total yield. The second model considers the possibility of adverse environmental effects and seeks to mitigate their negative impact on total yield. In practice, breeders can use these models to better quantify uncertainty in selecting seed varieties.
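
A generic statement of a chance-constrained knapsack of this kind, with illustrative symbols: $x_i$ indicates selection of variety $i$, $Y_i$ is its random yield, $b$ a minimum acceptable total yield, $\alpha$ the tolerated risk, and $k$ the number of varieties that can be progressed.

\[ \max_{x \in \{0,1\}^n} \sum_{i=1}^{n} \mathbb{E}[Y_i]\, x_i \quad \text{s.t.} \quad \Pr\!\left(\sum_{i=1}^{n} Y_i\, x_i \ge b\right) \ge 1 - \alpha, \qquad \sum_{i=1}^{n} x_i \le k. \]

The chance constraint caps the probability of a bad outcome, which is how the risk-reduction model trades expected yield against downside risk.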
ContributorsOzcan, Ozkan Meric (Author) / Armbruster, Dieter (Thesis advisor) / Gel, Esma (Thesis advisor) / Sefair, Jorge (Committee member) / Arizona State University (Publisher)
Created2019
Description
We attempt to analyze the effect of fatigue on free throw efficiency in the National Basketball Association (NBA) using play-by-play data from regular-season, regulation-length games in the 2016-2017, 2017-2018, and 2018-2019 seasons. Using both regression and tree-based statistical methods, we analyze how total minutes played and minutes played continuously at the time of a free throw attempt affect a player's odds of making the attempt, while controlling for prior free throw shooting ability, longer-term fatigue, and other game factors. Our results offer strong evidence that short-term activity after periods of inactivity positively affects free throw efficiency, while longer-term fatigue has no effect.
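
A hedged R sketch of the regression component described (all column names and effect sizes are hypothetical):

set.seed(3)
n  <- 2000
ft <- data.frame(
  mins_total = runif(n, 0, 40),     # total minutes played before the attempt
  mins_cont  = runif(n, 0, 12),     # minutes played continuously
  prior_pct  = runif(n, 0.5, 0.95)  # shooter's prior free throw percentage
)
ft$made <- rbinom(n, 1, plogis(qlogis(ft$prior_pct) + 0.03 * ft$mins_cont))

# Odds of making the attempt as a function of short- and long-term activity,
# controlling for prior shooting ability
fit <- glm(made ~ mins_total + mins_cont + prior_pct, data = ft,
           family = binomial)
exp(coef(fit))   # odds ratios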

ContributorsRisch, Oliver (Author) / Armbruster, Dieter (Thesis director) / Hahn, P. Richard (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created2021-05
Description
The findings of this project show that, through the use of principal component analysis and K-Means clustering, NBA players can be algorithmically classified into distinct clusters, each representing a player archetype. Individual player data for the 2018-2019 regular season was collected for 150 players. This included regular per-game statistics, such as rebounds, assists, and field goals, and advanced statistics, such as usage percentage, win shares, and value over replacement player. The analysis was performed in the statistical programming language R within the integrated development environment RStudio. Principal component analysis was computed first, producing a set of five principal components that explain roughly 82.20% of the total variance within the player data. These five principal components were then used as the features against which the players were clustered in the K-Means algorithm implemented in R. Eight clusters were determined to best represent the groupings of the players, and each cluster contained a unique set of players. Each cluster was analyzed based on its member players, and a player archetype was established to define it. The reasoning behind each archetype is explained, with details on why the players were clustered together and the main data features that influenced the clustering results. For all but two of the clusters, the archetypes were shown to be independent of player position. The clustering can be expanded in the future to include a larger sample of players and used to make inferences regarding NBA roster construction, for instance by highlighting key weaknesses in rosters and showing which combinations of player archetypes lead to team success.
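
The pipeline maps directly onto base R; a minimal sketch with simulated placeholders standing in for the per-game and advanced statistics:

set.seed(4)
stats <- matrix(rnorm(150 * 12), nrow = 150,
                dimnames = list(NULL, paste0("stat", 1:12)))   # 150 players

pca <- prcomp(stats, center = TRUE, scale. = TRUE)
summary(pca)                 # cumulative proportion of variance explained
scores <- pca$x[, 1:5]       # keep the first five principal components

km <- kmeans(scores, centers = 8, nstart = 25)
table(km$cluster)            # players per archetype cluster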
ContributorsElam, Mason Matthew (Author) / Armbruster, Dieter (Thesis director) / Gel, Esma (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created2019-05