Matching Items (10)
Filtering by

Clear all filters

171508-Thumbnail Image.png
Description
Longitudinal data involving multiple subjects is quite popular in medical and social science areas. I consider generalized linear mixed models (GLMMs) applied to such longitudinal data, and the optimal design searching problem under such models. In this case, based on optimal design theory, the optimality criteria depend on the estimated

Longitudinal data involving multiple subjects is quite popular in medical and social science areas. I consider generalized linear mixed models (GLMMs) applied to such longitudinal data, and the optimal design searching problem under such models. In this case, based on optimal design theory, the optimality criteria depend on the estimated parameters, which leads to local optimality. Moreover, the information matrix under a GLMM doesn't have a closed-form expression. My dissertation includes three topics related to this design problem. The first part is searching for locally optimal designs under GLMMs with longitudinal data. I apply penalized quasi-likelihood (PQL) method to approximate the information matrix and compare several approximations to show the superiority of PQL over other approximations. Under different local parameters and design restrictions, locally D- and A- optimal designs are constructed based on the approximation. An interesting finding is that locally optimal designs sometimes apply different designs to different subjects. Finally, the robustness of these locally optimal designs is discussed. In the second part, an unknown observational covariate is added to the previous model. With an unknown observational variable in the experiment, expected optimality criteria are considered. Under different assumptions of the unknown variable and parameter settings, locally optimal designs are constructed and discussed. In the last part, Bayesian optimal designs are considered under logistic mixed models. Considering different priors of the local parameters, Bayesian optimal designs are generated. Bayesian design under such a model is usually expensive in time. The running time in this dissertation is optimized to an acceptable amount with accurate results. I also discuss the robustness of these Bayesian optimal designs, which is the motivation of applying such an approach.
ContributorsShi, Yao (Author) / Stufken, John (Thesis advisor) / Kao, Ming-Hung (Thesis advisor) / Lan, Shiwei (Committee member) / Pan, Rong (Committee member) / Reiser, Mark (Committee member) / Arizona State University (Publisher)
Created2022
187808-Thumbnail Image.png
Description
This dissertation covers several topics in machine learning and causal inference. First, the question of “feature selection,” a common byproduct of regularized machine learning methods, is investigated theoretically in the context of treatment effect estimation. This involves a detailed review and extension of frameworks for estimating causal effects and in-depth

This dissertation covers several topics in machine learning and causal inference. First, the question of “feature selection,” a common byproduct of regularized machine learning methods, is investigated theoretically in the context of treatment effect estimation. This involves a detailed review and extension of frameworks for estimating causal effects and in-depth theoretical study. Next, various computational approaches to estimating causal effects with machine learning methods are compared with these theoretical desiderata in mind. Several improvements to current methods for causal machine learning are identified and compelling angles for further study are pinpointed. Finally, a common method used for “explaining” predictions of machine learning algorithms, SHAP, is evaluated critically through a statistical lens.
ContributorsHerren, Andrew (Author) / Hahn, P Richard (Thesis advisor) / Kao, Ming-Hung (Committee member) / Lopes, Hedibert (Committee member) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Arizona State University (Publisher)
Created2023
187738-Thumbnail Image.png
Description
Humans cooperate at levels unseen in other species. Identifying the adaptive mechanisms driving this unusual behavior, as well as how these mechanisms interact to create complex cooperative patterns, remains an open question in anthropology. One impediment to such investigations is that complete, long-term datasets of human cooperative behaviors in small-scale

Humans cooperate at levels unseen in other species. Identifying the adaptive mechanisms driving this unusual behavior, as well as how these mechanisms interact to create complex cooperative patterns, remains an open question in anthropology. One impediment to such investigations is that complete, long-term datasets of human cooperative behaviors in small-scale societies are hard to come by; such field research is often hindered both by humans' long lifespans and by the difficulties of collecting data in remote societies. In this study, I attempted to overcome these methodological challenges by simulating individual human cooperative behaviors in a small-scale population. Using an agent-based model tuned to population-level measurements from a real-life marine subsistence population in the southern Philippines, I generated dynamic daily cooperative behaviors in a hypothetical subsistence population over a period of 1500 years and 42 overlapping generations. Preliminary findings from the model suggest that, while the agent-based model broadly captured a number of characteristic population-level patterns in the subsistence population, it did not fully replicate nuances of the population's observed cooperative behaviors. In particular, statistical models of the simulated data identified reciprocity-based and need-based cooperative behaviors but did not detect kinship-motivated cooperation, despite the fact that kin cooperation traits evolved positively and reciprocity cooperation traits evolved negatively over time in the agent population. It is possible that this discrepancy reflects a complex interaction between kinship and reciprocity in the agent-based model. On the other hand, it may also suggest that these types of statistical models, which are frequently utilized in human cooperation studies in the anthropological literature, do not reliably discriminate between kin-based and reciprocity-based cooperation mechanisms when both exist in a population. Even so, the completeness of the simulated data enabled use of more complex statistical methodologies which were able to disentangle the relative effects of cooperative mechanisms operating at different decision levels. By addressing remaining pattern-matching issues, future iterations of the agent-based model may prove to be a useful tool for validating empirical research and investigating novel hypotheses about the evolution and maintenance of cooperative behaviors in human populations.
ContributorsPhelps, Julia R. (Author) / Reiser, Mark (Thesis advisor) / Saul, Steven (Thesis advisor) / Morgan, Thomas (Committee member) / Arizona State University (Publisher)
Created2023
187395-Thumbnail Image.png
Description
This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview ofcommon techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of

This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview ofcommon techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of novel “reduced form” models which are designed to assess the particular challenges of different datasets. Chapter 3 explores the question of whether or not forecasts of bankruptcy cause bankruptcy. The question arises from the observation that companies issued going concern opinions were more likely to go bankrupt in the following year, leading people to speculate that the opinions themselves caused the bankruptcy via a “self-fulfilling prophecy”. A Bayesian machine learning sensitivity analysis is developed to answer this question. In exchange for additional flexibility and fewer assumptions, this approach loses point identification of causal effects and thus a sensitivity analysis is developed to study a wide range of plausible scenarios of the causal effect of going concern opinions on bankruptcy. Reported in the simulations are different performance metrics of the model in comparison with other popular methods and a robust analysis of the sensitivity of the model to mis-specification. Results on empirical data indicate that forecasts of bankruptcies likely do have a small causal effect. Chapter 4 studies the effects of vaccination on COVID-19 mortality at the state level in the United States. The dynamic nature of the pandemic complicates more straightforward regression adjustments and invalidates many alternative models. The chapter comments on the limitations of mechanistic approaches as well as traditional statistical methods to epidemiological data. Instead, a state space model is developed that allows the study of the ever-changing dynamics of the pandemic’s progression. In the first stage, the model decomposes the observed mortality data into component surges, and later uses this information in a semi-parametric regression model for causal analysis. Results are investigated thoroughly for empirical justification and stress-tested in simulated settings.
ContributorsPapakostas, Demetrios (Author) / Hahn, Paul (Thesis advisor) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Kao, Ming-Hung (Committee member) / Lan, Shiwei (Committee member) / Arizona State University (Publisher)
Created2023
171927-Thumbnail Image.png
Description
Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult not every infection can be recorded by public health authorities. Notably, this may happen with whole country measles case reports, even such countries with robust registration systems.

Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult not every infection can be recorded by public health authorities. Notably, this may happen with whole country measles case reports, even such countries with robust registration systems. Eilertson et al. (2019) propose using a state-space model combined with maximum likelihood methods for estimating measles transmission. A Bayesian approach that uses particle Markov Chain Monte Carlo (pMCMC) is proposed to estimate the parameters of the non-linear state-space model developed in Eilertson et al. (2019) and similar previous studies. This dissertation illustrates the performance of this approach by calculating posterior estimates of the model parameters and predictions of the unobserved states in simulations and case studies. Also, Iteration Filtering (IF2) is used as a support method to verify the Bayesian estimation and to inform the selection of prior distributions. In the second half of the thesis, a birth-death process is proposed to model the unobserved population size of a disease vector. This model studies the effect of a disease vector population size on a second affected population. The second population follows a non-homogenous Poisson process when conditioned on the vector process with a transition rate given by a scaled version of the vector population. The observation model also measures a potential threshold event when the host species population size surpasses a certain level yielding a higher transmission rate. A maximum likelihood procedure is developed for this model, which combines particle filtering with the Minorize-Maximization (MM) algorithm and extends the work of Crawford et al. (2014).
ContributorsMartinez Rivera, Wilmer Osvaldo (Author) / Fricks, John (Thesis advisor) / Reiser, Mark (Committee member) / Zhou, Shuang (Committee member) / Cheng, Dan (Committee member) / Lan, Shiwei (Committee member) / Arizona State University (Publisher)
Created2022
171467-Thumbnail Image.png
Description
Goodness-of-fit test is a hypothesis test used to test whether a given model fit the data well. It is extremely difficult to find a universal goodness-of-fit test that can test all types of statistical models. Moreover, traditional Pearson’s chi-square goodness-of-fit test is sometimes considered to be an omnibus test but

Goodness-of-fit test is a hypothesis test used to test whether a given model fit the data well. It is extremely difficult to find a universal goodness-of-fit test that can test all types of statistical models. Moreover, traditional Pearson’s chi-square goodness-of-fit test is sometimes considered to be an omnibus test but not a directional test so it is hard to find the source of poor fit when the null hypothesis is rejected and it will lose its validity and effectiveness in some of the special conditions. Sparseness is such an abnormal condition. One effective way to overcome the adverse effects of sparseness is to use limited-information statistics. In this dissertation, two topics about constructing and using limited-information statistics to overcome sparseness for binary data will be included. In the first topic, the theoretical framework of pairwise concordance and the transformation matrix which is used to extract the corresponding marginals and their generalizations are provided. Then a series of new chi-square test statistics and corresponding orthogonal components are proposed, which are used to detect the model misspecification for longitudinal binary data. One of the important conclusions is, the test statistic $X^2_{2c}$ can be taken as an extension of $X^2_{[2]}$, the second-order marginals of traditional Pearson’s chi-square statistic. In the second topic, the research interest is to investigate the effect caused by different intercept patterns when using Lagrange multiplier (LM) test to find the source of misfit for two items in 2-PL IRT model. Several other directional chi-square test statistics are taken into comparison. The simulation results showed that the intercept pattern does affect the performance of goodness-of-fit test, especially the power to find the source of misfit if the source of misfit does exist. More specifically, the power is directly affected by the `intercept distance' between two misfit variables. Another discovery is, the LM test statistic has the best balance between the accurate Type I error rates and high empirical power, which indicates the LM test is a robust test.
ContributorsXu, Jinhui (Author) / Reiser, Mark (Thesis advisor) / Kao, Ming-Hung (Committee member) / Wilson, Jeffrey (Committee member) / Zheng, Yi (Committee member) / Edwards, Michael (Committee member) / Arizona State University (Publisher)
Created2022
191496-Thumbnail Image.png
Description
This dissertation centers on Bayesian Additive Regression Trees (BART) and Accelerated BART (XBART) and presents a series of models that tackle extrapolation, classification, and causal inference challenges. To improve extrapolation in tree-based models, I propose a method called local Gaussian Process (GP) that combines Gaussian process regression with trained BART

This dissertation centers on Bayesian Additive Regression Trees (BART) and Accelerated BART (XBART) and presents a series of models that tackle extrapolation, classification, and causal inference challenges. To improve extrapolation in tree-based models, I propose a method called local Gaussian Process (GP) that combines Gaussian process regression with trained BART trees. This allows for extrapolation based on the most relevant data points and covariate variables determined by the trees' structure. The local GP technique is extended to the Bayesian causal forest (BCF) models to address the positivity violation issue in causal inference. Additionally, I introduce the LongBet model to estimate time-varying, heterogeneous treatment effects in panel data. Furthermore, I present a Poisson-based model, with a modified likelihood for XBART for the multi-class classification problem.
ContributorsWang, Meijia (Author) / Hahn, Paul (Thesis advisor) / He, Jingyu (Committee member) / Lan, Shiwei (Committee member) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Arizona State University (Publisher)
Created2024
158850-Thumbnail Image.png
Description
Spatial regression is one of the central topics in spatial statistics. Based on the goals, interpretation or prediction, spatial regression models can be classified into two categories, linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real world applications. New methods and models were

Spatial regression is one of the central topics in spatial statistics. Based on the goals, interpretation or prediction, spatial regression models can be classified into two categories, linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real world applications. New methods and models were proposed to overcome the challenges in practice. There are three major parts in the dissertation.

In the first part, nonlinear regression models were embedded into a multistage workflow to predict the spatial abundance of reef fish species in the Gulf of Mexico. There were two challenges, zero-inflated data and out of sample prediction. The methods and models in the workflow could effectively handle the zero-inflated sampling data without strong assumptions. Three strategies were proposed to solve the out of sample prediction problem. The results and discussions showed that the nonlinear prediction had the advantages of high accuracy, low bias and well-performed in multi-resolution.

In the second part, a two-stage spatial regression model was proposed for analyzing soil carbon stock (SOC) data. In the first stage, there was a spatial linear mixed model that captured the linear and stationary effects. In the second stage, a generalized additive model was used to explain the nonlinear and nonstationary effects. The results illustrated that the two-stage model had good interpretability in understanding the effect of covariates, meanwhile, it kept high prediction accuracy which is competitive to the popular machine learning models, like, random forest, xgboost and support vector machine.

A new nonlinear regression model, Gaussian process BART (Bayesian additive regression tree), was proposed in the third part. Combining advantages in both BART and Gaussian process, the model could capture the nonlinear effects of both observed and latent covariates. To develop the model, first, the traditional BART was generalized to accommodate correlated errors. Then, the failure of likelihood based Markov chain Monte Carlo (MCMC) in parameter estimating was discussed. Based on the idea of analysis of variation, back comparing and tuning range, were proposed to tackle this failure. Finally, effectiveness of the new model was examined by experiments on both simulation and real data.
ContributorsLu, Xuetao (Author) / McCulloch, Robert (Thesis advisor) / Hahn, Paul (Committee member) / Lan, Shiwei (Committee member) / Zhou, Shuang (Committee member) / Saul, Steven (Committee member) / Arizona State University (Publisher)
Created2020
161250-Thumbnail Image.png
Description
Inside cells, axonal and dendritic transport by motor proteins is a process that is responsible for supplying cargo, such as vesicles and organelles, to support neuronal function. Motor proteins achieve transport through a cycle of chemical and mechanical processes. Particle tracking experiments are used to study this intracellular cargo transport

Inside cells, axonal and dendritic transport by motor proteins is a process that is responsible for supplying cargo, such as vesicles and organelles, to support neuronal function. Motor proteins achieve transport through a cycle of chemical and mechanical processes. Particle tracking experiments are used to study this intracellular cargo transport by recording multi-dimensional, discrete cargo position trajectories over time. However, due to experimental limitations, much of the mechanochemical process cannot be directly observed, making mathematical modeling and statistical inference an essential tool for identifying the underlying mechanisms. The cargo movement during transport is modeled using a switching stochastic differential equation framework that involves classification into one of three proposed hidden regimes. Each regime is characterized by different levels of velocity and stochasticity. The equations are presented as a state-space model with Markovian properties. Through a stochastic expectation-maximization algorithm, statistical inference can be made based on the observed trajectory. Regime predictions and particle location predictions are calculated through an auxiliary particle filter and particle smoother. Based on these predictions, parameters are estimated through maximum likelihood. Diagnostics are proposed that can assess model performance and therefore also be a form of model selection criteria. Model selection is used to find the most accurate regime models and the optimal number of regimes for a certain motor-cargo system. A method for incorporating a second positional dimension is also introduced. These methods are tested on both simulated data and different types of experimental data.
ContributorsCrow, Lauren (Author) / Fricks, John (Thesis advisor) / McKinley, Scott (Committee member) / Hahn, Paul R (Committee member) / Reiser, Mark (Committee member) / Cheng, Dan (Committee member) / Arizona State University (Publisher)
Created2021
190865-Thumbnail Image.png
Description
This dissertation centers on treatment effect estimation in the field of causal inference, and aims to expand the toolkit for effect estimation when the treatment variable is binary. Two new stochastic tree-ensemble methods for treatment effect estimation in the continuous outcome setting are presented. The Accelerated Bayesian Causal Forrest (XBCF)

This dissertation centers on treatment effect estimation in the field of causal inference, and aims to expand the toolkit for effect estimation when the treatment variable is binary. Two new stochastic tree-ensemble methods for treatment effect estimation in the continuous outcome setting are presented. The Accelerated Bayesian Causal Forrest (XBCF) model handles variance via a group-specific parameter, and the Heteroskedastic version of XBCF (H-XBCF) uses a separate tree ensemble to learn covariate-dependent variance. This work also contributes to the field of survival analysis by proposing a new framework for estimating survival probabilities via density regression. Within this framework, the Heteroskedastic Accelerated Bayesian Additive Regression Trees (H-XBART) model, which is also developed as part of this work, is utilized in treatment effect estimation for right-censored survival outcomes. All models have been implemented as part of the XBART R package, and their performance is evaluated via extensive simulation studies with appropriate sets of comparators. The contributed methods achieve similar levels of performance, while being orders of magnitude (sometimes as much as 100x) faster than comparator state-of-the-art methods, thus offering an exciting opportunity for treatment effect estimation in the large data setting.
ContributorsKrantsevich, Nikolay (Author) / Hahn, P Richard (Thesis advisor) / McCulloch, Robert (Committee member) / Zhou, Shuang (Committee member) / Lan, Shiwei (Committee member) / He, Jingyu (Committee member) / Arizona State University (Publisher)
Created2023