Algorithms for Bayesian Conditional Density Estimation on a Large Dataset

Description

This dissertation considers algorithmic techniques for non- or semi-parametric Bayesian estimation of density functions or conditional density functions. Specifically, computational methods are developed for performing Markov chain Monte Carlo (MCMC) simulation from posterior distributions over density functions when the dataset being analyzed is quite large, say, millions of observations on dozens or hundreds of covariates. The motivating scientific problem is the relationship between low birth weight and various factors such as maternal attributes and prenatal circumstances.

Low birth weight is a critical public health concern, contributing to infant mortality and childhood disabilities. The dissertation utilizes birth records to investigate the impact of maternal attributes and prenatal circumstances on birth weight through a statistical method called density regression. The challenges arise from the estimation of the density function, the presence of outliers, irregular structures in the data, and the complexity of selecting an appropriate method. To address these challenges, the study employs a Bayesian Gaussian mixture model inspired by kernel density methods, and develops a fast MCMC algorithm tailored to the computational demands of large datasets. A targeted sample selection procedure is introduced to overcome difficulties in analyzing weakly informative data.

To further address these challenges, a sophisticated clustering methodology is incorporated. The study leverages the creation of clusters of different sizes, emphasizing the scalability and complexity of cluster formation within a dichotomous state space. Valid clusters, representing unique combinations of data points from distinct feature states, offer a granular understanding of patterns in the dataset.
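As a toy illustration of the mixture-model backbone (not the dissertation's algorithm, which is engineered for millions of observations), the sketch below runs a conjugate Gibbs sampler for a two-component Gaussian mixture with unit component variances; the priors, component count, and synthetic data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a two-component Gaussian mixture (true means -2 and 2).
n = 1000
z_true = rng.random(n) < 0.5
y = np.where(z_true, rng.normal(-2.0, 1.0, n), rng.normal(2.0, 1.0, n))

# Gibbs sampler for a K-component mixture with unit component variances,
# N(0, 10^2) priors on the means, and a symmetric Dirichlet(1) prior on weights.
K, iters = 2, 500
mu = np.array([-1.0, 1.0])          # initial means
w = np.full(K, 1.0 / K)             # initial weights
mu_draws = []
for it in range(iters):
    # 1) Sample component labels given means and weights.
    logp = np.log(w) - 0.5 * (y[:, None] - mu[None, :]) ** 2
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    u = rng.random(n)
    z = (u[:, None] > np.cumsum(p, axis=1)).sum(axis=1)
    # 2) Sample means given labels (conjugate normal update).
    for k in range(K):
        yk = y[z == k]
        prec = len(yk) + 1.0 / 100.0
        mu[k] = rng.normal(yk.sum() / prec, 1.0 / np.sqrt(prec))
    # 3) Sample weights given labels (conjugate Dirichlet update).
    counts = np.bincount(z, minlength=K)
    w = rng.dirichlet(counts + 1.0)
    if it >= 100:                   # discard burn-in, sort to undo label switching
        mu_draws.append(np.sort(mu.copy()))

post_mean = np.mean(mu_draws, axis=0)
print(post_mean)                    # posterior means close to the true (-2, 2)
```

The large-data algorithms the dissertation develops replace the per-observation label step above, which is the bottleneck at millions of observations, with faster machinery; this sketch only shows the statistical structure being accelerated.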

Details

Contributors
Date Created
2024
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2024
  • Field of study: Applied Mathematics

Additional Information

English
Extent
  • 64 pages
Open Access
Peer-reviewed

Effects of Stochastic Phase Variation on Parameter Estimation in Dynamical Systems

Description

Accurate parameter estimation in systems of ODEs can be critical to a scientific analysis due to the often physical interpretation of parameters. Historically, researchers have mainly built models incorporating the effects of amplitude variation --- the differing magnitude of responses at any given point in time, typically modeled as additive iid Gaussian error --- on parameter estimates. What does not appear to be implemented yet is a model that also incorporates the effects of phase variation --- the differing points in time where features of a process occur. I present a Bayesian hierarchical model to address this objective, in which a key focus is the improved performance obtained by using Hamiltonian Monte Carlo (HMC) to estimate the posterior distribution. Both simulated and experimentally gathered data are used to demonstrate the performance of the model and the consequences of ignoring phase variation. Lastly, I conduct studies on the asymptotic performance of a parameter estimator within a frequentist framework.
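The HMC machinery can be sketched in a few lines on a toy target. The code below implements leapfrog integration with a Metropolis accept/reject step for a one-dimensional standard normal; the dissertation's hierarchical model is of course far richer, so this shows only the core mechanics, with step size and trajectory length chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy target: standard normal, log p(q) = -q^2/2, so grad log p(q) = -q.
def grad_logp(q):
    return -q

def leapfrog(q, p, step, n_steps):
    """Symplectic leapfrog integration of Hamiltonian dynamics."""
    p = p + 0.5 * step * grad_logp(q)
    for _ in range(n_steps - 1):
        q = q + step * p
        p = p + step * grad_logp(q)
    q = q + step * p
    p = p + 0.5 * step * grad_logp(q)
    return q, p

def hmc(n_draws, step=0.3, n_steps=10):
    q = 0.0
    draws = []
    for _ in range(n_draws):
        p0 = rng.normal()                       # resample momentum
        q_new, p_new = leapfrog(q, p0, step, n_steps)
        # Metropolis correction with Hamiltonian H = -log p(q) + p^2/2.
        h_old = 0.5 * q ** 2 + 0.5 * p0 ** 2
        h_new = 0.5 * q_new ** 2 + 0.5 * p_new ** 2
        if rng.random() < np.exp(h_old - h_new):
            q = q_new
        draws.append(q)
    return np.array(draws)

draws = hmc(5000)
print(draws.mean(), draws.std())   # approximately 0 and 1
```

In a realistic ODE setting the gradient of the log posterior comes from sensitivity equations or automatic differentiation rather than a closed form, which is where HMC's efficiency advantage over random-walk samplers is earned.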

Details

Contributors
Date Created
2024
Topical Subject
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2024
  • Field of study: Statistics

Additional Information

English
Extent
  • 213 pages
Open Access
Peer-reviewed

Machine Learning and Causal Inference: Theory, Examples, and Computational Results

Description

This dissertation covers several topics in machine learning and causal inference. First, the question of “feature selection,” a common byproduct of regularized machine learning methods, is investigated theoretically in the context of treatment effect estimation. This involves a detailed review and extension of frameworks for estimating causal effects and in-depth theoretical study. Next, various computational approaches to estimating causal effects with machine learning methods are compared with these theoretical desiderata in mind. Several improvements to current methods for causal machine learning are identified and compelling angles for further study are pinpointed. Finally, a common method used for “explaining” predictions of machine learning algorithms, SHAP, is evaluated critically through a statistical lens.

Details

Contributors
Date Created
2023
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2023
  • Field of study: Statistics

Additional Information

English
Extent
  • 137 pages
Open Access
Peer-reviewed

Case Studies in Machine Learning of Reduced Form Models for Causal Inference

Description

This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview of common techniques in causal inference, with a focus on models relevant to the data explored in later chapters. The rest of the dissertation focuses on the development of novel “reduced form” models designed to address the particular challenges of different datasets.

Chapter 3 explores the question of whether or not forecasts of bankruptcy cause bankruptcy. The question arises from the observation that companies issued going concern opinions were more likely to go bankrupt in the following year, leading people to speculate that the opinions themselves caused the bankruptcy via a “self-fulfilling prophecy”. A Bayesian machine learning sensitivity analysis is developed to answer this question. In exchange for additional flexibility and fewer assumptions, this approach loses point identification of causal effects, and thus a sensitivity analysis is developed to study a wide range of plausible scenarios for the causal effect of going concern opinions on bankruptcy. The simulations report different performance metrics of the model in comparison with other popular methods, along with a robust analysis of the model's sensitivity to mis-specification. Results on empirical data indicate that forecasts of bankruptcies likely do have a small causal effect.

Chapter 4 studies the effects of vaccination on COVID-19 mortality at the state level in the United States. The dynamic nature of the pandemic complicates more straightforward regression adjustments and invalidates many alternative models. The chapter comments on the limitations of mechanistic approaches as well as traditional statistical methods for epidemiological data. Instead, a state space model is developed that allows the study of the ever-changing dynamics of the pandemic's progression. In the first stage, the model decomposes the observed mortality data into component surges, and later uses this information in a semi-parametric regression model for causal analysis. Results are investigated thoroughly for empirical justification and stress-tested in simulated settings.

Details

Contributors
Date Created
2023
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2023
  • Field of study: Mathematics

Additional Information

English
Extent
  • 164 pages
Open Access
Peer-reviewed

Optimal Designs under Logistic Mixed Models

Description

Longitudinal data involving multiple subjects are common in medical and social science studies. I consider generalized linear mixed models (GLMMs) applied to such longitudinal data, and the problem of searching for optimal designs under such models. In this setting, the optimality criteria from optimal design theory depend on the estimated parameters, which leads to local optimality. Moreover, the information matrix under a GLMM has no closed-form expression. My dissertation includes three topics related to this design problem.

The first part searches for locally optimal designs under GLMMs with longitudinal data. I apply the penalized quasi-likelihood (PQL) method to approximate the information matrix and compare several approximations to show the superiority of PQL. Under different local parameters and design restrictions, locally D- and A-optimal designs are constructed based on the approximation. An interesting finding is that locally optimal designs sometimes assign different designs to different subjects. Finally, the robustness of these locally optimal designs is discussed.

In the second part, an unknown observational covariate is added to the previous model. With an unknown observational variable in the experiment, expected optimality criteria are considered. Under different assumptions on the unknown variable and different parameter settings, locally optimal designs are constructed and discussed.

In the last part, Bayesian optimal designs are considered under logistic mixed models, generated under different priors on the local parameters. Computing a Bayesian design under such a model is usually time-consuming; the running time in this dissertation is reduced to an acceptable level while preserving accurate results. I also discuss the robustness of these Bayesian optimal designs, which is the motivation for applying such an approach.
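The local-optimality idea can be made concrete for a plain fixed-effects logistic model (no random effects, so the PQL approximation the dissertation develops is not needed here). The sketch below evaluates the D-criterion, the log-determinant of the Fisher information, at an assumed local parameter value, and compares a naive equally spaced design against the classical two-point design that places equal weight where the linear predictor equals ±1.5434.

```python
import numpy as np

beta = np.array([0.0, 1.0])   # assumed local parameter values (intercept, slope)

def info_matrix(xs, weights, beta):
    """Fisher information for logistic regression at design points xs,
    M = sum_i w_i * p_i * (1 - p_i) * f(x_i) f(x_i)^T."""
    M = np.zeros((2, 2))
    for x, w in zip(xs, weights):
        f = np.array([1.0, x])                  # regressors (intercept, x)
        p = 1.0 / (1.0 + np.exp(-(f @ beta)))
        M += w * p * (1.0 - p) * np.outer(f, f)
    return M

def d_criterion(xs, weights):
    """Log-determinant of the information matrix; larger is better."""
    return np.log(np.linalg.det(info_matrix(xs, weights, beta)))

naive = d_criterion([-3.0, -1.0, 1.0, 3.0], [0.25] * 4)
optimal = d_criterion([-1.5434, 1.5434], [0.5, 0.5])
print(naive, optimal)   # the two-point design has the larger criterion value
```

With random effects added, the information matrix loses this closed form, which is exactly why the dissertation needs an approximation such as PQL before the same criterion comparison can be run.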

Details

Contributors
Date Created
2022
Topical Subject
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2022
  • Field of study: Statistics

Additional Information

English
Extent
  • 92 pages
Open Access
Peer-reviewed

Advances in Directional Goodness-of-fit Testing of Binary Data under Model Misspecification in Case of Sparseness

Description

A goodness-of-fit test is a hypothesis test used to assess whether a given model fits the data well. It is extremely difficult to find a universal goodness-of-fit test that can handle all types of statistical models. Moreover, the traditional Pearson's chi-square goodness-of-fit test is an omnibus test rather than a directional test, so it is hard to locate the source of poor fit when the null hypothesis is rejected, and the test loses its validity and effectiveness under certain conditions. Sparseness is one such condition. One effective way to overcome the adverse effects of sparseness is to use limited-information statistics. This dissertation covers two topics on constructing and using limited-information statistics to overcome sparseness in binary data.

In the first topic, the theoretical framework of pairwise concordance is provided, along with the transformation matrix used to extract the corresponding marginals and their generalizations. A series of new chi-square test statistics and corresponding orthogonal components are then proposed, which are used to detect model misspecification for longitudinal binary data. One important conclusion is that the test statistic $X^2_{2c}$ can be viewed as an extension of $X^2_{[2]}$, the traditional Pearson's chi-square statistic based on second-order marginals.

In the second topic, the research interest is the effect of different intercept patterns when using the Lagrange multiplier (LM) test to find the source of misfit for two items in a 2-PL IRT model. Several other directional chi-square test statistics are included for comparison. The simulation results show that the intercept pattern does affect the performance of the goodness-of-fit test, especially the power to find the source of misfit when one exists. More specifically, the power is directly affected by the `intercept distance' between the two misfit variables. Another finding is that the LM test statistic achieves the best balance between accurate Type I error rates and high empirical power, indicating that the LM test is robust.
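For reference, the full-information Pearson statistic that the limited-information statistics are designed to replace under sparseness looks as follows on a small 2×2 cross-classification of two binary items, where sparseness is not yet a problem (the counts are invented for illustration).

```python
import numpy as np
from scipy.stats import chi2

# Observed counts for two binary items cross-classified (rows: item 1 = 0/1,
# columns: item 2 = 0/1); invented for illustration.
obs = np.array([[35, 15],
                [20, 30]])
n = obs.sum()

# Expected counts under the independence model, fitted from the margins.
row = obs.sum(axis=1) / n
col = obs.sum(axis=0) / n
exp = n * np.outer(row, col)

# Full-information Pearson statistic over all cells: X^2 = sum (O - E)^2 / E.
X2 = ((obs - exp) ** 2 / exp).sum()
df = 1                               # (2 - 1) * (2 - 1) for a 2x2 table
p_value = chi2.sf(X2, df)
print(X2, p_value)
```

With k binary items the table has 2^k cells, so for moderate k most cells are empty at realistic sample sizes and the chi-square reference distribution breaks down; statistics built from low-order marginals, as in the dissertation, sidestep this by never forming the full table.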

Details

Contributors
Date Created
2022
Embargo Release Date
Topical Subject
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2022
  • Field of study: Statistics

Additional Information

English
Extent
  • 241 pages
Open Access
Peer-reviewed

Partition of Unity Methods for Solving Partial Differential Equations on Surfaces

Description

Solving partial differential equations on surfaces has many applications, including modeling chemical diffusion, pattern formation, geophysics, and texture mapping. This dissertation presents two techniques for solving time-dependent partial differential equations on various surfaces using the partition of unity method. A novel spectral cubed sphere method that utilizes the windowed Fourier technique is presented and used for both approximating functions on spherical domains and solving partial differential equations. The spectral cubed sphere method is applied to solve the transport equation as well as the diffusion equation on the unit sphere. The second approach is a partition of unity method with local radial basis function approximations. This technique is also used to explore the effect of the node distribution, as it is well known that node choice plays an important role in the accuracy and stability of an approximation. A greedy algorithm is implemented to generate good interpolation nodes using the column-pivoted QR factorization. The partition of unity radial basis function method is applied to solve the diffusion equation on the sphere as well as a system of reaction-diffusion equations on multiple surfaces, including the surface of a red blood cell, a torus, and the Stanford bunny. Accuracy and stability of both methods are investigated.

Details

Contributors
Date Created
2021
Resource Type
Language
  • eng
Note
  • Partial requirement for: Ph.D., Arizona State University, 2021
  • Field of study: Mathematics

Additional Information

English
Extent
  • 76 pages
Open Access
Peer-reviewed

Contributions to Optimal Experimental Design and Strategic Subdata Selection for Big Data

Description

In this dissertation two research questions in the field of applied experimental design were explored. First, methods for augmenting the three-level screening designs called Definitive Screening Designs (DSDs) were investigated. Second, schemes for strategic subdata selection for nonparametric predictive modeling with big data were developed.

Under sparsity, the structure of DSDs can allow for the screening and optimization of a system in one step, but in non-sparse situations estimation of second-order models requires augmentation of the DSD. In this work, augmentation strategies for DSDs were considered, given the assumption that the correct form of the model for the response of interest is quadratic. Series of augmented designs were constructed and explored, and power calculations, model-robustness criteria, model-discrimination criteria, and simulation study results were used to identify the number of augmented runs necessary for (1) effectively identifying active model effects, and (2) precisely predicting a response of interest. When the goal is identification of active effects, it is shown that supersaturated designs are sufficient; when the goal is prediction, it is shown that little is gained by augmenting beyond the design that is saturated for the full quadratic model. Surprisingly, augmentation strategies based on the I-optimality criterion do not lead to better predictions than strategies based on the D-optimality criterion.

Computational limitations can render standard statistical methods infeasible in the face of massive datasets, necessitating subsampling strategies. In the big data context, the primary objective is often prediction but the correct form of the model for the response of interest is likely unknown. Here, two new methods of subdata selection were proposed. The first is based on clustering, the second is based on space-filling designs, and both are free from model assumptions. The performance of the proposed methods was explored visually via low-dimensional simulated examples; via real data applications; and via large simulation studies. In all cases the proposed methods were compared to existing, widely used subdata selection methods. The conditions under which the proposed methods provide advantages over standard subdata selection strategies were identified.
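The clustering-based idea can be sketched with a generic k-means pass: cluster the full dataset, then retain the single observation nearest each centroid as the subdata, with no model for the response assumed anywhere. This is a hedged illustration of the general strategy only; the selection scheme proposed in the dissertation may differ in its details.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(3)
X = rng.normal(size=(20_000, 5))     # stand-in for a massive covariate matrix

def cluster_subdata(X, k, iters=10):
    """Model-free subdata selection: Lloyd's k-means on the covariates,
    then keep the observation closest to each final centroid."""
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = cdist(X, centers, "sqeuclidean").argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Keep the single observation nearest each final center.
    idx = cdist(X, centers, "sqeuclidean").argmin(axis=0)
    return X[idx]

sub = cluster_subdata(X, k=50)
print(sub.shape)   # (50, 5)
```

Because the subdata spreads over the covariate space rather than concentrating where a presumed model is most informative, it remains useful when the eventual predictive model is nonparametric and unknown at selection time, which is the setting the dissertation targets.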

Details

Contributors
Date Created
2020
Resource Type
Language
  • eng
Note
  • Doctoral Dissertation Statistics 2020

Additional Information

English
Extent
  • 108 pages
Open Access
Peer-reviewed

Essays on the Modeling of Binary Longitudinal Data with Time-dependent Covariates

Description

Longitudinal studies contain correlated data due to the repeated measurements on the same subject. The changing values of the time-dependent covariates and their association with the outcomes present another source of correlation. Most methods used to analyze longitudinal data average the effects of time-dependent covariates on outcomes over time and provide a single regression coefficient per time-dependent covariate. This denies researchers the opportunity to follow the changing impact of time-dependent covariates on the outcomes. This dissertation addresses this issue through the use of partitioned regression coefficients in three different papers.

In the first paper, an alternative approach to the partitioned Generalized Method of Moments logistic regression model for longitudinal binary outcomes is presented. This method relies on Bayes estimators and is utilized when the partitioned Generalized Method of Moments model provides numerically unstable estimates of the regression coefficients. It is used to model obesity status in the Add Health study and cognitive impairment diagnosis in the National Alzheimer’s Coordination Center database.

The second paper develops a model that allows the joint modeling of two or more binary outcomes that provide an overall measure of a subject's trait over time. The simultaneous modeling of all outcomes provides a complete picture of the overall measure of interest. This approach accounts for the correlation among and between the outcomes across time and the changing effects of time-dependent covariates on the outcomes. The model is used to analyze four outcomes measuring the overall quality of life in the Chinese Longitudinal Healthy Longevity Study.

The third paper presents an approach that allows for estimation of cross-sectional and lagged effects of the covariates on the outcome as well as the feedback of the response on future covariates. This is done in two parts: in part 1, the effects of time-dependent covariates on the outcomes are estimated; then, in part 2, the influences of the outcome on future values of the covariates are measured. These model parameters are obtained through a Generalized Method of Moments procedure that uses valid moment conditions between the outcome and the covariates. Child morbidity in the Philippines and obesity status in the Add Health data are analyzed.

Details

Contributors
Date Created
2020
Embargo Release Date
Resource Type
Language
  • eng
Note
  • Doctoral Dissertation Statistics 2020

Additional Information

English
Extent
  • 136 pages
Open Access
Peer-reviewed

Comparison of Denominator Degrees of Freedom Approximations for Linear Mixed Models in Small-Sample Simulations

Description

Whilst linear mixed models offer a flexible approach to handling data with multiple sources of random variability, hypothesis testing for the fixed effects often encounters obstacles when the sample size is small and the underlying distribution of the test statistic is unknown. Consequently, five methods of approximating the denominator degrees of freedom (residual, containment, between-within, Satterthwaite, Kenward-Roger) have been developed to overcome this problem. This study aims to evaluate the performance of these five methods for a mixed model with a random intercept and a random slope. Specifically, simulations are conducted to provide insight into the F-statistics, denominator degrees of freedom, and p-values each method gives under different settings of the sample structure, the fixed-effect slopes, and the missing-data proportion. The simulation results show that the residual method performs the worst in terms of F-statistics and p-values. Also, the Satterthwaite and Kenward-Roger methods tend to be more sensitive to changes in the design. The Kenward-Roger method performs best in terms of F-statistics when the null hypothesis is true.
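The classical Satterthwaite formula that underlies one of these approximations can be stated compactly: for a linear combination of independent mean squares L = Σ a_i MS_i, the approximate degrees of freedom are (Σ a_i MS_i)² / Σ (a_i MS_i)² / df_i. A minimal sketch follows; the mean squares and their degrees of freedom are invented for illustration, and the mixed-model software implementations being compared in the thesis are considerably more involved.

```python
def satterthwaite_df(coefs, mean_squares, dfs):
    """Satterthwaite approximation to the degrees of freedom of a linear
    combination of independent mean squares L = sum(a_i * MS_i)."""
    num = sum(a * ms for a, ms in zip(coefs, mean_squares)) ** 2
    den = sum((a * ms) ** 2 / df for a, ms, df in zip(coefs, mean_squares, dfs))
    return num / den

# Example: combining a between-groups mean square (MS=4, df=5) with a
# residual mean square (MS=2, df=20), both with coefficient 1.
df = satterthwaite_df([1.0, 1.0], [4.0, 2.0], [5, 20])
print(df)   # about 10.59
```

The Kenward-Roger method builds on the same idea but additionally inflates the estimated covariance matrix of the fixed effects to account for the uncertainty in the estimated variance components, which is why the two methods often behave similarly yet diverge in small samples.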

Details

Contributors
Date Created
2020
Resource Type
Language
  • eng
Note
  • Masters Thesis Statistics 2020

Additional Information

English
Extent
  • 68 pages
Open Access
Peer-reviewed