## Matching Items (9)

##### Filtering by

- All Subjects: Statistics
- All Subjects: spatial statistics
- Member of: ASU Electronic Theses and Dissertations

Though the likelihood is a useful tool for obtaining estimates of regression parameters, it is not readily available in the fit of hierarchical binary data models. The correlated observations negate the opportunity to have a joint likelihood when fitting hierarchical logistic regression models. Through conditional likelihood, inferences for the regression…

Though the likelihood is a useful tool for obtaining estimates of regression parameters, it is not readily available in the fit of hierarchical binary data models. The correlated observations negate the opportunity to have a joint likelihood when fitting hierarchical logistic regression models. Through conditional likelihood, inferences for the regression and covariance parameters as well as the intraclass correlation coefficients are usually obtained. In those cases, I have resorted to use of Laplace approximation and large sample theory approach for point and interval estimates such as Wald-type confidence intervals and profile likelihood confidence intervals. These methods rely on distributional assumptions and large sample theory. However, when dealing with small hierarchical datasets they often result in severe bias or non-convergence. I present a generalized quasi-likelihood approach and a generalized method of moments approach; both do not rely on any distributional assumptions but only moments of response. As an alternative to the typical large sample theory approach, I present bootstrapping hierarchical logistic regression models which provides more accurate interval estimates for small binary hierarchical data. These models substitute computations as an alternative to the traditional Wald-type and profile likelihood confidence intervals. I use a latent variable approach with a new split bootstrap method for estimating intraclass correlation coefficients when analyzing binary data obtained from a three-level hierarchical structure. It is especially useful with small sample size and easily expanded to multilevel. Comparisons are made to existing approaches through both theoretical justification and simulation studies. Further, I demonstrate my findings through an analysis of three numerical examples, one based on cancer in remission data, one related to the China’s antibiotic abuse study, and a third related to teacher effectiveness in schools from a state of southwest US.

Correlation is common in many types of data, including those collected through longitudinal studies or in a hierarchical structure. In the case of clustering, or repeated measurements, there is inherent correlation between observations within the same group, or between observations obtained on the same subject. Longitudinal studies also introduce association…

Correlation is common in many types of data, including those collected through longitudinal studies or in a hierarchical structure. In the case of clustering, or repeated measurements, there is inherent correlation between observations within the same group, or between observations obtained on the same subject. Longitudinal studies also introduce association between the covariates and the outcomes across time. When multiple outcomes are of interest, association may exist between the various models. These correlations can lead to issues in model fitting and inference if not properly accounted for. This dissertation presents three papers discussing appropriate methods to properly consider different types of association. The first paper introduces an ANOVA based measure of intraclass correlation for three level hierarchical data with binary outcomes, and corresponding properties. This measure is useful for evaluating when the correlation due to clustering warrants a more complex model. This measure is used to investigate AIDS knowledge in a clustered study conducted in Bangladesh. The second paper develops the Partitioned generalized method of moments (Partitioned GMM) model for longitudinal studies. This model utilizes valid moment conditions to separately estimate the varying effects of each time-dependent covariate on the outcome over time using multiple coefficients. The model is fit to data from the National Longitudinal Study of Adolescent to Adult Health (Add Health) to investigate risk factors of childhood obesity. In the third paper, the Partitioned GMM model is extended to jointly estimate regression models for multiple outcomes of interest. Thus, this approach takes into account both the correlation between the multivariate outcomes, as well as the correlation due to time-dependency in longitudinal studies. The model utilizes an expanded weight matrix and objective function composed of valid moment conditions to simultaneously estimate optimal regression coefficients. This approach is applied to Add Health data to simultaneously study drivers of outcomes including smoking, social alcohol usage, and obesity in children.

In the presence of correlation, generalized linear models cannot be employed to obtain regression parameter estimates. To appropriately address the extravariation due to correlation, methods to estimate and model the additional variation are investigated. A general form of the mean-variance relationship is proposed which incorporates the canonical parameter. The two…

In the presence of correlation, generalized linear models cannot be employed to obtain regression parameter estimates. To appropriately address the extravariation due to correlation, methods to estimate and model the additional variation are investigated. A general form of the mean-variance relationship is proposed which incorporates the canonical parameter. The two variance parameters are estimated using generalized method of moments, negating the need for a distributional assumption. The mean-variance relation estimates are applied to clustered data and implemented in an adjusted generalized quasi-likelihood approach through an adjustment to the covariance matrix. In the presence of significant correlation in hierarchical structured data, the adjusted generalized quasi-likelihood model shows improved performance for random effect estimates. In addition, submodels to address deviation in skewness and kurtosis are provided to jointly model the mean, variance, skewness, and kurtosis. The additional models identify covariates influencing the third and fourth moments. A cutoff to trim the data is provided which improves parameter estimation and model fit. For each topic, findings are demonstrated through comprehensive simulation studies and numerical examples. Examples evaluated include data on children’s morbidity in the Philippines, adolescent health from the National Longitudinal Study of Adolescent to Adult Health, as well as proteomic assays for breast cancer screening.

Generalized Linear Models (GLMs) are widely used for modeling responses with non-normal error distributions. When the values of the covariates in such models are controllable, finding an optimal (or at least efficient) design could greatly facilitate the work of collecting and analyzing data. In fact, many theoretical results are obtained…

Generalized Linear Models (GLMs) are widely used for modeling responses with non-normal error distributions. When the values of the covariates in such models are controllable, finding an optimal (or at least efficient) design could greatly facilitate the work of collecting and analyzing data. In fact, many theoretical results are obtained on a case-by-case basis, while in other situations, researchers also rely heavily on computational tools for design selection.

Three topics are investigated in this dissertation with each one focusing on one type of GLMs. Topic I considers GLMs with factorial effects and one continuous covariate. Factors can have interactions among each other and there is no restriction on the possible values of the continuous covariate. The locally D-optimal design structures for such models are identified and results for obtaining smaller optimal designs using orthogonal arrays (OAs) are presented. Topic II considers GLMs with multiple covariates under the assumptions that all but one covariate are bounded within specified intervals and interaction effects among those bounded covariates may also exist. An explicit formula for D-optimal designs is derived and OA-based smaller D-optimal designs for models with one or two two-factor interactions are also constructed. Topic III considers multiple-covariate logistic models. All covariates are nonnegative and there is no interaction among them. Two types of D-optimal design structures are identified and their global D-optimality is proved using the celebrated equivalence theorem.

This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview ofcommon techniques in causal inference, with a focus on models relevant to the data explored

in later chapters. The rest of the dissertation focuses on the development…

This dissertation develops versatile modeling tools to estimate causal effects when conditional unconfoundedness is not immediately satisfied. Chapter 2 provides a brief overview ofcommon techniques in causal inference, with a focus on models relevant to the data explored

in later chapters. The rest of the dissertation focuses on the development of novel “reduced

form” models which are designed to assess the particular challenges of different datasets.

Chapter 3 explores the question of whether or not forecasts of bankruptcy cause bankruptcy.

The question arises from the observation that companies issued going concern opinions were

more likely to go bankrupt in the following year, leading people to speculate that the opinions themselves caused the bankruptcy via a “self-fulfilling prophecy”. A Bayesian machine

learning sensitivity analysis is developed to answer this question. In exchange for additional

flexibility and fewer assumptions, this approach loses point identification of causal effects

and thus a sensitivity analysis is developed to study a wide range of plausible scenarios of

the causal effect of going concern opinions on bankruptcy. Reported in the simulations are

different performance metrics of the model in comparison with other popular methods and a

robust analysis of the sensitivity of the model to mis-specification. Results on empirical data

indicate that forecasts of bankruptcies likely do have a small causal effect.

Chapter 4 studies the effects of vaccination on COVID-19 mortality at the state level in

the United States. The dynamic nature of the pandemic complicates more straightforward

regression adjustments and invalidates many alternative models. The chapter comments on

the limitations of mechanistic approaches as well as traditional statistical methods to epidemiological data. Instead, a state space model is developed that allows the study of the

ever-changing dynamics of the pandemic’s progression. In the first stage, the model decomposes the observed mortality data into component surges, and later uses this information in

a semi-parametric regression model for causal analysis. Results are investigated thoroughly

for empirical justification and stress-tested in simulated settings.

This dissertation covers several topics in machine learning and causal inference. First, the question of “feature selection,” a common byproduct of regularized machine learning methods, is investigated theoretically in the context of treatment effect estimation. This involves a detailed review and extension of frameworks for estimating causal effects and in-depth…

This dissertation covers several topics in machine learning and causal inference. First, the question of “feature selection,” a common byproduct of regularized machine learning methods, is investigated theoretically in the context of treatment effect estimation. This involves a detailed review and extension of frameworks for estimating causal effects and in-depth theoretical study. Next, various computational approaches to estimating causal effects with machine learning methods are compared with these theoretical desiderata in mind. Several improvements to current methods for causal machine learning are identified and compelling angles for further study are pinpointed. Finally, a common method used for “explaining” predictions of machine learning algorithms, SHAP, is evaluated critically through a statistical lens.

Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult not every infection can be recorded by public health authorities. Notably, this may happen with whole country measles case reports, even such countries with robust registration systems.…

Tracking disease cases is an essential task in public health; however, tracking the number of cases of a disease may be difficult not every infection can be recorded by public health authorities. Notably, this may happen with whole country measles case reports, even such countries with robust registration systems. Eilertson et al. (2019) propose using a state-space model combined with maximum likelihood methods for estimating measles transmission. A Bayesian approach that uses particle Markov Chain Monte Carlo (pMCMC) is proposed to estimate the parameters of the non-linear state-space model developed in Eilertson et al. (2019) and similar previous studies. This dissertation illustrates the performance of this approach by calculating posterior estimates of the model parameters and predictions of the unobserved states in simulations and case studies. Also, Iteration Filtering (IF2) is used as a support method to verify the Bayesian estimation and to inform the selection of prior distributions. In the second half of the thesis, a birth-death process is proposed to model the unobserved population size of a disease vector. This model studies the effect of a disease vector population size on a second affected population. The second population follows a non-homogenous Poisson process when conditioned on the vector process with a transition rate given by a scaled version of the vector population. The observation model also measures a potential threshold event when the host species population size surpasses a certain level yielding a higher transmission rate. A maximum likelihood procedure is developed for this model, which combines particle filtering with the Minorize-Maximization (MM) algorithm and extends the work of Crawford et al. (2014).

Spatial regression is one of the central topics in spatial statistics. Based on the goals, interpretation or prediction, spatial regression models can be classified into two categories, linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real world applications. New methods and models were…

Spatial regression is one of the central topics in spatial statistics. Based on the goals, interpretation or prediction, spatial regression models can be classified into two categories, linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real world applications. New methods and models were proposed to overcome the challenges in practice. There are three major parts in the dissertation.

In the first part, nonlinear regression models were embedded into a multistage workflow to predict the spatial abundance of reef fish species in the Gulf of Mexico. There were two challenges, zero-inflated data and out of sample prediction. The methods and models in the workflow could effectively handle the zero-inflated sampling data without strong assumptions. Three strategies were proposed to solve the out of sample prediction problem. The results and discussions showed that the nonlinear prediction had the advantages of high accuracy, low bias and well-performed in multi-resolution.

In the second part, a two-stage spatial regression model was proposed for analyzing soil carbon stock (SOC) data. In the first stage, there was a spatial linear mixed model that captured the linear and stationary effects. In the second stage, a generalized additive model was used to explain the nonlinear and nonstationary effects. The results illustrated that the two-stage model had good interpretability in understanding the effect of covariates, meanwhile, it kept high prediction accuracy which is competitive to the popular machine learning models, like, random forest, xgboost and support vector machine.

A new nonlinear regression model, Gaussian process BART (Bayesian additive regression tree), was proposed in the third part. Combining advantages in both BART and Gaussian process, the model could capture the nonlinear effects of both observed and latent covariates. To develop the model, first, the traditional BART was generalized to accommodate correlated errors. Then, the failure of likelihood based Markov chain Monte Carlo (MCMC) in parameter estimating was discussed. Based on the idea of analysis of variation, back comparing and tuning range, were proposed to tackle this failure. Finally, effectiveness of the new model was examined by experiments on both simulation and real data.

The Pearson and likelihood ratio statistics are well-known in goodness-of-fit testing and are commonly used for models applied to multinomial count data. When data are from a table formed by the cross-classification of a large number of variables, these goodness-of-fit statistics may have lower power and inaccurate Type I error…

The Pearson and likelihood ratio statistics are well-known in goodness-of-fit testing and are commonly used for models applied to multinomial count data. When data are from a table formed by the cross-classification of a large number of variables, these goodness-of-fit statistics may have lower power and inaccurate Type I error rate due to sparseness. Pearson's statistic can be decomposed into orthogonal components associated with the marginal distributions of observed variables, and an omnibus fit statistic can be obtained as a sum of these components. When the statistic is a sum of components for lower-order marginals, it has good performance for Type I error rate and statistical power even when applied to a sparse table. In this dissertation, goodness-of-fit statistics using orthogonal components based on second- third- and fourth-order marginals were examined. If lack-of-fit is present in higher-order marginals, then a test that incorporates the higher-order marginals may have a higher power than a test that incorporates only first- and/or second-order marginals. To this end, two new statistics based on the orthogonal components of Pearson's chi-square that incorporate third- and fourth-order marginals were developed, and the Type I error, empirical power, and asymptotic power under different sparseness conditions were investigated. Individual orthogonal components as test statistics to identify lack-of-fit were also studied. The performance of individual orthogonal components to other popular lack-of-fit statistics were also compared. When the number of manifest variables becomes larger than 20, most of the statistics based on marginal distributions have limitations in terms of computer resources and CPU time. Under this problem, when the number manifest variables is larger than or equal to 20, the performance of a bootstrap based method to obtain p-values for Pearson-Fisher statistic, fit to confirmatory dichotomous variable factor analysis model, and the performance of Tollenaar and Mooijaart (2003) statistic were investigated.