Matching Items (8)

Three essays on comparative simulation in three-level hierarchical data structure

Description

Though the likelihood is a useful tool for obtaining estimates of regression parameters, it is not readily available when fitting hierarchical binary data models. The correlated observations negate the opportunity to form a joint likelihood when fitting hierarchical logistic regression models. Inferences for the regression and covariance parameters, as well as the intraclass correlation coefficients, are usually obtained through the conditional likelihood. In those cases, I have resorted to the Laplace approximation and large sample theory for point and interval estimates such as Wald-type confidence intervals and profile likelihood confidence intervals. These methods rely on distributional assumptions and large sample theory; however, when dealing with small hierarchical datasets they often result in severe bias or non-convergence. I present a generalized quasi-likelihood approach and a generalized method of moments approach; neither relies on distributional assumptions, only on moments of the response. As an alternative to the typical large sample theory approach, I present bootstrapping of hierarchical logistic regression models, which provides more accurate interval estimates for small binary hierarchical data; these computational methods replace the traditional Wald-type and profile likelihood confidence intervals. I use a latent variable approach with a new split bootstrap method for estimating intraclass correlation coefficients when analyzing binary data obtained from a three-level hierarchical structure. It is especially useful with small sample sizes and is easily extended to more levels. Comparisons are made to existing approaches through both theoretical justification and simulation studies.
Further, I demonstrate my findings through an analysis of three numerical examples: one based on cancer-in-remission data, one related to China's antibiotic abuse study, and a third related to teacher effectiveness in schools in a state in the southwestern US.
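The cluster-level resampling idea behind such bootstrap intervals can be sketched as follows; the simulated school/class/student layout, the crude ANOVA-type ICC estimator, and all numbers are illustrative assumptions, not the dissertation's split bootstrap method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative three-level data: schools -> classes -> students (binary outcome).
n_schools, n_classes, n_students = 10, 4, 8
school_effect = rng.normal(0, 0.8, n_schools)
class_effect = rng.normal(0, 0.5, (n_schools, n_classes))
eta = school_effect[:, None, None] + class_effect[:, :, None]
y = rng.binomial(1, 1 / (1 + np.exp(-eta)), (n_schools, n_classes, n_students))

def icc_anova(y):
    """Crude ANOVA-type ICC at the school level (illustrative estimator)."""
    cluster_means = y.mean(axis=(1, 2))          # one mean per school
    between = cluster_means.var(ddof=1)
    within = y.var(axis=(1, 2), ddof=1).mean()
    m = y.shape[1] * y.shape[2]                  # observations per school
    sigma_b2 = max(between - within / m, 0.0)
    return sigma_b2 / (sigma_b2 + within)

# Cluster (school-level) bootstrap: resample whole schools with replacement.
B = 500
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n_schools, n_schools)
    boot[b] = icc_anova(y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"ICC estimate: {icc_anova(y):.3f}, 95% bootstrap CI: ({lo:.3f}, {hi:.3f})")
```

Resampling whole top-level clusters preserves the within-cluster correlation in each bootstrap replicate, which is why the percentile interval can outperform Wald-type intervals when the number of clusters is small.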

Date Created
2017

Locally D-optimal designs for generalized linear models

Description

Generalized Linear Models (GLMs) are widely used for modeling responses with non-normal error distributions. When the values of the covariates in such models are controllable, finding an optimal (or at least efficient) design can greatly facilitate the work of collecting and analyzing data. In practice, many theoretical results are obtained on a case-by-case basis, while in other situations researchers rely heavily on computational tools for design selection.

Three topics are investigated in this dissertation, each focusing on one type of GLM. Topic I considers GLMs with factorial effects and one continuous covariate. Factors may interact with one another, and there is no restriction on the possible values of the continuous covariate. The locally D-optimal design structures for such models are identified, and results for obtaining smaller optimal designs using orthogonal arrays (OAs) are presented. Topic II considers GLMs with multiple covariates under the assumptions that all but one of the covariates are bounded within specified intervals and that interaction effects among the bounded covariates may exist. An explicit formula for D-optimal designs is derived, and OA-based smaller D-optimal designs for models with one or two two-factor interactions are also constructed. Topic III considers multiple-covariate logistic models. All covariates are nonnegative and there is no interaction among them. Two types of D-optimal design structures are identified and their global D-optimality is proved using the celebrated equivalence theorem.
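For a simple logistic model, the local D-optimality criterion underlying these results, the determinant of the Fisher information evaluated at a guessed parameter value, can be sketched as below; the two candidate designs and the local parameter guess are hypothetical.

```python
import numpy as np

def logistic_information(X, beta):
    """Fisher information matrix of a logistic model at design matrix X.
    Rows of X are design points (intercept column included); beta is the
    locally assumed parameter value."""
    p = 1 / (1 + np.exp(-X @ beta))
    w = p * (1 - p)                      # GLM weights for the logit link
    return X.T @ (w[:, None] * X)

def d_criterion(X, beta):
    return np.linalg.det(logistic_information(X, beta))

beta = np.array([0.5, 1.0])              # hypothetical local parameter guess

# Candidate A spreads points toward the tails; candidate B clusters them.
XA = np.column_stack([np.ones(4), np.array([-2.0, -1.0, 1.0, 2.0])])
XB = np.column_stack([np.ones(4), np.array([-0.5, -0.2, 0.2, 0.5])])

dA, dB = d_criterion(XA, beta), d_criterion(XB, beta)
print("D-criterion, design A:", dA)
print("D-criterion, design B:", dB)
```

A locally D-optimal design maximizes this determinant at the assumed beta; because the information depends on beta through the weights, the optimal design changes with the local guess, which is what makes these problems case-by-case.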

Date Created
2018

An information based optimal subdata selection algorithm for big data linear regression and a suitable variable selection algorithm

Description

This article proposes a new information-based optimal subdata selection (IBOSS) algorithm, the Squared Scaled Distance Algorithm (SSDA). It is based on the invariance of the determinant of the information matrix under orthogonal transformations, especially rotations. Extensive simulation results show that the new IBOSS algorithm retains the nice asymptotic properties of IBOSS and gives a larger determinant of the subdata information matrix. It has the same order of time complexity as the D-optimal IBOSS algorithm; however, it exploits the advantages of vectorized calculation, avoiding explicit loops, and is approximately six times as fast as the D-optimal IBOSS algorithm in R. The robustness of SSDA is studied from three aspects: nonorthogonality, inclusion of interaction terms, and variable misspecification. A new, accurate variable selection algorithm is proposed to aid the implementation of IBOSS algorithms when a large number of variables are present with only sparse important variables among them. By aggregating random-subsample results, this variable selection algorithm is much more accurate than the LASSO method applied to the full data. Since its time complexity depends on the number of variables only, it is also very computationally efficient when the number of variables is fixed as n increases and is not massively large. More importantly, by using subsamples it solves the problem that the full data cannot be stored in memory when a data set is too large.
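For context, the baseline D-optimal IBOSS rule (keep the most extreme rows of each covariate) might be sketched roughly as follows; this is an illustrative reconstruction, not the SSDA algorithm or the article's code.

```python
import numpy as np

def iboss_doptimal(X, k):
    """D-optimal IBOSS sketch: for each of the p covariates, keep the
    k/(2p) rows with the smallest and the largest values (rows selected
    more than once are kept once). Illustrative re-implementation only."""
    n, p = X.shape
    r = k // (2 * p)
    keep = set()
    for j in range(p):
        order = np.argpartition(X[:, j], (r - 1, n - r))
        keep.update(order[:r].tolist())        # r smallest on covariate j
        keep.update(order[n - r:].tolist())    # r largest on covariate j
    return np.fromiter(keep, dtype=int)

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 5))
idx = iboss_doptimal(X, k=1000)

# The subdata's information matrix has a far larger determinant than an
# equal-sized simple random subsample, since extreme rows carry the most
# information about regression slopes.
sub = np.column_stack([np.ones(len(idx)), X[idx]])
ran_idx = rng.choice(len(X), size=len(idx), replace=False)
ran = np.column_stack([np.ones(len(idx)), X[ran_idx]])
print("IBOSS beats random subsample:",
      np.linalg.det(sub.T @ sub) > np.linalg.det(ran.T @ ran))
```

Each covariate's pass is an `argpartition`, so the cost is O(np) regardless of the subdata size, which is the time-complexity order the abstract refers to.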

Date Created
2017

A power study of GFfit statistics as components of Pearson chi-square

Description

The Pearson and likelihood ratio statistics are commonly used to test goodness-of-fit for models applied to data from a multinomial distribution. When the data come from a table formed by cross-classification of a large number of variables, these statistics may have low power and an inaccurate Type I error level due to sparseness in the cells of the table. The GFfit statistic can be used to examine model fit in subtables. A new version of the GFfit statistic, based on orthogonal components of the Pearson chi-square, is proposed as a diagnostic for examining model fit on two-way subtables. However, with variables that have a large number of categories and a small sample size, even the GFfit statistic may have low power and an inaccurate Type I error level due to sparseness in the two-way subtable. In this dissertation, the theoretical power and empirical power of the GFfit statistic are studied. A method based on subsets of orthogonal components of the GFfit statistic on the subtables is developed to improve its performance. Simulation results for power and Type I error rate in several different cases, along with comparisons to other diagnostics, are presented.
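The raw ingredients of such diagnostics, the Pearson chi-square on a two-way subtable and its per-cell components, can be sketched as below; the counts are made up, and the orthogonal components used by GFfit require the model's estimated covariance structure, which is not reproduced here.

```python
import numpy as np

# Observed two-way subtable (illustrative counts).
obs = np.array([[30, 10, 5],
                [12, 25, 8],
                [ 6,  9, 20]], dtype=float)

n = obs.sum()
# Expected counts under independence (row total * column total / n).
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / n

# Pearson chi-square as a sum of per-cell components; GFfit instead sums
# a subset of *orthogonal* components, which are asymptotically
# independent chi-square(1) pieces of this same total.
components = (obs - expected) ** 2 / expected
X2 = components.sum()
df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
print(f"Pearson X^2 = {X2:.2f} on {df} df")
```

The appeal of orthogonal components is that sparse cells contaminate only the components they enter, so a subset of components can retain power and Type I error accuracy that the overall statistic loses.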

Date Created
2017

Spatial Mortality Modeling in Actuarial Science

Description

Modeling human survivorship is a core area of research within the actuarial community. With life insurance policies and annuity products as dominant financial instruments which depend on future mortality rates, there is a risk that observed human mortality experiences will differ from projected when they are sold. From an insurer's portfolio perspective, to curb this risk, it is imperative that models of human survivorship are constantly being updated and equipped to accurately gauge and forecast mortality rates. At present, the majority of actuarial research in mortality modeling involves factor-based approaches which operate at a global scale, placing little attention on the determinants and interpretable risk factors of mortality, specifically from a spatial perspective. With an abundance of research being performed in the field of spatial statistics and greater accessibility to localized mortality data, there is a clear opportunity to extend the existing body of mortality literature towards the spatial domain. It is the objective of this dissertation to introduce these new statistical approaches to equip the field of actuarial science to include geographic space into the mortality modeling context.

First, this dissertation evaluates the underlying spatial patterns of mortality across the United States, and introduces a spatial filtering methodology to generate latent spatial patterns which capture the essence of these mortality rates in space. Second, local modeling techniques are illustrated, and a multiscale geographically weighted regression (MGWR) model is generated to describe the variation of mortality rates across space in an interpretable manner which allows for the investigation of the presence of spatial variability in the determinants of mortality. Third, techniques for updating traditional mortality models are introduced, culminating in the development of a model which addresses the relationship between space, economic growth, and mortality. It is through these applications that this dissertation demonstrates the utility in updating actuarial mortality models from a spatial perspective.
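The core computation of a geographically weighted regression, a kernel-weighted least squares fit at each location, can be sketched as follows; the synthetic data, Gaussian kernel, and bandwidth are illustrative, and MGWR further allows a separate bandwidth per covariate.

```python
import numpy as np

def gwr_fit_at(u, coords, X, y, bandwidth):
    """GWR at one location u: weight each observation by a Gaussian kernel
    of its distance to u, then solve weighted least squares."""
    d = np.linalg.norm(coords - u, axis=1)
    w = np.exp(-0.5 * (d / bandwidth) ** 2)
    Xw = X * w[:, None]
    # Weighted normal equations: (X'WX) beta = X'Wy.
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

rng = np.random.default_rng(2)
coords = rng.uniform(0, 10, (200, 2))
x = rng.normal(size=200)
# Spatially varying effect: the covariate's slope drifts from west to east.
slope = 0.5 + 0.2 * coords[:, 0]
y = 1.0 + slope * x + rng.normal(0, 0.1, 200)
X = np.column_stack([np.ones(200), x])

west = gwr_fit_at(np.array([1.0, 5.0]), coords, X, y, bandwidth=2.0)
east = gwr_fit_at(np.array([9.0, 5.0]), coords, X, y, bandwidth=2.0)
print("local slope (west):", round(west[1], 2), " (east):", round(east[1], 2))
```

Repeating this fit at every location yields a surface of local coefficients, which is what makes the spatial variability of mortality determinants directly interpretable.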

Date Created
2020

Essays on the Modeling of Binary Longitudinal Data with Time-dependent Covariates

Description

Longitudinal studies contain correlated data due to the repeated measurements on the same subject. The changing values of the time-dependent covariates and their association with the outcomes present another source of correlation. Most methods used to analyze longitudinal data average the effects of time-dependent covariates on outcomes over time and provide a single regression coefficient per time-dependent covariate. This denies researchers the opportunity to follow the changing impact of time-dependent covariates on the outcomes. This dissertation addresses this issue through the use of partitioned regression coefficients in three different papers.

In the first paper, an alternative approach to the partitioned Generalized Method of Moments logistic regression model for longitudinal binary outcomes is presented. This method relies on Bayes estimators and is utilized when the partitioned Generalized Method of Moments model provides numerically unstable estimates of the regression coefficients. It is used to model obesity status in the Add Health study and cognitive impairment diagnosis in the National Alzheimer’s Coordination Center database.

The second paper develops a model that allows the joint modeling of two or more binary outcomes that provide an overall measure of a subject's trait over time. The simultaneous modeling of all outcomes provides a complete picture of the overall measure of interest. This approach accounts for the correlation among and between the outcomes across time and the changing effects of time-dependent covariates on the outcomes. The model is used to analyze four outcomes measuring overall quality of life in the Chinese Longitudinal Healthy Longevity Study.

The third paper presents an approach that allows for estimation of cross-sectional and lagged effects of the covariates on the outcome as well as the feedback of the response on future covariates. This is done in two parts: in part 1, the effects of time-dependent covariates on the outcomes are estimated; in part 2, the influence of the outcome on future values of the covariates is measured. These model parameters are obtained through a Generalized Method of Moments procedure that uses valid moment conditions between the outcome and the covariates. Child morbidity in the Philippines and obesity status in the Add Health data are analyzed.
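The Generalized Method of Moments machinery used throughout can be sketched for a simple linear model with instrument-style moment conditions; the data-generating process and moment choices below are illustrative only, not the partitioned GMM of these papers.

```python
import numpy as np

def gmm_linear(y, X, Z):
    """Two-step GMM for a linear model using moment conditions
    E[z_i (y_i - x_i' beta)] = 0. In a partitioned GMM, Z would stack
    only the *valid* outcome/covariate moment pairs at each time point."""
    n = len(y)
    ZX, Zy = Z.T @ X, Z.T @ y
    # Step 1: identity weight matrix.
    b1 = np.linalg.solve(ZX.T @ ZX, ZX.T @ Zy)
    # Step 2: re-weight by the inverse covariance of the sample moments.
    g = Z * (y - X @ b1)[:, None]
    W = np.linalg.inv(g.T @ g / n)
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

rng = np.random.default_rng(3)
n = 2000
z = rng.normal(size=(n, 3))
x = z @ np.array([0.6, 0.3, 0.1]) + rng.normal(0, 0.5, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1.0, n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

est = gmm_linear(y, X, Z)
print("GMM estimate of (intercept, slope):", np.round(est, 2))
```

The key point for the partitioned setting is that GMM needs no likelihood, only moment conditions, so invalid moments (e.g. feedback from the outcome to future covariates) can simply be dropped from Z rather than modeled.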

Date Created
2020

Locally Optimal Experimental Designs for Mixed Responses Models

Description

Bivariate responses that comprise mixtures of binary and continuous variables are common in medical, engineering, and other scientific fields. Many works exist concerning the analysis of such mixed data; however, research on optimal designs for this type of experiment is still scarce. The joint mixed responses model considered here combines an ordinary linear model for the continuous response with a generalized linear model for the binary response. Using the complete class approach, tighter upper bounds on the number of support points required for finding locally optimal designs are derived for the mixed responses models studied in this work.

In the first part of this dissertation, a theoretical result was developed to facilitate the search for locally symmetric optimal designs for mixed responses models with one continuous covariate. The study was then extended to mixed responses models that include group effects. Two types of mixed responses models with group effects were investigated: the first type has no common parameters across subject groups, while the second type allows some common parameters (e.g., a common slope) across groups. In addition to complete class results, an efficient algorithm (PSO-FM) was proposed to search for the A- and D-optimal designs. Finally, the first-order mixed responses model was extended to a type of quadratic mixed responses model, with a quadratic polynomial predictor placed in its linear model.
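The swarm-search half of an algorithm like PSO-FM can be sketched with a bare-bones global-best particle swarm optimizer on a toy objective; the inertia and acceleration constants below are conventional textbook choices, not those used in the dissertation, and the quadratic objective merely stands in for a (negative) design criterion.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, iters=200, seed=0):
    """Bare-bones global-best particle swarm optimizer. PSO-FM couples
    such a swarm search with a further refinement step; only the generic
    swarm iteration is sketched here."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.apply_along_axis(f, 1, pos)
    g = pbest[np.argmin(pbest_val)].copy()       # global best so far
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        # Inertia + pull toward personal best + pull toward global best.
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos)
        pos = pos + vel
        val = np.apply_along_axis(f, 1, pos)
        improved = val < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], val[improved]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, f(g)

# Toy objective standing in for a negative design criterion.
best, best_val = pso_minimize(
    lambda v: np.sum((v - np.array([1.0, -2.0])) ** 2), dim=2)
print("best point:", np.round(best, 3), "value:", round(best_val, 6))
```

Because design criteria such as the D-criterion are typically non-convex in the support points, derivative-free searches like PSO are a common practical complement to the complete class bounds, which shrink the search space first.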

Date Created
2020

Spatial Regression and Gaussian Process BART

Description

Spatial regression is one of the central topics in spatial statistics. Depending on the goal, interpretation or prediction, spatial regression models can be classified into two categories: linear mixed regression models and nonlinear regression models. This dissertation explored these models and their real-world applications, and new methods and models were proposed to overcome challenges encountered in practice. There are three major parts in the dissertation.

In the first part, nonlinear regression models were embedded into a multistage workflow to predict the spatial abundance of reef fish species in the Gulf of Mexico. There were two challenges: zero-inflated data and out-of-sample prediction. The methods and models in the workflow could effectively handle the zero-inflated sampling data without strong assumptions, and three strategies were proposed to solve the out-of-sample prediction problem. The results and discussion showed that the nonlinear prediction had the advantages of high accuracy, low bias, and good performance across multiple resolutions.
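A minimal delta (two-part) estimate of mean abundance under zero inflation, a crude stand-in for the workflow's nonlinear models, might look like the following; the habitat groups, presence probabilities, and lognormal catches are entirely synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic zero-inflated survey: many hauls catch nothing; positive
# abundance is observed only when the species is present.
habitat = rng.integers(0, 3, 1000)
presence_prob = np.array([0.2, 0.5, 0.8])[habitat]
present = rng.random(1000) < presence_prob
abundance = np.where(
    present, rng.lognormal(np.log(10) + 0.3 * habitat, 0.5), 0.0)

# Delta (two-part) estimate: P(present) * E[abundance | present], here with
# simple habitat-level means; a real workflow would fit a flexible model
# (with covariates) for each part separately.
preds = []
for h in range(3):
    mask = habitat == h
    p_hat = present[mask].mean()
    mu_hat = abundance[mask][present[mask]].mean()
    preds.append(p_hat * mu_hat)
    print(f"habitat {h}: predicted mean abundance = {p_hat * mu_hat:.2f}")
```

Splitting occurrence from positive abundance avoids forcing one distribution to explain both the excess zeros and the skewed catches, which is the "no strong assumptions" advantage the abstract points to.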

In the second part, a two-stage spatial regression model was proposed for analyzing soil organic carbon (SOC) stock data. In the first stage, a spatial linear mixed model captured the linear and stationary effects; in the second stage, a generalized additive model was used to explain the nonlinear and nonstationary effects. The results illustrated that the two-stage model had good interpretability for understanding the effects of covariates while maintaining prediction accuracy competitive with popular machine learning models such as random forest, XGBoost, and support vector machines.
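The two-stage idea, a linear fit for the stationary effects followed by a flexible smoother on the residuals, can be sketched as below; a k-nearest-neighbour smoother stands in for the GAM, and the data are entirely synthetic.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
coords = rng.uniform(0, 10, (n, 2))
x = rng.normal(size=n)
# Linear covariate effect plus a smooth nonstationary spatial surface.
surface = np.sin(coords[:, 0]) * np.cos(coords[:, 1])
y = 1.0 + 2.0 * x + surface + rng.normal(0, 0.1, n)

# Stage 1: linear fit captures the stationary covariate effect.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Stage 2: crude k-nearest-neighbour smoother over the residuals stands in
# for the generalized additive model of the dissertation's second stage.
def knn_smooth(u, k=25):
    d = np.linalg.norm(coords - u, axis=1)
    return resid[np.argsort(d)[:k]].mean()

u = np.array([3.0, 3.0])
print("true surface at u:", round(np.sin(3.0) * np.cos(3.0), 3),
      " smoothed residual:", round(knn_smooth(u), 3))
```

The split is what buys interpretability: the stage-1 coefficients read like an ordinary regression, while all the hard-to-interpret nonstationarity is quarantined in the stage-2 surface.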

A new nonlinear regression model, Gaussian process BART (Bayesian additive regression trees), was proposed in the third part. Combining the advantages of both BART and Gaussian processes, the model can capture the nonlinear effects of both observed and latent covariates. To develop the model, the traditional BART was first generalized to accommodate correlated errors. Then, the failure of likelihood-based Markov chain Monte Carlo (MCMC) in parameter estimation was discussed. Based on the idea of analysis of variation, two remedies, back comparing and tuning range, were proposed to tackle this failure. Finally, the effectiveness of the new model was examined through experiments on both simulated and real data.
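The first step above, generalizing a mean model to correlated errors, amounts to whitening both sides by the Cholesky factor of the error covariance; a sketch with a linear mean model standing in for BART's sum of trees (the kernel lengthscale and nugget below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300
s = rng.uniform(0, 10, (n, 1))                # 1-D "spatial" locations

# Squared-exponential GP kernel plus a nugget: nearby locations share
# correlated errors. Lengthscale 1 and nugget 0.1 are illustrative.
Sigma = np.exp(-0.5 * (s - s.T) ** 2) + 0.1 * np.eye(n)
L = np.linalg.cholesky(Sigma)

x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 0.2 * (L @ rng.normal(size=n))   # GP-correlated errors

# Whiten with the Cholesky factor, then fit the mean model on whitened
# data (generalized least squares). In Gaussian process BART, a sum of
# trees would play the role of this linear mean model.
X = np.column_stack([np.ones(n), x])
Xw, yw = np.linalg.solve(L, X), np.linalg.solve(L, y)
beta_gls = np.linalg.lstsq(Xw, yw, rcond=None)[0]
print("GLS estimate of (intercept, slope):", np.round(beta_gls, 2))
```

After whitening, the transformed errors are independent, so standard fitting machinery (least squares here, tree updates in BART) applies unchanged; the difficulty the abstract discusses arises because the kernel parameters must be estimated jointly with the mean model.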

Date Created
2020