Matching Items (36)

Locally Optimal Experimental Designs for Mixed Responses Models

Description

Bivariate responses that comprise mixtures of binary and continuous variables are common in medical, engineering, and other scientific fields. Many works address the analysis of such mixed data, but research on optimal designs for this type of experiment is still scarce. The joint mixed responses model considered here combines an ordinary linear model for the continuous response with a generalized linear model for the binary response. Using the complete class approach, tighter upper bounds on the number of support points required for finding locally optimal designs are derived for the mixed responses models studied in this work.

In the first part of this dissertation, a theoretical result was developed to facilitate the search for locally symmetric optimal designs for mixed responses models with one continuous covariate. The study was then extended to mixed responses models that include group effects. Two types of such models were investigated: the first has no common parameters across subject groups, while the second allows some common parameters (e.g., a common slope) across groups. In addition to complete class results, an efficient algorithm (PSO-FM) was proposed to search for A- and D-optimal designs. Finally, the first-order mixed responses model was extended to a type of quadratic mixed responses model with a quadratic polynomial predictor in its linear-model component.
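
To make the setting concrete, the sketch below evaluates a local D-criterion for a candidate design under a hypothetical first-order mixed responses model with one covariate, using a normal linear model for the continuous response and a logistic model for the binary response. The block-diagonal information, the logit link, and the nominal parameter values are illustrative assumptions, not the dissertation's exact formulation.

    # Minimal sketch: score a candidate design for a hypothetical mixed
    # responses model with one covariate x (illustrative assumptions only):
    #   continuous response:  Y | x ~ Normal(beta0 + beta1 * x, sigma^2)
    #   binary response:      P(Z = 1 | x) = logistic(gamma0 + gamma1 * x)
    # Assuming the two responses contribute additively to a block-diagonal
    # Fisher information, the local D-criterion of a design (points, weights)
    # can be computed as follows.
    import numpy as np

    def local_information(points, weights, gamma=(-0.5, 1.5), sigma=1.0):
        f = np.column_stack([np.ones_like(points), points])   # regressors (1, x)
        eta = gamma[0] + gamma[1] * points
        w = np.exp(eta) / (1.0 + np.exp(eta)) ** 2             # GLM weights p(1 - p)
        info_lin = (f * weights[:, None]).T @ f / sigma**2     # linear-model block
        info_glm = (f * (weights * w)[:, None]).T @ f          # logistic block
        return info_lin, info_glm

    def log_D_criterion(points, weights, **kw):
        info_lin, info_glm = local_information(points, weights, **kw)
        # Block-diagonal joint information => log-determinants add across blocks.
        return np.linalg.slogdet(info_lin)[1] + np.linalg.slogdet(info_glm)[1]

    points = np.array([-1.0, 0.0, 1.0])
    weights = np.array([0.4, 0.2, 0.4])
    print(log_D_criterion(points, weights))

A search routine such as the PSO-FM algorithm mentioned above would optimize a criterion of this kind over both the support points and the weights.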

Date Created
  • 2020

A continuous latent factor model for non-ignorable missing data in longitudinal studies

Description

Many longitudinal studies, especially clinical trials, suffer from missing data. Most estimation procedures assume that the missing values are ignorable or missing at random (MAR). However, this assumption is an unrealistic simplification and is implausible in many cases. For example, suppose an investigator is examining the effect of a treatment on depression: subjects meet with doctors on a regular basis and answer questions about recent emotional situations, and patients who are experiencing severe depression are more likely to miss an appointment, leaving the data missing for that particular visit. When the missing mechanism is related to the unobserved responses in this way, data that are not missing at random can bias the results if the mechanism is not taken into account. Data are said to be non-ignorably missing if the probabilities of missingness depend on quantities that might not be included in the model.

Classical pattern-mixture models for non-ignorable missing values are widely used for longitudinal data analysis because they do not require explicit specification of the missing mechanism: the data are stratified according to a variety of missing patterns and a model is specified for each stratum. However, this usually results in under-identifiability, because many stratum-specific parameters must be estimated even though the eventual interest is usually in the marginal parameters, and pattern-mixture models typically require a large sample.

In this thesis, two studies are presented. The first study is motivated by an open problem from pattern-mixture models. Its simulation studies show that the information in the missing data indicators can be well summarized by a simple continuous latent structure, indicating that a large number of missing data patterns may be accounted for by a simple latent factor. These findings lead to a novel model, the continuous latent factor model (CLFM). The second study develops the CLFM, which is used to model the joint distribution of the missing values and the longitudinal outcomes and is feasible even for small-sample applications. The detailed estimation theory, including estimation techniques from both frequentist and Bayesian perspectives, is presented.

Model performance is evaluated through designed simulations and three applications. The simulation and application settings range from a correctly specified missing data mechanism to a mis-specified one and include different sample sizes of longitudinal studies. Among the three applications, an AIDS study includes non-ignorable missing values; the Peabody Picture Vocabulary Test data give no indication of the missing data mechanism and are used for a sensitivity analysis; and the Growth of Language and Early Literacy Skills in Preschoolers with Developmental Speech and Language Impairment study has complete data and is used for a robustness analysis. The CLFM provides more precise estimators, particularly for the intercept- and slope-related parameters, than Roy's latent class model and the classic linear mixed model. This advantage is more pronounced for small sample sizes, where Roy's model has difficulty with estimation convergence. The CLFM is also shown to be robust when missing data are ignorable, as demonstrated through the Growth of Language and Early Literacy Skills in Preschoolers study.
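
To make the idea concrete, one plausible shared-latent-factor formulation, written here only as an illustration and not as the dissertation's exact specification, links the longitudinal outcome y_ij of subject i at visit j and the missingness indicator r_ij through a single continuous factor u_i:

    u_i \sim N(0, 1), \qquad
    y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + \lambda_y u_i + \varepsilon_{ij},
    \qquad \varepsilon_{ij} \sim N(0, \sigma^2),
    \qquad \operatorname{logit} P(r_{ij} = 1 \mid u_i) = \mathbf{z}_{ij}^{\top}\boldsymbol{\gamma} + \lambda_r u_i,

where r_ij = 1 flags a missed visit. Because the same u_i appears in both equations, the probability of missingness is tied to the latent quantity that also drives the unobserved responses, which is the non-ignorable feature the CLFM is designed to capture.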

Date Created
  • 2013

Essays on the identification and modeling of variance

Description

In the presence of correlation, generalized linear models cannot be employed to obtain regression parameter estimates. To appropriately address the extra variation due to correlation, methods to estimate and model the additional variation are investigated. A general form of the mean-variance relationship that incorporates the canonical parameter is proposed, and its two variance parameters are estimated using the generalized method of moments, negating the need for a distributional assumption. The mean-variance relation estimates are applied to clustered data and implemented in an adjusted generalized quasi-likelihood approach through an adjustment to the covariance matrix. In the presence of significant correlation in hierarchically structured data, the adjusted generalized quasi-likelihood model shows improved performance for random effect estimates. In addition, submodels addressing deviations in skewness and kurtosis are provided to jointly model the mean, variance, skewness, and kurtosis; these additional models identify covariates influencing the third and fourth moments. A cutoff for trimming the data is also provided, which improves parameter estimation and model fit. For each topic, findings are demonstrated through comprehensive simulation studies and numerical examples, including data on children’s morbidity in the Philippines, adolescent health data from the National Longitudinal Study of Adolescent to Adult Health, and proteomic assays for breast cancer screening.
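
The dissertation's mean-variance relation involves the canonical parameter; the sketch below instead uses a simpler power-type relation Var(Y) ≈ φ μ^p purely as a stand-in, to show how two variance parameters can be estimated by matching moments without a distributional assumption. This is a basic moment-matching device, not the generalized method of moments estimator developed in this work.

    # Illustrative stand-in only: estimate two variance parameters (phi, p) in a
    # power-type mean-variance relation Var(Y) ~= phi * mu**p by matching squared
    # residuals to the modeled variance. The dissertation's proposed relation and
    # its GMM estimator differ in detail.
    import numpy as np
    from scipy.optimize import least_squares

    def fit_mean_variance(mu_hat, resid):
        """mu_hat: fitted means; resid: raw residuals y - mu_hat."""
        sq = resid**2
        def moment_gap(theta):
            phi, p = theta
            return sq - phi * mu_hat**p            # E[(Y - mu)^2] - phi * mu^p
        out = least_squares(moment_gap, x0=[1.0, 1.0],
                            bounds=([1e-8, 0.0], [np.inf, 3.0]))
        return out.x

    rng = np.random.default_rng(1)
    mu = rng.uniform(1.0, 10.0, size=500)
    y = rng.poisson(mu) + rng.normal(scale=0.5 * mu, size=mu.size)   # overdispersed counts
    phi_hat, p_hat = fit_mean_variance(mu, y - mu)
    print(phi_hat, p_hat)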

Date Created
  • 2018

A study of components of Pearson's chi-square based on marginal distributions of cross-classified tables for binary variables

Description

The Pearson and likelihood ratio statistics are well known in goodness-of-fit testing and are commonly used for models applied to multinomial count data. When data come from a table formed by the cross-classification of a large number of variables, these goodness-of-fit statistics may have low power and an inaccurate Type I error rate due to sparseness. Pearson's statistic can be decomposed into orthogonal components associated with the marginal distributions of the observed variables, and an omnibus fit statistic can be obtained as a sum of these components. When the statistic is a sum of components for lower-order marginals, it has good Type I error rate and statistical power even when applied to a sparse table. In this dissertation, goodness-of-fit statistics using orthogonal components based on second-, third-, and fourth-order marginals were examined. If lack of fit is present in higher-order marginals, then a test that incorporates them may have higher power than a test that incorporates only first- and/or second-order marginals. To this end, two new statistics based on the orthogonal components of Pearson's chi-square that incorporate third- and fourth-order marginals were developed, and their Type I error, empirical power, and asymptotic power under different sparseness conditions were investigated. Individual orthogonal components as test statistics for identifying lack of fit were also studied, and their performance was compared with that of other popular lack-of-fit statistics. When the number of manifest variables grows beyond 20, most statistics based on marginal distributions become limited by computer resources and CPU time. For this situation, with 20 or more manifest variables, the performance of a bootstrap-based method for obtaining p-values for the Pearson-Fisher statistic, applied to a confirmatory factor analysis model for dichotomous variables, and the performance of the Tollenaar and Mooijaart (2003) statistic were investigated.
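
As a simplified illustration of why marginal-based statistics help with sparse tables (this is not the orthogonal-component construction studied in the dissertation), the sketch below contrasts the full-table Pearson chi-square with a statistic built only from second-order marginal residuals under a fitted independence model. A proper limited-information statistic would also weight these residuals by their covariance matrix and adjust for parameter estimation.

    # Simplified illustration only (not the dissertation's orthogonal components):
    # contrast the full-table Pearson X^2 with a statistic built from second-order
    # marginal residuals under a fitted independence model for k binary items.
    import numpy as np
    from itertools import combinations, product

    rng = np.random.default_rng(2)
    k, n = 6, 200
    data = rng.binomial(1, 0.3, size=(n, k))          # n observations of k binary items
    p_hat = data.mean(axis=0)                          # independence-model MLEs

    # Full-table Pearson X^2 over all 2^k cells: sparse when 2^k is large vs. n.
    cells = np.array(list(product([0, 1], repeat=k)))
    observed = np.array([np.all(data == c, axis=1).sum() for c in cells])
    expected = n * np.prod(np.where(cells == 1, p_hat, 1 - p_hat), axis=1)
    X2_full = np.sum((observed - expected) ** 2 / expected)

    # Second-order marginal residuals: the fitted independence model reproduces
    # the first-order margins exactly, so lack of fit first shows up here.
    resid = []
    for a, b in combinations(range(k), 2):
        o = np.sum((data[:, a] == 1) & (data[:, b] == 1))
        e = n * p_hat[a] * p_hat[b]
        resid.append((o - e) / np.sqrt(e))
    X2_margins = np.sum(np.square(resid))
    print(X2_full, X2_margins)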

Date Created
  • 2018

Students’ Meanings for Stochastic Process While Developing a Conception of Distribution

Description

The concept of distribution is one of the core ideas of probability theory and inferential statistics, if not the core idea. Many introductory statistics textbooks pay lip service to stochastic/random processes, but how do students think about these processes? This study explored what understandings of stochastic process students develop as they work through materials intended to support them in constructing the long-run-behavior meaning for distribution.

I collected data in three phases. First, I conducted a set of task-based clinical interviews that allowed me to build initial models of the students’ meanings for randomness and probability. Second, I worked with Bonnie in an exploratory teaching setting through three sets of activities to see what meanings she would develop for randomness and stochastic process. In the final phase, I worked with Danielle as she completed the same activities as Bonnie, but this time in a teaching experiment setting where I used a series of interventions to test how Danielle was thinking about stochastic processes.

My analysis shows that students can be aware that the word “random” lives in two worlds and thereby carries conflicting meanings. Bonnie’s meaning for randomness evolved over the course of the study from an unproductive meaning centered on the emotions of the characters in the context to a meaning that randomness is the lack of a pattern. Bonnie’s lack-of-pattern meaning for randomness subsequently underpinned her image of stochastic processes, leading her to engage in pattern-hunting behavior every time she needed to classify a process as stochastic or not. Danielle’s image of a stochastic process was grounded in whether she saw the repetition as reproducible (the process can be repeated, and the outcomes are identical to those from a prior run through the process) or replicable (the process can be repeated, but the outcomes are not in the same order as before). Danielle employed a strategy of carrying out several trials of the process, resetting the applet, and then carrying out the process again, making replicability central to her thinking.

Date Created
  • 2019

Maximin designs for event-related fMRI with uncertain error correlation

Description

One of the premier technologies for studying human brain functions is event-related functional magnetic resonance imaging (fMRI). The main design issue for such experiments is to find the optimal sequence of mental stimuli. This optimal design sequence allows for collecting informative data to make precise statistical inferences about the inner workings of the brain. Unfortunately, this is not an easy task, especially when the error correlation of the response is unknown at the design stage. In the literature, the maximin approach was proposed to tackle this problem. However, it is an expensive and time-consuming method, especially when the correlated noise follows a high-order autoregressive model. The main focus of this dissertation is to develop an efficient approach that reduces the computational resources needed to obtain A-optimal designs for event-related fMRI experiments. One proposed idea is to combine the Kriging approximation method, which is widely used in spatial statistics and computer experiments, with a knowledge-based genetic algorithm. Case studies demonstrate that the new search method achieves design efficiencies similar to those attained by the traditional method while giving a significant reduction in computing time. Another strategy is also proposed for finding such designs by considering only the boundary points of the parameter space of the correlation parameters; its usefulness is likewise demonstrated via case studies. The first part of this dissertation focuses on finding optimal event-related designs for fMRI with simple trials, when each stimulus consists of only one component (e.g., a picture). The study is then extended to the case of compound trials, when stimuli with multiple components (e.g., a cue followed by a picture) are considered.
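
To show what the design criterion looks like in a stripped-down form, the sketch below scores candidate binary stimulus sequences by the A-criterion of a two-parameter linear model whose regressor is the sequence convolved with a fixed haemodynamic response shape, under an assumed AR(1) error correlation. The HRF values, the single-amplitude parameterization, and the random search are illustrative assumptions, not the dissertation's Kriging-assisted genetic algorithm.

    # Minimal sketch: evaluate the A-criterion of a candidate event-related design
    # under an assumed AR(1) error correlation (illustrative model and HRF only).
    import numpy as np

    def design_matrix(sequence, hrf):
        """Convolve a 0/1 stimulus indicator sequence with a discretized HRF."""
        onsets = np.asarray(sequence, dtype=float)
        X = np.convolve(onsets, hrf)[: len(onsets)]
        return np.column_stack([np.ones(len(onsets)), X])   # intercept + convolved regressor

    def A_criterion(sequence, rho=0.3, hrf=(0.0, 0.4, 1.0, 0.6, 0.2)):
        X = design_matrix(sequence, np.asarray(hrf))
        n = X.shape[0]
        # AR(1) correlation matrix for the noise; rho would be unknown in practice,
        # which is what motivates the maximin treatment over a range of rho.
        V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
        M = X.T @ np.linalg.solve(V, X)                      # information matrix
        return np.trace(np.linalg.inv(M))                    # smaller is better

    rng = np.random.default_rng(3)
    candidates = [rng.binomial(1, 0.5, size=60) for _ in range(100)]
    best = min(candidates, key=A_criterion)
    print(A_criterion(best))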

Date Created
  • 2019

Optimal Sampling Designs for Functional Data Analysis

Description

Functional regression models are widely considered in practice. To precisely understand an underlying functional mechanism, a good sampling schedule for collecting informative functional data is necessary, especially when data collection is limited. However, little research has so far been conducted on optimal sampling schedule design for functional regression models. To address this design issue, efficient approaches are proposed for generating the best sampling plan in the functional regression setting. First, three optimal experimental designs are considered under a function-on-function linear model: the schedule that maximizes the relative efficiency for recovering the predictor function, the schedule that maximizes the relative efficiency for predicting the response function, and the schedule that maximizes a mixture of the relative efficiencies of both. The obtained sampling plan allows a precise recovery of the predictor function and a precise prediction of the response function. The proposed approach can also be reduced to identify the optimal sampling plan for a scalar-on-function linear regression model. In addition, the optimality criterion for predicting a scalar response using a functional predictor is derived when a quadratic relationship between the two variables is present, and proofs of important properties of the derived criterion are provided. To find such designs, an algorithm that is comparably fast and can generate nearly optimal designs is proposed. As the optimality criterion includes quantities that must be estimated from prior knowledge (e.g., a pilot study), the effectiveness of the suggested optimal design depends highly on the quality of these estimates. In many situations, however, the estimates are unreliable; thus, a bootstrap aggregating (bagging) approach is employed to enhance the quality of the estimates and to find sampling schedules that are stable to their misspecification. Case studies demonstrate that the proposed designs outperform other designs in terms of accurately predicting the response and recovering the predictor, and that the bagging-enhanced design is more robust to misspecification of the estimated quantities.
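
The sketch below illustrates the flavor of the design problem with a deliberately simplified stand-in (not the proposed method): sampling time points for the predictor function are chosen greedily to minimize an A-type criterion for estimating the coefficients of an assumed basis expansion in a scalar-on-function linear model. The Gaussian-bump basis, the criterion, and the greedy exchange are all illustrative assumptions.

    # Illustrative sketch (not the proposed algorithm): greedily choose sampling
    # time points so that the basis coefficients of a functional predictor are
    # well estimated under an A-type criterion.
    import numpy as np

    def bump_basis(t, centers, width=0.15):
        """Gaussian-bump basis standing in for a spline basis (assumption)."""
        return np.exp(-0.5 * ((t[:, None] - centers[None, :]) / width) ** 2)

    def a_value(times, centers):
        B = bump_basis(np.asarray(times), centers)
        M = B.T @ B + 1e-8 * np.eye(len(centers))      # ridge jitter for stability
        return np.trace(np.linalg.inv(M))

    def greedy_schedule(candidate_grid, centers, n_points):
        chosen = []
        for _ in range(n_points):
            remaining = [t for t in candidate_grid if t not in chosen]
            chosen.append(min(remaining, key=lambda t: a_value(chosen + [t], centers)))
        return sorted(chosen)

    centers = np.linspace(0, 1, 5)                     # basis "knots"
    grid = np.linspace(0, 1, 101)                      # candidate sampling times
    print(greedy_schedule(grid, centers, n_points=8))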

Date Created
  • 2020

Comparison of Denominator Degrees of Freedom Approximations for Linear Mixed Models in Small-Sample Simulations

Description

Whilst linear mixed models offer a flexible approach to handling data with multiple sources of random variability, hypothesis testing for the fixed effects often encounters obstacles when the sample size is small and the underlying distribution of the test statistic is unknown. Consequently, five methods of denominator degrees of freedom approximation (residual, containment, between-within, Satterthwaite, and Kenward-Roger) have been developed to overcome this problem. This study evaluates the performance of these five methods with a mixed model consisting of a random intercept and a random slope. Specifically, simulations are conducted to provide insight into the F-statistics, denominator degrees of freedom, and p-values each method gives under different settings of the sample structure, the fixed-effect slopes, and the missing-data proportion. The simulation results show that the residual method performs the worst in terms of F-statistics and p-values, and that the Satterthwaite and Kenward-Roger methods tend to be more sensitive to changes in the design. The Kenward-Roger method performs the best in terms of F-statistics when the null hypothesis is true.
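
For reference, the Satterthwaite method approximates the denominator degrees of freedom for a single fixed-effect contrast roughly as follows; this is a standard presentation of the idea, not the exact implementation used in the simulations:

    \nu_{\text{Satt}} \;=\;
    \frac{2\bigl(\ell^{\top}\widehat{C}(\hat\theta)\,\ell\bigr)^{2}}
         {\widehat{\operatorname{Var}}\bigl(\ell^{\top}\widehat{C}(\hat\theta)\,\ell\bigr)},
    \qquad
    \widehat{\operatorname{Var}}\bigl(\ell^{\top}\widehat{C}(\hat\theta)\,\ell\bigr)
    \;\approx\; g^{\top} A\, g,

where C(\theta) is the covariance matrix of the estimated fixed effects as a function of the variance parameters \theta, g is the gradient of \ell^{\top} C(\theta)\,\ell evaluated at \hat\theta, and A is the asymptotic covariance matrix of \hat\theta. The Kenward-Roger method additionally applies a small-sample adjustment to \widehat{C}(\hat\theta) before deriving an analogous approximation, which is one reason the two methods can behave differently in the settings studied here.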

Date Created
  • 2020

An information based optimal subdata selection algorithm for big data linear regression and a suitable variable selection algorithm

Description

This article proposes a new information-based optimal subdata selection (IBOSS) algorithm, the Squared Scaled Distance Algorithm (SSDA). It is based on the invariance of the determinant of the information matrix under orthogonal transformations, especially rotations. Extensive simulation results show that the new IBOSS algorithm retains the nice asymptotic properties of IBOSS and gives a larger determinant of the subdata information matrix. It has the same order of time complexity as the D-optimal IBOSS algorithm; however, by exploiting vectorized calculation and avoiding for loops, it is approximately six times as fast as the D-optimal IBOSS algorithm in R. The robustness of SSDA is studied from three aspects: nonorthogonality, inclusion of interaction terms, and variable misspecification. A new, accurate variable selection algorithm is proposed to aid the implementation of IBOSS algorithms when a large number of variables are present and only a sparse subset of them is important. By aggregating random-subsample results, this variable selection algorithm is much more accurate than the LASSO method applied to the full data. Since its time complexity is associated only with the number of variables, it is also very computationally efficient if the number of variables is fixed (and not massively large) as n increases. More importantly, because it uses subsamples, it solves the problem of the full data not fitting in memory when a data set is too large.
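
The sketch below contrasts the existing D-optimal IBOSS rule, which keeps the most extreme observations on each covariate, with a squared-scaled-distance style rule written from the description above, which keeps the observations farthest from the center in standardized coordinates. The latter is a plausible reading of SSDA, not necessarily its exact definition, and the D-rule shown is a simplified version that merges duplicate selections.

    # Sketch of two subdata selection rules for big-data linear regression:
    #   (1) D-optimal IBOSS: for each covariate, keep the k/(2p) smallest and
    #       k/(2p) largest observations (simplified: duplicates across covariates
    #       are merged, so slightly fewer than k points may be returned).
    #   (2) A squared-scaled-distance style rule, written from the description
    #       above: keep the k observations with the largest sum of squared
    #       standardized covariate values (may differ from the thesis's SSDA).
    import numpy as np

    def iboss_d(X, k):
        n, p = X.shape
        r = k // (2 * p)
        keep = set()
        for j in range(p):
            order = np.argsort(X[:, j])
            keep.update(order[:r])        # smallest values of covariate j
            keep.update(order[-r:])       # largest values of covariate j
        return np.fromiter(keep, dtype=int)

    def squared_scaled_distance(X, k):
        Z = (X - X.mean(axis=0)) / X.std(axis=0)
        d2 = np.sum(Z**2, axis=1)         # squared scaled distance from the center
        return np.argsort(d2)[-k:]

    rng = np.random.default_rng(4)
    X = rng.normal(size=(100_000, 5))
    Xc = X - X.mean(axis=0)
    for idx in (iboss_d(X, k=1000), squared_scaled_distance(X, k=1000)):
        sub = Xc[idx]
        print(len(idx), np.linalg.slogdet(sub.T @ sub)[1])   # log-det of subdata info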

Date Created
  • 2017

Three essays on comparative simulation in three-level hierarchical data structure

Description

Though the likelihood is a useful tool for obtaining estimates of regression parameters, it is not readily available when fitting hierarchical binary data models: the correlated observations negate the opportunity to form a joint likelihood when fitting hierarchical logistic regression models. Through conditional likelihood, inferences for the regression and covariance parameters as well as the intraclass correlation coefficients are usually obtained. In those cases, I have resorted to the Laplace approximation and large-sample-theory approaches for point and interval estimates, such as Wald-type confidence intervals and profile likelihood confidence intervals. These methods rely on distributional assumptions and large-sample theory; however, when dealing with small hierarchical datasets they often result in severe bias or non-convergence. I present a generalized quasi-likelihood approach and a generalized method of moments approach; neither relies on distributional assumptions, only on moments of the response. As an alternative to the typical large-sample-theory approach, I present bootstrapping of hierarchical logistic regression models, which provides more accurate interval estimates for small binary hierarchical data and substitutes computation for the traditional Wald-type and profile likelihood confidence intervals. I use a latent variable approach with a new split bootstrap method for estimating intraclass correlation coefficients when analyzing binary data obtained from a three-level hierarchical structure; it is especially useful with small sample sizes and is easily extended to more levels. Comparisons are made to existing approaches through both theoretical justification and simulation studies. Further, I demonstrate my findings through an analysis of three numerical examples: one based on cancer-in-remission data, one related to China’s antibiotic abuse study, and a third related to teacher effectiveness in schools from a state in the southwestern US.
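
As a generic illustration of the bootstrap idea for hierarchical binary data (a plain two-level cluster bootstrap, not the split bootstrap or three-level scheme developed in this work), the sketch below resamples whole clusters with replacement, refits an ordinary logistic regression each time, and forms percentile confidence intervals.

    # Minimal sketch of a cluster bootstrap for hierarchical binary data: resample
    # whole clusters with replacement, refit a logistic regression, and form
    # percentile confidence intervals. Generic two-level version of the idea only.
    import numpy as np
    import statsmodels.api as sm

    def cluster_bootstrap_ci(y, X, cluster, n_boot=500, alpha=0.05, seed=0):
        rng = np.random.default_rng(seed)
        ids = np.unique(cluster)
        boots = []
        for _ in range(n_boot):
            take = rng.choice(ids, size=len(ids), replace=True)
            rows = np.concatenate([np.flatnonzero(cluster == c) for c in take])
            fit = sm.GLM(y[rows], X[rows], family=sm.families.Binomial()).fit()
            boots.append(fit.params)
        boots = np.vstack(boots)
        lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
        return lo, hi

    # Simulated two-level data: cluster-specific intercepts induce correlation.
    rng = np.random.default_rng(5)
    n_clusters, m = 30, 10
    cluster = np.repeat(np.arange(n_clusters), m)
    x = rng.normal(size=n_clusters * m)
    u = rng.normal(scale=0.8, size=n_clusters)[cluster]
    p = 1 / (1 + np.exp(-(-0.5 + 1.0 * x + u)))
    y = rng.binomial(1, p)
    X = sm.add_constant(x)
    print(cluster_bootstrap_ci(y, X, cluster, n_boot=200))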

Date Created
  • 2017