Matching Items (13)

Threshold regression estimation via lasso, elastic-net, and lad-lasso: a simulation study with applications to urban traffic data

Description

Threshold regression is used to model regime-switching dynamics where the effects of the explanatory variables in predicting the response variable depend on whether a certain threshold has been crossed. When regime-switching dynamics are present, new estimation problems arise related to estimating the value of the threshold. Conventional methods utilize an iterative search procedure, seeking to minimize the sum of squares criterion. However, when unnecessary variables are included in the model or certain variables drop out of the model depending on the regime, this method may have high variability. This paper proposes Lasso-type methods as an alternative to ordinary least squares. By incorporating an L_{1} penalty term, Lasso methods perform variable selection, thus potentially reducing some of the variance in estimating the threshold parameter. This paper discusses the results of a study in which two different underlying model structures were simulated. The first is a regression model with correlated predictors, whereas the second is a self-exciting threshold autoregressive model. Finally, the proposed Lasso-type methods are compared to conventional methods in an application to urban traffic data.
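
As a rough illustration of the conventional search procedure mentioned in the abstract, the sketch below scans candidate thresholds and keeps the one that minimizes the total sum of squared residuals from separate least-squares fits in the two regimes. It is not the thesis's code: the function name `fit_threshold_ols`, the `trim` fraction, and the simulated data are all assumptions made for this example, and the Lasso-type variants the thesis proposes are not shown.

```python
import numpy as np

def fit_threshold_ols(x, y, q, trim=0.15):
    """Grid-search a threshold on q, fitting OLS separately in each regime.

    Hypothetical helper for illustration only. Returns (threshold, sse).
    """
    design = np.column_stack([np.ones(len(y)), x])  # prepend an intercept column
    # Candidate thresholds are observed values of q, trimmed so that
    # each regime keeps at least a `trim` share of the observations.
    lo, hi = np.quantile(q, [trim, 1 - trim])
    candidates = np.unique(q[(q >= lo) & (q <= hi)])

    best_sse, best_gamma = np.inf, None
    for gamma in candidates:
        low = q <= gamma
        sse = 0.0
        for mask in (low, ~low):
            beta, *_ = np.linalg.lstsq(design[mask], y[mask], rcond=None)
            resid = y[mask] - design[mask] @ beta
            sse += resid @ resid
        if sse < best_sse:
            best_sse, best_gamma = sse, gamma
    return best_gamma, best_sse

# Simulated two-regime data (assumed setup): the true threshold is 0.5.
rng = np.random.default_rng(0)
n = 400
q = rng.uniform(0, 1, n)
x = rng.normal(size=n)
y = np.where(q <= 0.5, 1.0 + 2.0 * x, -1.0 - 1.0 * x) + 0.1 * rng.normal(size=n)

gamma_hat, sse = fit_threshold_ols(x, y, q)
print(f"estimated threshold: {gamma_hat:.3f}")
```

A Lasso-type variant would presumably replace the inner least-squares fit with a penalized fit, but the abstract does not give enough detail to reproduce that step here.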

Date Created
2015

Robust experimental designs for fMRI with an uncertain design matrix

Description

Obtaining high-quality experimental designs to optimize statistical efficiency and data quality is quite challenging for functional magnetic resonance imaging (fMRI). The primary fMRI design issue is the selection of the best sequence of stimuli based on a statistically meaningful optimality criterion. Previous studies have provided some guidance and powerful computational tools for obtaining good fMRI designs. However, these results are mainly for basic experimental settings with simple statistical models. In this work, a type of modern fMRI experiment is considered, in which the design matrix of the statistical model depends not only on the selected design, but also on the experimental subject's probabilistic behavior during the experiment. The design matrix is thus uncertain at the design stage, making it difficult to select good designs. By taking this uncertainty into account, a very efficient approach for obtaining high-quality fMRI designs is developed in this study. The proposed approach is built upon an analytical result and an efficient computer algorithm. It is shown through case studies that the proposed approach can outperform an existing method in terms of computing time and the quality of the obtained designs.

Date Created
2014

Locally D-optimal designs for generalized linear models

Description

Generalized Linear Models (GLMs) are widely used for modeling responses with non-normal error distributions. When the values of the covariates in such models are controllable, finding an optimal (or at least efficient) design could greatly facilitate the work of collecting and analyzing data. In fact, many theoretical results are obtained on a case-by-case basis, while in other situations, researchers also rely heavily on computational tools for design selection.

Three topics are investigated in this dissertation, with each one focusing on one type of GLM. Topic I considers GLMs with factorial effects and one continuous covariate. Factors can have interactions among each other and there is no restriction on the possible values of the continuous covariate. The locally D-optimal design structures for such models are identified and results for obtaining smaller optimal designs using orthogonal arrays (OAs) are presented. Topic II considers GLMs with multiple covariates under the assumptions that all but one covariate are bounded within specified intervals and interaction effects among those bounded covariates may also exist. An explicit formula for D-optimal designs is derived and OA-based smaller D-optimal designs for models with one or two two-factor interactions are also constructed. Topic III considers multiple-covariate logistic models. All covariates are nonnegative and there is no interaction among them. Two types of D-optimal design structures are identified and their global D-optimality is proved using the celebrated equivalence theorem.

Date Created
2018

Maximin designs for event-related fMRI with uncertain error correlation

Description

One of the premier technologies for studying human brain functions is event-related functional magnetic resonance imaging (fMRI). The main design issue for such experiments is to find the optimal sequence for mental stimuli. This optimal design sequence allows for collecting informative data to make precise statistical inferences about the inner workings of the brain. Unfortunately, this is not an easy task, especially when the error correlation of the response is unknown at the design stage. In the literature, the maximin approach was proposed to tackle this problem. However, this is an expensive and time-consuming method, especially when the correlated noise follows high-order autoregressive models. The main focus of this dissertation is to develop an efficient approach to reduce the amount of computational resources needed to obtain A-optimal designs for event-related fMRI experiments. One proposed idea is to combine the Kriging approximation method, which is widely used in spatial statistics and computer experiments, with a knowledge-based genetic algorithm. Case studies show that the new search method achieves design efficiencies similar to those attained by the traditional method, while giving a significant reduction in computing time. Another useful strategy is also proposed to find such designs by considering only the boundary points of the parameter space of the correlation parameters. The usefulness of this strategy is also demonstrated via case studies. The first part of this dissertation focuses on finding optimal event-related designs for fMRI with simple trials when each stimulus consists of only one component (e.g., a picture). The study is then extended to the case of compound trials when stimuli of multiple components (e.g., a cue followed by a picture) are considered.

Date Created
2019

Optimal Experimental Designs for Mixed Categorical and Continuous Responses

Description

This study concerns optimal designs for experiments where responses consist of both binary and continuous variables. Many experiments in engineering, medical studies, and other fields have such mixed responses. Although in recent decades several statistical methods have been developed for jointly modeling both types of response variables, an effective way to design such experiments remains unclear. To address this void, some useful results are developed to guide the selection of optimal experimental designs in such studies. The results are mainly built upon a powerful tool called the complete class approach and a nonlinear optimization algorithm. The complete class approach was originally developed for a univariate response, but it is extended to the case of bivariate responses of mixed variable types. Consequently, the number of candidate designs is significantly reduced. An optimization algorithm is then applied to efficiently search the small class of candidate designs for the D- and A-optimal designs. Furthermore, the optimality of the obtained designs is verified by the general equivalence theorem. In the first part of the study, the focus is on a simple, first-order model. The study is then expanded to a model with a quadratic polynomial predictor. The obtained designs can help to render a precise statistical inference in practice or serve as a benchmark for evaluating the quality of other designs.

Date Created
2017

An information based optimal subdata selection algorithm for big data linear regression and a suitable variable selection algorithm

Description

This article proposes a new information-based optimal subdata selection (IBOSS) algorithm, the Squared Scaled Distance Algorithm (SSDA). It is based on the invariance of the determinant of the information matrix under orthogonal transformations, especially rotations. Extensive simulation results show that the new IBOSS algorithm retains the nice asymptotic properties of IBOSS and gives a larger determinant of the subdata information matrix. It has the same order of time complexity as the D-optimal IBOSS algorithm. However, it exploits the advantages of vectorized calculation, avoiding for loops, and is approximately 6 times as fast as the D-optimal IBOSS algorithm in R. The robustness of SSDA is studied from three aspects: nonorthogonality, inclusion of interaction terms, and variable misspecification. A new, accurate variable selection algorithm is proposed to help the implementation of IBOSS algorithms when a large number of variables are present with sparse important variables among them. Aggregating random subsample results, this variable selection algorithm is much more accurate than the LASSO method using full data. Since the time complexity is associated with the number of variables only, it is also very computationally efficient if the number of variables is fixed as n increases and not massively large. More importantly, by using subsamples it solves the problem that full data cannot be stored in memory when a data set is too large.
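
The abstract does not spell out SSDA itself, so the sketch below illustrates only the D-optimal IBOSS baseline it is compared against, as that baseline is usually described: for each covariate in turn, keep the rows with the smallest and largest values among those not yet selected. The function names, the use of a full sort, and the simulated data are assumptions for this example; the log-determinant of the subdata information matrix is the quantity the abstract refers to.

```python
import numpy as np

def iboss_d_optimal(x, k):
    """Select about k rows by taking extremes of each covariate in turn.

    Illustrative sketch (hypothetical helper): uses a full argsort for
    clarity, where a partial selection would be faster.
    """
    n, p = x.shape
    r = k // (2 * p)                      # rows taken from each tail per covariate
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)   # rows not yet selected
        order = idx[np.argsort(x[idx, j])]
        picks = np.concatenate([order[:r], order[-r:]])
        chosen.extend(picks.tolist())
        available[picks] = False
    return np.array(chosen)

def log_det_information(x):
    """Log-determinant of Z'Z for a linear model with an intercept."""
    z = np.column_stack([np.ones(len(x)), x])
    _, logdet = np.linalg.slogdet(z.T @ z)
    return logdet

# Assumed simulation setup: standard normal covariates.
rng = np.random.default_rng(1)
n, p, k = 20000, 3, 120
x = rng.normal(size=(n, p))

idx_iboss = iboss_d_optimal(x, k)
idx_random = rng.choice(n, size=k, replace=False)

logdet_iboss = log_det_information(x[idx_iboss])
logdet_random = log_det_information(x[idx_random])
print(f"log det, IBOSS subdata:  {logdet_iboss:.2f}")
print(f"log det, random subdata: {logdet_random:.2f}")
```

Comparing the two log-determinants gives a sense of why selecting extreme rows is attractive for linear regression: the subdata carries more information about the slopes than a random subsample of the same size.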

Date Created
2017

Experimental design issues in functional brain imaging with high temporal resolution

Description

Functional brain imaging experiments are widely conducted in many fields for studying the underlying brain activity in response to mental stimuli. For such experiments, it is crucial to select a good sequence of mental stimuli that allows researchers to collect informative data for making precise and valid statistical inferences at minimum cost. In contrast to most existing studies, the aim of this study is to obtain optimal designs for brain mapping technology with an ultra-high temporal resolution with respect to some common statistical optimality criteria. The first topic of this work is on finding optimal designs when the primary interest is in estimating the Hemodynamic Response Function (HRF), a function of time describing the effect of a mental stimulus on the brain. A major challenge here is that the design matrix of the statistical model is greatly enlarged. As a result, it is very difficult, if not infeasible, to compute and compare the statistical efficiencies of competing designs. To tackle this issue, an efficient approach built on subsampling the design matrix and using an efficient computer algorithm is proposed. It is demonstrated through analytical and simulation results that the proposed approach can outperform the existing methods in terms of computing time and the quality of the obtained designs. The second topic of this work is to find optimal designs when another set of popularly used basis functions is considered for modeling the HRF, e.g., to detect brain activations. Although the statistical model for analyzing the data remains linear, the parametric functions of interest under this setting are often nonlinear. The quality of the design will then depend on the true value of some unknown parameters. To address this issue, the maximin approach is considered to identify designs that maximize the relative efficiencies over the parameter space. As shown in the case studies, these maximin designs yield high performance for detecting brain activation compared to the traditional designs that are widely used in practice.

Date Created
2019

Essays on the identification and modeling of variance

Description

In the presence of correlation, generalized linear models cannot be employed to obtain regression parameter estimates. To appropriately address the extravariation due to correlation, methods to estimate and model the additional variation are investigated. A general form of the mean-variance relationship is proposed which incorporates the canonical parameter. The two variance parameters are estimated using generalized method of moments, negating the need for a distributional assumption. The mean-variance relation estimates are applied to clustered data and implemented in an adjusted generalized quasi-likelihood approach through an adjustment to the covariance matrix. In the presence of significant correlation in hierarchical structured data, the adjusted generalized quasi-likelihood model shows improved performance for random effect estimates. In addition, submodels to address deviation in skewness and kurtosis are provided to jointly model the mean, variance, skewness, and kurtosis. The additional models identify covariates influencing the third and fourth moments. A cutoff to trim the data is provided which improves parameter estimation and model fit. For each topic, findings are demonstrated through comprehensive simulation studies and numerical examples. Examples evaluated include data on children’s morbidity in the Philippines, adolescent health from the National Longitudinal Study of Adolescent to Adult Health, as well as proteomic assays for breast cancer screening.

Date Created
2018

A power study of GFfit statistics as components of Pearson chi-square

Description

The Pearson and likelihood ratio statistics are commonly used to test goodness-of-fit for models applied to data from a multinomial distribution. When data are from a table formed by cross-classification of a large number of variables, the common statistics may have low power and inaccurate Type I error level due to sparseness in the cells of the table. The GFfit statistic can be used to examine model fit in subtables. It is proposed to assess model fit by using a new version of the GFfit statistic based on orthogonal components of Pearson chi-square as a diagnostic to examine the fit on two-way subtables. However, due to variables with a large number of categories and small sample size, even the GFfit statistic may have low power and inaccurate Type I error level due to sparseness in the two-way subtable. In this dissertation, the theoretical power and empirical power of the GFfit statistic are studied. A method based on subsets of orthogonal components for the GFfit statistic on the subtables is developed to improve the performance of the GFfit statistic. Simulation results for power and Type I error rate for several different cases along with comparisons to other diagnostics are presented.

Date Created
2017

Contributions to Optimal Experimental Design and Strategic Subdata Selection for Big Data

Description

In this dissertation two research questions in the field of applied experimental design were explored. First, methods for augmenting the three-level screening designs called Definitive Screening Designs (DSDs) were investigated. Second, schemes for strategic subdata selection for nonparametric predictive modeling with big data were developed.

Under sparsity, the structure of DSDs can allow for the screening and optimization of a system in one step, but in non-sparse situations estimation of second-order models requires augmentation of the DSD. In this work, augmentation strategies for DSDs were considered, given the assumption that the correct form of the model for the response of interest is quadratic. Series of augmented designs were constructed and explored, and power calculations, model-robustness criteria, model-discrimination criteria, and simulation study results were used to identify the number of augmented runs necessary for (1) effectively identifying active model effects, and (2) precisely predicting a response of interest. When the goal is identification of active effects, it is shown that supersaturated designs are sufficient; when the goal is prediction, it is shown that little is gained by augmenting beyond the design that is saturated for the full quadratic model. Surprisingly, augmentation strategies based on the I-optimality criterion do not lead to better predictions than strategies based on the D-optimality criterion.

Computational limitations can render standard statistical methods infeasible in the face of massive datasets, necessitating subsampling strategies. In the big data context, the primary objective is often prediction but the correct form of the model for the response of interest is likely unknown. Here, two new methods of subdata selection were proposed. The first is based on clustering, the second is based on space-filling designs, and both are free from model assumptions. The performance of the proposed methods was explored visually via low-dimensional simulated examples; via real data applications; and via large simulation studies. In all cases the proposed methods were compared to existing, widely used subdata selection methods. The conditions under which the proposed methods provide advantages over standard subdata selection strategies were identified.

Date Created
2020