Three essays on comparative simulation in three-level hierarchical data structure

Description

Though the likelihood is a useful tool for obtaining estimates of regression parameters, it is not readily available when fitting hierarchical binary data models. Because the observations are correlated, a joint likelihood is not available when fitting hierarchical logistic regression models. Through conditional likelihood, inferences for the regression and covariance parameters as well as the intraclass correlation coefficients are usually obtained. In those cases, I have resorted to the Laplace approximation and a large sample theory approach for point and interval estimates, such as Wald-type confidence intervals and profile likelihood confidence intervals. These methods rely on distributional assumptions and large sample theory. However, when dealing with small hierarchical datasets, they often result in severe bias or non-convergence. I present a generalized quasi-likelihood approach and a generalized method of moments approach; neither relies on any distributional assumptions, only on moments of the response. As an alternative to the typical large sample theory approach, I present bootstrapping of hierarchical logistic regression models, which provides more accurate interval estimates for small binary hierarchical data. These models substitute computation for the traditional Wald-type and profile likelihood confidence intervals. I use a latent variable approach with a new split bootstrap method for estimating intraclass correlation coefficients when analyzing binary data obtained from a three-level hierarchical structure. It is especially useful with small sample sizes and is easily extended to more levels. Comparisons are made to existing approaches through both theoretical justification and simulation studies. Further, I demonstrate my findings through an analysis of three numerical examples: one based on data on cancer in remission, one related to a study of antibiotic abuse in China, and a third related to teacher effectiveness in schools in a state in the southwestern US.
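As a rough illustration of the latent variable approach to intraclass correlation mentioned above, the sketch below computes ICCs for a three-level random-intercept logistic model, using the standard result that the level-1 residual on the latent scale has variance pi^2/3. The function name and arguments are hypothetical, the variance components are assumed to be already estimated, and the split bootstrap itself is not reproduced here.

```python
import math


def latent_icc_three_level(var_level2: float, var_level3: float) -> dict:
    """ICCs on the latent scale for a three-level random-intercept logistic model.

    var_level2 and var_level3 are (assumed, already estimated) random-intercept
    variances at levels 2 and 3. Names are illustrative, not from the thesis.
    """
    if var_level2 < 0 or var_level3 < 0:
        raise ValueError("variance components must be non-negative")
    residual = math.pi ** 2 / 3  # variance of the standard logistic distribution
    total = var_level2 + var_level3 + residual
    return {
        # correlation between two units in the same level-3 cluster only
        "level3": var_level3 / total,
        # correlation between two units in the same level-2 (and level-3) cluster
        "level2_within_level3": (var_level2 + var_level3) / total,
    }
```

In a bootstrap setting, a function like this would be applied to the variance components re-estimated on each resample to build an interval for each ICC.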

Contributors: Wang, Bei (Author) / Wilson, Jeffrey R (Thesis advisor) / Kamarianakis, Ioannis (Committee member) / Reiser, Mark R. (Committee member) / St Louis, Robert (Committee member) / Zheng, Yi (Committee member) / Arizona State University (Publisher)

Created: 2017

Three essays on correlated binary outcomes: detection and appropriate models

Description

Correlation is common in many types of data, including those collected through longitudinal studies or in a hierarchical structure. In the case of clustering, or repeated measurements, there is inherent correlation between observations within the same group, or between observations obtained on the same subject. Longitudinal studies also introduce association between the covariates and the outcomes across time. When multiple outcomes are of interest, association may exist between the various models. These correlations can lead to issues in model fitting and inference if not properly accounted for. This dissertation presents three papers discussing appropriate methods for handling different types of association. The first paper introduces an ANOVA-based measure of intraclass correlation for three-level hierarchical data with binary outcomes, along with its properties. This measure is useful for evaluating when the correlation due to clustering warrants a more complex model. It is used to investigate AIDS knowledge in a clustered study conducted in Bangladesh. The second paper develops the Partitioned generalized method of moments (Partitioned GMM) model for longitudinal studies. This model utilizes valid moment conditions to separately estimate the varying effects of each time-dependent covariate on the outcome over time using multiple coefficients. The model is fit to data from the National Longitudinal Study of Adolescent to Adult Health (Add Health) to investigate risk factors of childhood obesity. In the third paper, the Partitioned GMM model is extended to jointly estimate regression models for multiple outcomes of interest. This approach thus takes into account both the correlation between the multivariate outcomes and the correlation due to time-dependency in longitudinal studies. The model utilizes an expanded weight matrix and objective function composed of valid moment conditions to simultaneously estimate optimal regression coefficients. This approach is applied to Add Health data to simultaneously study drivers of outcomes including smoking, social alcohol usage, and obesity in children.

Contributors: Irimata, Kyle (Author) / Wilson, Jeffrey R (Thesis advisor) / Broatch, Jennifer (Committee member) / Kamarianakis, Ioannis (Committee member) / Kao, Ming-Hung (Committee member) / Reiser, Mark R. (Committee member) / Arizona State University (Publisher)

Created: 2018

Locally D-optimal designs for generalized linear models

Description

Generalized Linear Models (GLMs) are widely used for modeling responses with non-normal error distributions. When the values of the covariates in such models are controllable, finding an optimal (or at least efficient) design could greatly facilitate the work of collecting and analyzing data. In fact, many theoretical results are obtained on a case-by-case basis, while in other situations, researchers also rely heavily on computational tools for design selection.

Three topics are investigated in this dissertation, each focusing on one type of GLM. Topic I considers GLMs with factorial effects and one continuous covariate. Factors can interact with each other and there is no restriction on the possible values of the continuous covariate. The locally D-optimal design structures for such models are identified and results for obtaining smaller optimal designs using orthogonal arrays (OAs) are presented. Topic II considers GLMs with multiple covariates under the assumptions that all but one covariate are bounded within specified intervals and that interaction effects among those bounded covariates may also exist. An explicit formula for D-optimal designs is derived and OA-based smaller D-optimal designs for models with one or two two-factor interactions are also constructed. Topic III considers multiple-covariate logistic models. All covariates are nonnegative and there is no interaction among them. Two types of D-optimal design structures are identified and their global D-optimality is proved using the celebrated equivalence theorem.
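To make the notion of local D-optimality concrete, the following sketch evaluates the D-criterion (determinant of the Fisher information matrix) for a logistic model with a single covariate. This is a deliberately simpler setting than the multi-covariate models treated in the dissertation. It assumes local parameter values of intercept 0 and slope 1, so the linear predictor equals the design point, and all function names are illustrative.

```python
import math


def logistic_weight(eta: float) -> float:
    """GLM weight p(1 - p) for the logistic model at linear predictor eta."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return p * (1.0 - p)


def d_criterion(design) -> float:
    """Determinant of the 2x2 information matrix for (intercept, slope).

    design is a list of (x, weight) pairs with weights summing to 1.
    Assumes local parameter values intercept = 0 and slope = 1, so eta = x.
    """
    m00 = m01 = m11 = 0.0
    for x, w in design:
        v = w * logistic_weight(x)
        m00 += v
        m01 += v * x
        m11 += v * x * x
    return m00 * m11 - m01 * m01


# Compare symmetric two-point designs with equal weights at +/- c.
for c in (1.0, 1.5434, 2.0):
    print(c, d_criterion([(-c, 0.5), (c, 0.5)]))
```

Among symmetric two-point designs, the determinant is largest near c = 1.5434, the root of c * tanh(c / 2) = 1, which is the well-known locally D-optimal support for this one-covariate case.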

Contributors: Wang, Zhongsheng (Author) / Stufken, John (Thesis advisor) / Kamarianakis, Ioannis (Committee member) / Kao, Ming-Hung (Committee member) / Reiser, Mark R. (Committee member) / Zheng, Yi (Committee member) / Arizona State University (Publisher)

Created: 2018

A study of components of Pearson's chi-square based on marginal distributions of cross-classified tables for binary variables

Description

The Pearson and likelihood ratio statistics are well-known in goodness-of-fit testing and are commonly used for models applied to multinomial count data. When data are from a table formed by the cross-classification of a large number of variables, these goodness-of-fit statistics may have lower power and inaccurate Type I error rate due to sparseness. Pearson's statistic can be decomposed into orthogonal components associated with the marginal distributions of observed variables, and an omnibus fit statistic can be obtained as a sum of these components. When the statistic is a sum of components for lower-order marginals, it has good performance for Type I error rate and statistical power even when applied to a sparse table. In this dissertation, goodness-of-fit statistics using orthogonal components based on second-, third-, and fourth-order marginals were examined. If lack-of-fit is present in higher-order marginals, then a test that incorporates the higher-order marginals may have higher power than a test that incorporates only first- and/or second-order marginals. To this end, two new statistics based on the orthogonal components of Pearson's chi-square that incorporate third- and fourth-order marginals were developed, and the Type I error, empirical power, and asymptotic power under different sparseness conditions were investigated. Individual orthogonal components as test statistics to identify lack-of-fit were also studied, and their performance was compared with that of other popular lack-of-fit statistics. When the number of manifest variables is 20 or more, most of the statistics based on marginal distributions are limited by computer resources and CPU time. For this setting, the performance of a bootstrap-based method to obtain p-values for the Pearson-Fisher statistic, fit to a confirmatory dichotomous variable factor analysis model, and the performance of the Tollenaar and Mooijaart (2003) statistic were investigated.
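For reference, the two omnibus statistics named at the start of the abstract can be computed as in the minimal sketch below. It assumes a fully specified multinomial model, with expected cell probabilities given rather than estimated, and uses hypothetical names. It does not implement the orthogonal-component decomposition studied in the dissertation.

```python
import math


def pearson_and_lr(observed, expected_probs):
    """Return (Pearson X^2, likelihood ratio G^2) for multinomial counts.

    observed: list of cell counts; expected_probs: model cell probabilities
    (assumed known here, all strictly positive and summing to 1).
    """
    n = sum(observed)
    x2 = 0.0
    g2 = 0.0
    for o, p in zip(observed, expected_probs):
        e = n * p  # expected count under the model
        x2 += (o - e) ** 2 / e
        if o > 0:  # empty cells contribute zero to G^2
            g2 += 2.0 * o * math.log(o / e)
    return x2, g2


x2, g2 = pearson_and_lr([10, 20, 30, 40], [0.25, 0.25, 0.25, 0.25])
print(x2, g2)
```

In a sparse table many expected counts `e` are small, which is what makes the chi-square reference distribution for these statistics unreliable and motivates statistics built from lower-order marginals.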

Contributors: Dassanayake, Mudiyanselage Maduranga Kasun (Author) / Reiser, Mark R. (Thesis advisor) / Kao, Ming-Hung (Committee member) / Wilson, Jeffrey (Committee member) / St. Louis, Robert (Committee member) / Kamarianakis, Ioannis (Committee member) / Arizona State University (Publisher)

Created: 2018

Essays on the identification and modeling of variance

Description

In the presence of correlation, generalized linear models cannot be employed to obtain regression parameter estimates. To appropriately address the extravariation due to correlation, methods to estimate and model the additional variation are investigated. A general form of the mean-variance relationship is proposed which incorporates the canonical parameter. The two variance parameters are estimated using generalized method of moments, negating the need for a distributional assumption. The mean-variance relation estimates are applied to clustered data and implemented in an adjusted generalized quasi-likelihood approach through an adjustment to the covariance matrix. In the presence of significant correlation in hierarchical structured data, the adjusted generalized quasi-likelihood model shows improved performance for random effect estimates. In addition, submodels to address deviation in skewness and kurtosis are provided to jointly model the mean, variance, skewness, and kurtosis. The additional models identify covariates influencing the third and fourth moments. A cutoff to trim the data is provided which improves parameter estimation and model fit. For each topic, findings are demonstrated through comprehensive simulation studies and numerical examples. Examples evaluated include data on children’s morbidity in the Philippines, adolescent health from the National Longitudinal Study of Adolescent to Adult Health, as well as proteomic assays for breast cancer screening.

Contributors: Irimata, Katherine E (Author) / Wilson, Jeffrey R (Thesis advisor) / Kamarianakis, Ioannis (Committee member) / Kao, Ming-Hung (Committee member) / Reiser, Mark R. (Committee member) / Stufken, John (Committee member) / Arizona State University (Publisher)

Created: 2018

Are NBA Video Games Representing the Real Game? A Statistical Comparison of Phoenix Suns' Shooting Patterns and their Video Game Counterpart

Description

This paper analyzes the Phoenix Suns' shooting patterns in real NBA games and compares them to the "NBA 2k16" Suns' shooting patterns. Data were collected from the first five Suns' games of the 2015-2016 season and the same games played in "NBA 2k16". The findings of this paper indicate that "NBA 2k16" utilizes statistical findings to model its gameplay. It was also determined that "NBA 2k16" modeled the shooting patterns of the Suns in the first five games of the 2015-2016 season very closely. Both the real Suns' games and the "NBA 2k16" Suns' games showed a higher probability of success for shots taken in the first eight seconds of the shot clock than in the last eight seconds. Similarly, both game types showed a trend in which the probability of success for a shot increases as a player holds onto the ball longer. This result was not expected for either game type; however, "NBA 2k16" modeled the findings consistently with the real Suns' games. The video game modeled the Suns with significantly more passes per possession than the real Suns' games, and it also showed a trend in which more passes per possession has a significant effect on the outcome of the shot. This trend was not present in the real Suns' games, although the literature supports this finding. Also, "NBA 2k16" did not correctly model the allocation of team shots for each player; however, the differences were found only in bench players. Lastly, "NBA 2k16" did not correctly allocate shots across the seven regions for Eric Bledsoe; however, there was no evidence that the game incorrectly modeled the allocation of shots for the other starters or the probability of success across the regions.

Contributors: Harrington, John P. (Author) / Armbruster, Dieter (Thesis director) / Kamarianakis, Ioannis (Committee member) / Chemical Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created: 2016-05

Prioritizing Projects on Time and Cost Savings for a more Efficient Manufacturing Process of a Semiconductor Company

Description

Over the course of six months, we have worked in partnership with Arizona State University and a leading producer of semiconductor chips in the United States market (referred to as the "Company"), lending our skills in finance, statistics, model building, and external insight. We attempt to design models that help predict how much time it takes to implement a cost-saving project. These projects had previously been considered only on the merit of cost savings, but with the added dimension of time, we hope to forecast implementation time from a number of variables. We can then apply such a forecast to an expense project prioritization model that relates time and cost savings, compares many different projects simultaneously, and returns a series of present value calculations over different ranges of time. The goal is twofold: assist with an accurate prediction of a project's time to implementation, and provide a basis to compare different projects based on their present values, ultimately helping to reduce the Company's manufacturing costs and improve gross margins. We believe this approach, and the research found toward this goal, is most valuable for the Company. Two coaches from the Company have provided assistance and clarified our questions when necessary throughout our research. In this paper, we begin by defining the problem, setting an objective, and establishing a checklist to monitor our progress. Next, our attention shifts to the data: making observations, trimming the dataset, and framing and scoping the variables to be used for the analysis portion of the paper. Before creating a hypothesis, we perform a preliminary statistical analysis of certain individual variables to enrich our variable selection process. After the hypothesis, we run multiple linear regressions with project duration as the dependent variable. After regression analysis and a test for robustness, we shift our focus to an intuitive model based on rules of thumb. We relate these models to an expense project prioritization tool developed using Microsoft Excel software. Our deliverables to the Company come in the form of (1) a rules of thumb intuitive model and (2) an expense project prioritization tool.
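The idea of relating implementation time to cost savings through present value can be sketched as follows. This is only an illustration under stated assumptions, not the Company's or the authors' tool (which was built in Excel): savings are assumed to be a constant monthly amount that begins the month after implementation finishes, discounting is monthly, and every name and parameter is hypothetical.

```python
def present_value_of_savings(monthly_saving: float,
                             months_to_implement: int,
                             horizon_months: int,
                             annual_rate: float) -> float:
    """Present value of savings that start after the implementation period.

    Assumes a constant monthly saving received at the end of each month from
    month (months_to_implement + 1) through horizon_months, discounted at a
    monthly rate equivalent to annual_rate.
    """
    monthly_rate = (1.0 + annual_rate) ** (1.0 / 12.0) - 1.0
    pv = 0.0
    for month in range(months_to_implement + 1, horizon_months + 1):
        pv += monthly_saving / (1.0 + monthly_rate) ** month
    return pv


# Two hypothetical projects with the same monthly saving but different
# forecast implementation times, compared over a 24-month horizon.
fast = present_value_of_savings(100.0, 3, 24, 0.08)
slow = present_value_of_savings(100.0, 9, 24, 0.08)
print(fast > slow)
```

Under these assumptions, a shorter forecast implementation time always raises present value for a fixed horizon, which is why a time forecast adds information beyond raw cost savings when ranking projects.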

Contributors: Al-Assi, Hashim (Co-author) / Chiang, Robert (Co-author) / Liu, Andrew (Co-author) / Ludwick, David (Co-author) / Simonson, Mark (Thesis director) / Hertzel, Michael (Committee member) / Barrett, The Honors College (Contributor) / Department of Information Systems (Contributor) / Department of Finance (Contributor) / Department of Economics (Contributor) / Department of Supply Chain Management (Contributor) / School of Accountancy (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Mechanical and Aerospace Engineering Program (Contributor) / WPC Graduate Programs (Contributor)

Created: 2015-05

Statistical Properties of Coherent Structures in Two Dimensional Turbulence

Description

Coherent vortices are ubiquitous structures in natural flows that affect mixing and transport of substances and momentum/energy. Being able to detect these coherent structures is important for pollutant mitigation, ecological conservation, and many other applications. In recent years, mathematical criteria and algorithms have been developed to extract these coherent structures in turbulent flows. In this study, we apply these tools to extract important coherent structures and analyze their statistical properties as well as their implications for the kinematics and dynamics of the flow. Such information will aid the representation of small-scale nonlinear processes that large-scale models of natural processes may not be able to resolve.

Contributors: Cass, Brentlee Jerry (Author) / Tang, Wenbo (Thesis director) / Kostelich, Eric (Committee member) / Department of Information Systems (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-05

Exploring the Relation Between NAV and Price of ETFs in Financial Markets

Description

Exchange traded funds (ETFs) are in many ways similar to more traditional closed-end mutual funds, although they differ in a crucial way. ETFs rely on a creation and redemption feature to achieve their functionality, and this mechanism is designed to minimize the deviations that occur between the ETF's listed price and the net asset value of the ETF's underlying assets. However, while this does cause ETF deviations to be generally lower than those of their mutual fund counterparts, as our paper explores, this process does not eliminate these deviations completely. This article builds on an earlier paper by Engle and Sarkar (2006) that investigates the properties of premiums (discounts) of ETFs from their fair market value, and examines whether these premia have changed in the last 10 years. Our paper then diverges from the original and takes a deeper look at the standard deviations of these premia specifically.

Our findings show that over 70% of an ETF's standard deviation of premia can be explained through a linear combination of two variables: a categorical variable (Domestic [US], Developed, Emerging) and a discrete variable (time difference from the US). This paper also finds that more traditional metrics such as market cap and ETF price volatility, and even third-party market indicators such as the economic freedom index and investment freedom index, are insignificant predictors of an ETF's standard deviation of premia when combined with the categorical variable. These findings differ somewhat from the existing literature, which indicates that these factors should have a significant impact on the ability to predict an ETF's standard deviation of premia.
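The quantity being modeled, the standard deviation of an ETF's premia, can be illustrated with the short sketch below. It assumes the premium is defined as the relative deviation of price from NAV, (price - NAV) / NAV, which is one common convention and may differ from the paper's exact definition. The function names and the sample figures are made up for illustration.

```python
import statistics


def premia(prices, navs):
    """Relative premium (positive) or discount (negative) of price over NAV.

    Assumes the definition (price - NAV) / NAV for each observation.
    """
    return [(p - n) / n for p, n in zip(prices, navs)]


def premium_std(prices, navs) -> float:
    """Sample standard deviation of the premia series."""
    return statistics.stdev(premia(prices, navs))


# Made-up daily closing prices and NAVs for a hypothetical ETF.
prices = [101.0, 99.0, 100.0]
navs = [100.0, 100.0, 100.0]
print(premium_std(prices, navs))
```

A per-ETF value computed this way would then serve as the response in a regression on the region category and the time difference from the US.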

Contributors: Zhang, Jingbo (Co-author) / Henning, Thomas (Co-author) / Simonson, Mark (Thesis director) / Licon, L. Wendell (Committee member) / Department of Finance (Contributor) / Department of Information Systems (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)

Created: 2019-05

Regression Analysis on Colony Collapse Disorder in the United States

Description

In the last decade, the population of honey bees across the globe has declined sharply, leaving scientists and beekeepers to wonder why. Among all nations, the United States has seen some of the greatest declines in the last 10-plus years. Without a definite explanation, the term Colony Collapse Disorder (CCD) was coined to describe the sudden and sharp decline of the honey bee colonies that beekeepers were experiencing. Colony collapses have exceeded expected averages over the years, and during the winter season losses are even more severe than what is normally acceptable. Possible explanations point towards meteorological variables, diseases, and even pesticide usage. Although the cause of CCD is unknown, thousands of beekeepers have reported their losses and, in the most recent years, even the numbers of infected colonies and colonies under certain stressors. Using the data reported to the United States Department of Agriculture (USDA), as well as weather data collected by the National Oceanic and Atmospheric Administration (NOAA) and the National Centers for Environmental Information (NCEI), regression analysis was used to find relationships between colony collapses during the winter months and both stressors in honey bee colonies and meteorological variables. The regression analysis focused on the winter season, or quarter 4 of the year, which includes the months of October, November, and December. In the model, the response variable was the percentage of colonies lost in quarter 4. Through the model, it was concluded that certain weather thresholds and the percentage increase of colonies under certain stressors were related to colony loss.

Contributors: Vasquez, Henry Antony (Author) / Zheng, Yi (Thesis director) / Saffell, Erinanne (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-05
