Matching Items (22)

Description

This work presents two complementary studies that propose heuristic methods to capture characteristics of data using the ensemble learning method of random forest. The first study is motivated by the problem in education of determining teacher effectiveness in student achievement. Value-added models (VAMs), constructed as linear mixed models, use students’ test scores as outcome variables and teachers’ contributions as random effects to ascribe changes in student performance to the teachers who have taught them. The VAM teacher score is the empirical best linear unbiased predictor (EBLUP). This approach is limited by the adequacy of the assumed model specification with respect to the unknown underlying model. In that regard, this study proposes alternative ways to rank teacher effects that are not dependent on a given model by introducing two variable importance measures (VIMs), the node-proportion and the covariate-proportion. These VIMs are novel because they take into account the final configuration of the terminal nodes in the constituent trees in a random forest. In a simulation study, under a variety of conditions, true rankings of teacher effects are compared with estimated rankings obtained using three sources: the newly proposed VIMs, existing VIMs, and EBLUPs from the assumed linear model specification. The newly proposed VIMs outperform all others in various scenarios where the model is misspecified. The second study develops two novel interaction measures. These measures could be used within but are not restricted to the VAM framework. The distribution-based measure is constructed to identify interactions in a general setting where a model specification is not assumed in advance. In turn, the mean-based measure is built to estimate interactions when the model specification is assumed to be linear. Both measures are unique in their construction; they take into account not only the outcome values, but also the internal structure of the trees in a random forest. In a separate simulation study, under a variety of conditions, the proposed measures are found to identify and estimate second-order interactions.
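
As a rough illustration of the kind of tree-level information such VIMs draw on, the sketch below fits a random forest to simulated data and ranks covariates two ways: by scikit-learn's built-in impurity importance and by a simple count of how often each covariate is used to split, across all trees. The data, the split-count measure, and the variable names are illustrative assumptions; the node-proportion and covariate-proportion VIMs proposed in the thesis are not reproduced here.

```python
# Sketch: ranking covariates with a random forest on simulated data.
# The split-count measure below is a simplified stand-in, NOT the thesis's
# node-proportion / covariate-proportion VIMs.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 500, 6
X = rng.normal(size=(n, p))
# Hypothetical outcome driven mainly by the first two covariates.
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Built-in impurity-based variable importance.
impurity_rank = np.argsort(forest.feature_importances_)[::-1]

# Simple alternative: count how often each covariate is chosen for a split.
split_counts = np.zeros(p)
for tree in forest.estimators_:
    used = tree.tree_.feature            # entries of -2 mark leaf nodes
    for f in used[used >= 0]:
        split_counts[f] += 1
split_rank = np.argsort(split_counts)[::-1]

print("impurity ranking   :", impurity_rank)
print("split-count ranking:", split_rank)
```
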
Contributors: Valdivia, Arturo (Author) / Eubank, Randall (Thesis advisor) / Young, Dennis (Committee member) / Reiser, Mark R. (Committee member) / Kao, Ming-Hung (Committee member) / Broatch, Jennifer (Committee member) / Arizona State University (Publisher)
Created: 2013
Description

Parallel Monte Carlo applications require the pseudorandom numbers used on each processor to be independent in a probabilistic sense. The TestU01 software package is the standard testing suite for detecting stream dependence and other properties that make certain pseudorandom generators ineffective in parallel (as well as serial) settings. TestU01 employs two basic schemes for testing parallel-generated streams. The first applies serial tests to the individual streams and then tests the resulting p-values for uniformity. The second turns all the parallel-generated streams into one long vector and then applies serial tests to the resulting concatenated stream. Various forms of stream dependence can be missed by each approach because neither one fully addresses the multivariate nature of the accumulated data when generators are run in parallel. This dissertation identifies these potential faults in the parallel testing methodologies of TestU01 and investigates two different methods to better detect inter-stream dependencies: correlation-motivated multivariate tests and tests based on vector time series. These methods have been implemented in an extension to TestU01 built in C++, and the unique aspects of this extension are discussed. A variety of different generation scenarios are then examined using the TestU01 suite in concert with the extension. This enhanced software package is found to better detect certain forms of inter-stream dependencies than the original TestU01 suites of tests.
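
The first of the two schemes can be sketched directly: run a serial uniformity test on each stream, then test the collected p-values for uniformity. The sketch below substitutes NumPy generators for the parallel streams and a Kolmogorov-Smirnov test for TestU01's batteries; both substitutions are assumptions made only for illustration, and, as noted above, this scheme can still miss inter-stream dependence.

```python
# Sketch of the "test each stream, then test the p-values" scheme, using
# NumPy streams and a KS test as stand-ins for TestU01 components.
import numpy as np
from scipy import stats

n_streams, n_per_stream = 64, 10_000
streams = [np.random.default_rng(seed) for seed in range(n_streams)]

# Step 1: apply a serial uniformity test to each stream individually.
p_values = [stats.kstest(g.random(n_per_stream), "uniform").pvalue
            for g in streams]

# Step 2: test the collected p-values themselves for uniformity on [0, 1].
# This looks at streams one at a time, so it can miss dependence *between*
# streams even when every individual stream looks uniform.
second_level = stats.kstest(p_values, "uniform")
print("second-level KS p-value:", second_level.pvalue)
```
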
Contributors: Ismay, Chester (Author) / Eubank, Randall (Thesis advisor) / Young, Dennis (Committee member) / Kao, Ming-Hung (Committee member) / Lanchier, Nicolas (Committee member) / Reiser, Mark R. (Committee member) / Arizona State University (Publisher)
Created: 2013
Description

It is common in the analysis of data to provide a goodness-of-fit test to assess the performance of a model. In the analysis of contingency tables, goodness-of-fit statistics are frequently employed when modeling social science, educational, or psychological data where the interest is often directed at investigating the association among multi-categorical variables. Pearson's chi-squared statistic is well-known in goodness-of-fit testing, but it is sometimes considered to produce an omnibus test as it gives little guidance to the source of poor fit once the null hypothesis is rejected. However, its components can provide powerful directional tests. In this dissertation, orthogonal components are used to develop goodness-of-fit tests for models fit to the counts obtained from the cross-classification of multi-category dependent variables. Ordinal categories are assumed. Orthogonal components defined on marginals are obtained when analyzing multi-dimensional contingency tables through the use of the QR decomposition. A subset of these orthogonal components can be used to construct limited-information tests that allow one to identify the source of lack-of-fit and provide an increase in power compared to Pearson's test. These tests can address the adverse effects that arise when data are sparse. The tests rely on the set of first- and second-order marginals jointly, the set of second-order marginals only, and the random forest method, a popular algorithm for modeling large complex data sets. The performance of these tests is compared to the likelihood ratio test as well as to tests based on orthogonal polynomial components. The derived goodness-of-fit tests are evaluated with studies for detecting two- and three-way associations that are not accounted for by a categorical variable factor model with a single latent variable. In addition, the tests are used to investigate the case when the model misspecification involves parameter constraints for large and sparse contingency tables. The methodology proposed here is applied to data from the 38th round of the State Survey conducted by the Institute for Public Policy and Social Research at Michigan State University (2005). The results illustrate the use of the proposed techniques in the context of a sparse data set.
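
For context on the omnibus nature of Pearson's statistic, the sketch below computes X² for a small made-up two-way table along with its cell-wise contributions, which give only crude guidance about where the fit fails. The table is a hypothetical example, and the orthogonal components on first- and second-order marginals developed in the dissertation are not reproduced here.

```python
# Sketch: Pearson's chi-squared for a 2-way table and its cell contributions.
# The dissertation's orthogonal marginal components (built via a QR
# decomposition) are a different, more structured decomposition.
import numpy as np
from scipy import stats

table = np.array([[30, 12, 8],
                  [14, 25, 11]])

chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)

# Cell-wise contributions to the omnibus statistic (they sum to chi2, but
# they are not orthogonal and do not form directional tests).
contributions = (table - expected) ** 2 / expected
print(f"X^2 = {chi2:.3f}, df = {dof}, p = {p:.4f}")
print(np.round(contributions, 3))
```
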
Contributors: Milovanovic, Jelena (Author) / Young, Dennis (Thesis advisor) / Reiser, Mark R. (Thesis advisor) / Wilson, Jeffrey (Committee member) / Eubank, Randall (Committee member) / Yang, Yan (Committee member) / Arizona State University (Publisher)
Created: 2011
Description

This dissertation involves three problems that are all related by the use of the singular value decomposition (SVD) or generalized singular value decomposition (GSVD). The specific problems are (i) derivation of a generalized singular value expansion (GSVE), (ii) analysis of the properties of the chi-squared method for regularization parameter selection in the case of nonnormal data and (iii) formulation of a partial canonical correlation concept for continuous time stochastic processes. The finite dimensional SVD has an infinite dimensional generalization to compact operators. However, the form of the finite dimensional GSVD developed in, e.g., Van Loan does not extend directly to infinite dimensions as a result of a key step in the proof that is specific to the matrix case. Thus, the first problem of interest is to find an infinite dimensional version of the GSVD. One such GSVE for compact operators on separable Hilbert spaces is developed. The second problem concerns regularization parameter estimation. The chi-squared method for nonnormal data is considered. A form of the optimized regularization criterion that pertains to measured data or signals with nonnormal noise is derived. Large sample theory for phi-mixing processes is used to derive a central limit theorem for the chi-squared criterion that holds under certain conditions. Departures from normality are seen to manifest in the need for a possibly different scale factor in normalization rather than what would be used under the assumption of normality. The consequences of our large sample work are illustrated by empirical experiments. For the third problem, a new approach is examined for studying the relationships between a collection of functional random variables. The idea is based on the work of Sunder that provides mappings to connect the elements of algebraic and orthogonal direct sums of subspaces in a Hilbert space. When combined with a key isometry associated with a particular Hilbert space indexed stochastic process, this leads to a useful formulation for situations that involve the study of several second order processes. In particular, using our approach with two processes provides an independent derivation of the functional canonical correlation analysis (CCA) results of Eubank and Hsing. For more than two processes, a rigorous derivation of the functional partial canonical correlation analysis (PCCA) concept that applies to both finite and infinite dimensional settings is obtained.
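
As a loose illustration of problem (ii), the sketch below solves a Tikhonov-regularized least squares problem through the SVD and chooses the regularization parameter so that the residual matches the (here known) squared noise norm. This discrepancy-style rule, and all of the simulated quantities, are stand-ins assumed for illustration; the chi-squared criterion and its large-sample analysis in the dissertation are not reproduced.

```python
# Sketch: SVD-based Tikhonov regularization with a residual-matching rule for
# the parameter. A simplified stand-in for, not a reproduction of, the
# chi-squared criterion studied in the dissertation.
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
m, n, sigma = 100, 40, 0.05
A = rng.normal(size=(m, n))
x_true = rng.normal(size=n)
noise = rng.normal(scale=sigma, size=m)
b = A @ x_true + noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
beta = U.T @ b                      # coordinates of b in the left singular basis

def residual_ss(lam):
    """Residual sum of squares of the Tikhonov solution with parameter lam."""
    filt = s**2 / (s**2 + lam)      # filter factors applied to the fit
    return np.sum((b - U @ (filt * beta)) ** 2)

# Pick lam so the residual equals the squared noise norm (known here because
# the data are simulated); a discrepancy-style surrogate for a chi-squared rule.
target = np.sum(noise**2)
lam = brentq(lambda t: residual_ss(t) - target, 0.0, 1e8)

x_hat = Vt.T @ ((s / (s**2 + lam)) * beta)
print(f"selected lambda: {lam:.4g}, relative error: "
      f"{np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true):.3f}")
```
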
Contributors: Huang, Qing (Author) / Eubank, Randall (Thesis advisor) / Renaut, Rosemary (Thesis advisor) / Cochran, Douglas (Committee member) / Gelb, Anne (Committee member) / Young, Dennis (Committee member) / Arizona State University (Publisher)
Created: 2012
Description

A least total area of triangle method was proposed by Teissier (1948) for fitting a straight line to data from a pair of variables without treating either variable as the dependent variable while allowing each of the variables to have measurement errors. This method is commonly called Reduced Major Axis (RMA) regression and is often used instead of Ordinary Least Squares (OLS) regression. Results for confidence intervals, hypothesis testing and asymptotic distributions of coefficient estimates in the bivariate case are reviewed. A generalization of RMA to more than two variables for fitting a plane to data is obtained by minimizing the sum of a function of the volumes obtained by drawing, from each data point, lines parallel to each coordinate axis to the fitted plane (Draper and Yang 1997; Goodman and Tofallis 2003). Generalized RMA results for the multivariate case obtained by Draper and Yang (1997) are reviewed and some investigations of multivariate RMA are given. A linear model is proposed that does not specify a dependent variable and allows for errors in the measurement of each variable. Coefficients in the model are estimated by minimization of the function of the volumes previously mentioned. Methods for obtaining coefficient estimates are discussed and simulations are used to investigate the distribution of coefficient estimates. The effects of sample size, sampling error and correlation among variables on the estimates are studied. Bootstrap methods are used to obtain confidence intervals for model coefficients. Residual analysis is considered for assessing model assumptions. Outlier and influential case diagnostics are developed and a forward selection method is proposed for subset selection of model variables. A real data example is provided that uses the methods developed. Topics for further research are discussed.
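
In the bivariate case the RMA line has a closed form: the slope is the ratio of the sample standard deviations, carrying the sign of the correlation, and the line passes through the point of means. The sketch below computes it on simulated errors-in-both-variables data and bootstraps a confidence interval for the slope; the data and settings are assumptions for illustration, and the multivariate, volume-based generalization studied in the thesis is not implemented.

```python
# Sketch: bivariate reduced major axis (RMA) regression with a bootstrap CI.
import numpy as np

def rma_fit(x, y):
    """RMA slope = sign(corr) * sd(y)/sd(x); the line passes through the means."""
    slope = np.sign(np.corrcoef(x, y)[0, 1]) * np.std(y, ddof=1) / np.std(x, ddof=1)
    return slope, np.mean(y) - slope * np.mean(x)

rng = np.random.default_rng(2)
n = 200
latent = rng.normal(size=n)
x = latent + rng.normal(scale=0.3, size=n)      # both variables carry measurement error
y = 1.5 * latent + rng.normal(scale=0.3, size=n)

slope, intercept = rma_fit(x, y)

# Nonparametric bootstrap for the slope.
boot_slopes = np.array([rma_fit(x[idx], y[idx])[0]
                        for idx in (rng.integers(0, n, size=n) for _ in range(2000))])
lo, hi = np.percentile(boot_slopes, [2.5, 97.5])
print(f"RMA slope {slope:.3f} with 95% bootstrap CI ({lo:.3f}, {hi:.3f})")
```
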
Contributors: Li, Jingjin (Author) / Young, Dennis (Thesis advisor) / Eubank, Randall (Thesis advisor) / Reiser, Mark R. (Committee member) / Kao, Ming-Hung (Committee member) / Yang, Yan (Committee member) / Arizona State University (Publisher)
Created: 2012
Description

A Guide to Financial Mathematics is a comprehensive and easy-to-use study guide for students studying for one of the first actuarial exams, Exam FM. While there are many resources available to students studying for these exams, this guide is free to students and offers an approach to the material similar to that presented in class at ASU. The guide is available to students and professors in the new Actuarial Science degree program offered by ASU. There are twelve chapters, including financial calculator tips, detailed notes, examples, and practice exercises. Included at the end of the guide is a list of referenced material.
Contributors: Dougher, Caroline Marie (Author) / Milovanovic, Jelena (Thesis director) / Boggess, May (Committee member) / Barrett, The Honors College (Contributor) / Department of Information Systems (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2015-05
Description

The use of generalized linear models in loss reserving is not new; many statistical models have been developed to fit the loss data gathered by various insurance companies. The most popular models belong to what Glen Barnett and Ben Zehnwirth in "Best Estimates for Reserves" call the "extended link ratio family (ELRF)," as they are developed from the chain ladder algorithm used by actuaries to estimate unpaid claims. Although these models are intuitive and easy to implement, they are nevertheless flawed because many of the assumptions behind the models do not hold true when fitted with real-world data. Even more problematically, ELRF models cannot account for environmental changes, such as inflation, that are often observed in practice. Barnett and Zehnwirth conclude that a new set of models that contain parameters for not only accident year and development period trends but also payment year trends would be a more accurate predictor of loss development. This research applies the paper's ideas to data gathered by Company XYZ. The data was fitted with an adapted version of Barnett and Zehnwirth's new model in R, and a trend selection algorithm was developed to accompany the regression code. The final forecasts were compared to Company XYZ's booked reserves to evaluate the predictive power of the model.
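
For reference, the chain ladder algorithm from which the ELRF is built can be shown in a few lines: volume-weighted age-to-age factors are computed from a cumulative loss triangle and applied to each accident year's latest diagonal. The triangle below is made up, and the sketch covers only the deterministic algorithm, not the regression models compared in the thesis.

```python
# Sketch: basic chain ladder on a made-up cumulative loss triangle
# (NaN marks unobserved cells). Illustrative only.
import numpy as np

triangle = np.array([
    [1000., 1800., 2100., 2200.],
    [1100., 2000., 2400., np.nan],
    [1200., 2300., np.nan, np.nan],
    [1300., np.nan, np.nan, np.nan],
])

n = triangle.shape[1]
factors = []
for k in range(n - 1):
    rows = ~np.isnan(triangle[:, k + 1])          # accident years observed at both ages
    factors.append(triangle[rows, k + 1].sum() / triangle[rows, k].sum())

# Project each accident year from its latest observed value to ultimate.
ultimates = []
for i in range(triangle.shape[0]):
    last = int(np.sum(~np.isnan(triangle[i])))    # number of observed development ages
    value = triangle[i, last - 1]
    for k in range(last - 1, n - 1):
        value *= factors[k]
    ultimates.append(value)

print("age-to-age factors:", np.round(factors, 3))
print("ultimate losses   :", np.round(ultimates, 1))
```
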
Contributors: Zhang, Zhihan Jennifer (Author) / Milovanovic, Jelena (Thesis director) / Tomita, Melissa (Committee member) / Zicarelli, John (Committee member) / W.P. Carey School of Business (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created: 2018-05
Description

Catastrophe events occur rather infrequently, but upon their occurrence, can lead to colossal losses for insurance companies. Due to their size and volatility, catastrophe losses are often treated separately from other insurance losses. In fact, many property and casualty insurance companies feature a department or team that focuses solely on modeling catastrophes. Setting reserves for catastrophe losses is difficult due to their unpredictable and often long-tailed nature. Determining loss development factors (LDFs) to estimate the ultimate loss amounts for catastrophe events is one method for setting reserves. In an attempt to help Company XYZ set more accurate reserves, the research conducted focuses on estimating LDFs for catastrophes which have already occurred and have been settled. Furthermore, the research describes the process used to build a linear model in R to estimate LDFs for Company XYZ's closed catastrophe claims from 2001 to 2016. This linear model was used to predict a catastrophe's LDFs based on the age in weeks of the catastrophe during the first year. Back testing was also performed, as was the comparison between the estimated ultimate losses and actual losses. Considerations for future research were proposed.
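
A stripped-down version of that workflow, regressing an observed development factor on the claim's age in weeks, might look like the sketch below. The data, the single-predictor form, and the log scale are all illustrative assumptions rather than Company XYZ's actual model.

```python
# Sketch: fit a simple model of LDF against age-in-weeks for closed
# catastrophe claims. Data and model form are illustrative assumptions only.
import numpy as np

# Hypothetical closed catastrophes: evaluation age (weeks within the first
# year) and the observed LDF (ultimate loss / loss reported to date).
age_weeks = np.array([4, 8, 13, 20, 26, 33, 39, 46, 52], dtype=float)
observed_ldf = np.array([3.10, 2.40, 1.90, 1.60, 1.40, 1.25, 1.15, 1.08, 1.03])

# Least squares on the log scale, since LDFs decay toward 1 as claims mature.
X = np.column_stack([np.ones_like(age_weeks), age_weeks])
coef, *_ = np.linalg.lstsq(X, np.log(observed_ldf), rcond=None)

def predicted_ldf(weeks):
    return np.exp(coef[0] + coef[1] * weeks)

print("predicted LDF at 10 weeks:", round(float(predicted_ldf(10.0)), 3))
```
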
Contributors: Swoverland, Robert Bo (Author) / Milovanovic, Jelena (Thesis director) / Zicarelli, John (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created: 2018-05
Description

This expository thesis explores the financial health and actuarial analysis of a particular solution for those seeking stability and security in their golden years: the CCRC industry. A continuing care retirement community, or CCRC, is a comprehensive project and campus that offers its residents a full spectrum of care from independent living, to assisted living, to skilled nursing. After reading this paper, any person with no prior knowledge of a continuing care retirement community should gain a firm understanding of the background, risks and benefits, and legislative safeguards of this complex industry. Financially, a CCRC operates in some respects like long-term care (LTC) insurance. However, CCRCs provide multiple levels of care operations while maintaining a pleasant, engaging community environment where seniors can have all their lifestyle needs met. The expensive and complex operations of a CCRC are not without risk: the industry has seen marked periods of bankruptcy followed by increasing and changing regulatory oversight. Thus, CCRCs require a periodic actuarial analysis and report, among an array of other legislative safeguards against bankruptcy. A CCRC's insolvency or inability to meet its obligations can be catastrophic and inflict suffering and damages not only on its residents but also on their friends and families. With seniors historically being one of the most vulnerable demographic groups, it is absolutely essential that an all-encompassing care facility continues to exist and fulfill its contractual promises by maintaining sound actuarial practices and financial health. This thesis, in addition to providing an exposition of the background and functions of the CCRC, describes the existing actuarial and financial studies and audits in practice to ensure sound governance and the quality of life of CCRC residents.
Contributors: Tang, Julie (Author) / Milovanovic, Jelena (Thesis director) / Hassett, Matthew J. (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created: 2017-05
Description

The Mack model and the Bootstrap Over-Dispersed Poisson model have long been the primary modeling tools used by actuaries and insurers to forecast losses. With the emergence of faster computational technology, new and novel methods to calculate and simulate data are more applicable than ever before. This paper explores the use of various Bayesian Markov chain Monte Carlo (MCMC) models recommended by Glenn Meyers and compares the results to the simulated data from the Mack model and the Bootstrap Over-Dispersed Poisson model. Although the Mack model and the Bootstrap Over-Dispersed Poisson model are accurate to a certain degree, newer models could be developed that may yield better results. However, a general concern is that no single model is able to reflect underlying information that only an individual who has intimate knowledge of the data would know. Thus, the purpose of this paper is not to distinguish one model that works for all applicable data, but to propose various models that have pros and cons and suggest ways that they can be improved upon.
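
To make the flavor of such comparisons concrete, the sketch below resamples individual age-to-age factors to produce a crude distribution of the reserve for a single accident year. The factors and latest paid amount are made up, and this deliberately simplified bootstrap is none of the models named above (Mack, Bootstrap Over-Dispersed Poisson, or Meyers's Bayesian MCMC models).

```python
# Sketch: crude bootstrap of individual age-to-age factors for one accident
# year. A simplified illustration only; not Mack, Bootstrap ODP, or MCMC.
import numpy as np

# Hypothetical observed individual development factors by age (cumulative basis).
factors_by_age = [
    np.array([1.80, 1.75, 1.90, 1.70]),   # age 1 -> 2
    np.array([1.20, 1.25, 1.15]),         # age 2 -> 3
    np.array([1.05, 1.06]),               # age 3 -> 4
]

latest_paid = 1300.0            # latest cumulative paid for the youngest year
rng = np.random.default_rng(3)

ultimates = np.empty(10_000)
for b in range(ultimates.size):
    value = latest_paid
    for f in factors_by_age:              # resample one factor per development age
        value *= rng.choice(f)
    ultimates[b] = value

reserve = ultimates - latest_paid
print("mean reserve:", round(reserve.mean(), 1),
      " 75th / 95th percentiles:", np.round(np.percentile(reserve, [75, 95]), 1))
```
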

Contributors: Zhang, Zhaobo (Author) / Zicarelli, John (Thesis director) / Milovanovic, Jelena (Committee member) / Barrett, The Honors College (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2023-05