Matching Items (14)
151517-Thumbnail Image.png
Description
Data mining is increasingly important in solving a variety of industry problems. Our initiative estimates resource requirements by skill set for future projects by mining and analyzing actual resource consumption data from past projects in the semiconductor industry. To achieve this goal we face difficulties such as relevant consumption data stored in inconsistent formats and insufficient data about project attributes with which to interpret the consumption data. Our first goal is to clean the historical data and organize it into meaningful structures for analysis. Once preprocessing is complete, data mining techniques such as clustering are applied to find projects that involve resources with similar skill sets and that are of similar complexity and size. This yields "resource utilization templates" for groups of related projects from a resource consumption perspective. Project characteristics that generate this diversity in headcounts and skill sets are then identified. These characteristics are not currently contained in the database and are elicited from the managers of historical projects, which represents an opportunity to improve the usefulness of the data collection system in the future. The ultimate goal is to match product technical features with the resource requirements of past projects as a model to forecast resource requirements by skill set for future projects. The forecasting model is developed using linear regression with cross-validation of the training data, as past project executions are relatively few in number. Acceptable levels of forecast accuracy are achieved relative to human experts' results, and the tool is applied to forecast resource demand for some future projects.
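The cross-validated forecasting step described above can be sketched in miniature. The fragment below fits a linear regression of headcount on hypothetical project characteristics and scores it with leave-one-out cross-validation, a natural choice when past projects are few (the predictors, sample size, and coefficients are illustrative assumptions, not values from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historical projects: two elicited characteristics
# (e.g. product complexity, die size) and the observed headcount.
X = rng.uniform(1.0, 10.0, size=(20, 2))
headcount = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0.0, 1.0, 20)

# Design matrix with an intercept column.
A = np.column_stack([np.ones(len(X)), X])

# Leave-one-out cross-validation: with few past projects, every
# project is held out once and forecast from the rest.
errors = []
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    coef, *_ = np.linalg.lstsq(A[mask], headcount[mask], rcond=None)
    pred = A[i] @ coef
    errors.append(pred - headcount[i])

cv_rmse = float(np.sqrt(np.mean(np.square(errors))))
```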
ContributorsBhattacharya, Indrani (Author) / Sen, Arunabha (Thesis advisor) / Kempf, Karl G. (Thesis advisor) / Liu, Huan (Committee member) / Arizona State University (Publisher)
Created2013
153442-Thumbnail Image.png
Description
The literature has identified a link between the built environment and non-motorized transport. This study contributes to that literature on the effects of the built environment on cycling by examining the case of the entire State of California. Physical built-environment features are classified into six groups: 1) local density, 2) diversity of land use, 3) road connectivity, 4) bike route length, 5) green space, and 6) job accessibility. Cycling trips over one week are investigated separately for all children, school children, adults, and employed adults. The regression analysis shows that cycling trips are significantly associated with several features of the built environment when socio-demographic factors are taken into account. Street intersections and bike route length tend to increase bicycle use, effects that are well aligned with the literature. Moreover, both local and regional job accessibility variables are statistically significant in the two adult models. However, residential density consistently has a significant negative effect on cycling trips, a result that needs further research to confirm. There is also a gap in the literature on how green space affects cycling, and the results of this study are too unclear to fill it. Elasticity analysis indicates that street intersections are the most powerful predictor of cycling trips. From another perspective, the effects of the built environment on cycling at the workplace (or school) are distinct from those at home. This study implies that a wide range of measures is available for planners to reduce vehicle travel by improving cycling levels in California.
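As a rough illustration of the elasticity analysis mentioned above, the sketch below fits an OLS model of cycling trips on a single hypothetical built-environment predictor and evaluates the elasticity at the sample means (all data and coefficients are made up for illustration; the study's actual models include many more predictors and controls):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical built-environment predictor (street intersections per
# square mile) and weekly cycling trips for n tracts.
intersections = rng.uniform(20.0, 200.0, n)
trips = 2.0 + 0.05 * intersections + rng.normal(0.0, 1.0, n)
trips = np.clip(trips, 0.1, None)  # keep trip counts positive

# OLS fit: trips = b0 + b1 * intersections
A = np.column_stack([np.ones(n), intersections])
(b0, b1), *_ = np.linalg.lstsq(A, trips, rcond=None)

# Elasticity evaluated at the sample means: the percent change in
# trips for a one-percent change in intersections.
elasticity = float(b1 * intersections.mean() / trips.mean())
```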
ContributorsWang, Kailai, M.U.E.P (Author) / Salon, Deborah (Thesis advisor) / Rey, Sergio (Committee member) / Li, Wenwen (Committee member) / Arizona State University (Publisher)
Created2015
153018-Thumbnail Image.png
Description
Urban scaling analysis has introduced a new scientific paradigm to the study of cities. With it, the notions of size, heterogeneity and structure have taken a leading role. These notions are assumed to be behind the causes for why cities differ from one another, sometimes wildly. However, the mechanisms by

Urban scaling analysis has introduced a new scientific paradigm to the study of cities. With it, the notions of size, heterogeneity and structure have taken a leading role. These notions are assumed to be behind the causes for why cities differ from one another, sometimes wildly. However, the mechanisms by which size, heterogeneity and structure shape the general statistical patterns that describe urban economic output are still unclear. Given the rapid rate of urbanization around the globe, we need precise and formal mathematical understandings of these matters. In this context, I perform in this dissertation probabilistic, distributional and computational explorations of (i) how the broadness, or narrowness, of the distribution of individual productivities within cities determines what and how we measure urban systemic output, (ii) how urban scaling may be expressed as a statistical statement when urban metrics display strong stochasticity, (iii) how the processes of aggregation constrain the variability of total urban output, and (iv) how the structure of urban skills diversification within cities induces a multiplicative process in the production of urban output.
ContributorsGómez-Liévano, Andrés (Author) / Lobo, Jose (Thesis advisor) / Muneepeerakul, Rachata (Thesis advisor) / Bettencourt, Luis M. A. (Committee member) / Chowell-Puente, Gerardo (Committee member) / Arizona State University (Publisher)
Created2014
150618-Thumbnail Image.png
Description
Coarsely grouped counts or frequencies are commonly used in the behavioral sciences. Grouped count and grouped frequency (GCGF) variables that are used as outcomes often violate the assumptions of linear regression as well as those of models designed for categorical outcomes; there is no analytic model designed specifically to accommodate GCGF outcomes. The purpose of this dissertation was to compare the statistical performance of four regression models (linear regression, Poisson regression, ordinal logistic regression, and beta regression) that can be used when the outcome is a GCGF variable. A simulation study was used to determine the power, type I error, and confidence interval (CI) coverage rates for these models under different conditions. Mean structure, variance structure, effect size, continuous or binary predictor, and sample size were included in the factorial design. Mean structures reflected either a linear relationship or an exponential relationship between the predictor and the outcome. Variance structures reflected homoscedastic (as in linear regression), heteroscedastic (monotonically increasing), or heteroscedastic (increasing then decreasing) variance. Small to medium, large, and very large effect sizes were examined. Sample sizes were 100, 200, 500, and 1000. Results of the simulation study showed that ordinal logistic regression produced type I error, statistical power, and CI coverage rates that were consistently within acceptable limits. Linear regression produced type I error and statistical power that were within acceptable limits, but CI coverage was too low for several conditions important to the analysis of counts and frequencies. Poisson regression and beta regression displayed inflated type I error, low statistical power, and low CI coverage rates for nearly all conditions. All models produced unbiased estimates of the regression coefficient.
Based on the statistical performance of the four models, ordinal logistic regression seems to be the preferred method for analyzing GCGF outcomes. Linear regression also performed well, but CI coverage was too low for conditions with an exponential mean structure and/or heteroscedastic variance. Some aspects of model prediction, such as model fit, were not assessed here; more research is necessary to determine which statistical model best captures the unique properties of GCGF outcomes.
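The simulation design described above can be sketched in miniature. The fragment below estimates the Type I error rate of a plain OLS t-test when the outcome is a coarsely grouped (top-coded) count generated under the null; it is a hypothetical, stripped-down analogue of one cell of such a factorial design, not the study's code:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
rejections = 0

for _ in range(reps):
    x = rng.normal(size=n)
    # Null condition: the outcome is a Poisson count unrelated to x,
    # coarsely grouped by top-coding at 4 ("4 or more").
    y = np.minimum(rng.poisson(2.0, n), 4).astype(float)

    # OLS slope and its standard error.
    A = np.column_stack([np.ones(n), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = resid @ resid / (n - 2)
    cov = s2 * np.linalg.inv(A.T @ A)
    t = coef[1] / np.sqrt(cov[1, 1])
    rejections += abs(t) > 1.96

# Under the null, the rejection rate should sit near the nominal 0.05.
type1 = rejections / reps
```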
ContributorsCoxe, Stefany (Author) / Aiken, Leona S. (Thesis advisor) / West, Stephen G. (Thesis advisor) / Mackinnon, David P (Committee member) / Reiser, Mark R. (Committee member) / Arizona State University (Publisher)
Created2012
150996-Thumbnail Image.png
Description
A least total area of triangle method was proposed by Teissier (1948) for fitting a straight line to data from a pair of variables without treating either variable as the dependent variable while allowing each of the variables to have measurement errors. This method is commonly called Reduced Major Axis (RMA) regression and is often used instead of Ordinary Least Squares (OLS) regression. Results for confidence intervals, hypothesis testing and asymptotic distributions of coefficient estimates in the bivariate case are reviewed. A generalization of RMA to more than two variables for fitting a plane to data is obtained by minimizing the sum of a function of the volumes obtained by drawing, from each data point, lines parallel to each coordinate axis to the fitted plane (Draper and Yang 1997; Goodman and Tofallis 2003). Generalized RMA results for the multivariate case obtained by Draper and Yang (1997) are reviewed and some investigations of multivariate RMA are given. A linear model is proposed that does not specify a dependent variable and allows for errors in the measurement of each variable. Coefficients in the model are estimated by minimization of the function of the volumes previously mentioned. Methods for obtaining coefficient estimates are discussed and simulations are used to investigate the distribution of coefficient estimates. The effects of sample size, sampling error and correlation among variables on the estimates are studied. Bootstrap methods are used to obtain confidence intervals for model coefficients. Residual analysis is considered for assessing model assumptions. Outlier and influential case diagnostics are developed and a forward selection method is proposed for subset selection of model variables. A real data example is provided that uses the methods developed. Topics for further research are discussed.
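For the bivariate case reviewed above, the RMA slope has a simple closed form: the ratio of the sample standard deviations, signed by the correlation. A minimal numerical sketch with made-up data follows; note that the OLS slope equals the correlation times the RMA slope, so OLS is attenuated toward zero relative to RMA:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(10.0, 2.0, 100)
y = 1.5 * x + rng.normal(0.0, 1.0, 100)

# RMA (reduced major axis) fit: neither variable is treated as
# error-free, and the slope is sign(r) * sd(y) / sd(x).
r = float(np.corrcoef(x, y)[0, 1])
slope_rma = np.sign(r) * y.std(ddof=1) / x.std(ddof=1)
intercept_rma = y.mean() - slope_rma * x.mean()

# The OLS slope is the correlation times the RMA slope.
slope_ols = r * slope_rma
```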
ContributorsLi, Jingjin (Author) / Young, Dennis (Thesis advisor) / Eubank, Randall (Thesis advisor) / Reiser, Mark R. (Committee member) / Kao, Ming-Hung (Committee member) / Yang, Yan (Committee member) / Arizona State University (Publisher)
Created2012
157274-Thumbnail Image.png
Description
Bayesian Additive Regression Trees (BART) is a non-parametric Bayesian model that often outperforms other popular predictive models in terms of out-of-sample error. This thesis studies a modified version of BART called Accelerated Bayesian Additive Regression Trees (XBART). The study consists of simulation and real-data experiments comparing XBART to other leading algorithms, including BART. The results show that XBART maintains BART's predictive power while reducing its computation time. The thesis also describes the development of a Python package implementing XBART.
ContributorsYalov, Saar (Author) / Hahn, P. Richard (Thesis advisor) / McCulloch, Robert (Committee member) / Kao, Ming-Hung (Committee member) / Arizona State University (Publisher)
Created2019
135352-Thumbnail Image.png
Description
The goal of our study is to identify socio-economic risk factors for depressive disorder and poor mental health by statistically analyzing survey data from the CDC. Identifying risk groups within a particular demographic could aid the development of targeted interventions to improve the overall quality of mental health in the United States. In our analysis, we studied the influences and correlations of socio-economic factors that regulate the risk of developing depressive disorders and overall poor mental health. Using the statistical software STATA, we ran a regression model of selected independent socio-economic variables against the dependent mental health variables. The independent variables of the model include income, race, state, age, marital status, sex, education, BMI, smoker status, and alcohol consumption. Once the regression coefficients were found, we illustrated the data in graphs and heat maps to provide qualitative visuals of the prevalence of depression across U.S. demographics. Our study indicates that low-income and under-educated populations who are everyday smokers, are obese, and/or are divorced or separated should be of main concern. A suggestion for mental health organizations would be to support counseling and therapeutic efforts as secondary care for those in smoking cessation programs, weight management programs, marriage counseling, or divorce assistance groups. General progress in alleviating poverty and increasing education could additionally counteract the prevalence of depressive disorder and improve overall mental health. The identification of these target groups and socio-economic risk factors is critical to developing future preventative measures.
ContributorsGrassel, Samuel (Co-author) / Choueiri, Alexi (Co-author) / Choueiri, Robert (Co-author) / Goegan, Brian (Thesis director) / Holter, Michael (Committee member) / Sandra Day O'Connor College of Law (Contributor) / School of Molecular Sciences (Contributor) / School of Politics and Global Studies (Contributor) / Economics Program in CLAS (Contributor) / Barrett, The Honors College (Contributor)
Created2016-05
133798-Thumbnail Image.png
Description
The United States Open Championship, often referred to as the U.S. Open, is one of the four major championships in professional golf. Held annually in June, the tournament changes venues each year and must meet strict criteria to challenge the best players in the world. In an evaluation conducted by the United States Golf Association, a potential course is assessed on its quality and design. The course is also evaluated on its ability to accommodate various obstructions and thousands of spectators while providing ample space for parking, easy transportation access, and close proximity to local airports and lodging. Of the thousands of courses in the United States, only a select few have had the opportunity to host a U.S. Open, and far fewer have hosted it on multiple occasions. Therefore, we are prepared to create the next venue capable of hosting many U.S. Open tournaments for years to come.
ContributorsCostello, Alec (Co-author) / Miller, Alec (Co-author) / Foster, William (Thesis director) / McIntosh, Daniel (Committee member) / W.P. Carey School of Business (Contributor) / Department of Finance (Contributor) / Economics Program in CLAS (Contributor) / School of International Letters and Cultures (Contributor) / Barrett, The Honors College (Contributor)
Created2018-05
134477-Thumbnail Image.png
Description
Today, statistical analysis can be used for a variety of purposes. In sports, and particularly in baseball, there is an increasing need for up-to-date analysis of players and their performance as they attempt to make it to the Major Leagues. Athletes are constantly moving within one or more organizations, and because they move so often, clubs spend ample time determining whether each move benefits the organization as a whole. The objective of this thesis is to use historical baseball statistics in StataSE to determine the performance levels of players who played at the major league level. From these, regression-based performance models are used to assess whether Major League Baseball organizations effectively and efficiently move players from their farm systems to the big leagues, allowing teams to see whether they in fact make the right decisions during the season. Several tasks were accomplished to achieve this outcome: 1. Data was obtained from the Baseball-Reference statistics database and sorted in Google Sheets so that analysis could be performed anywhere. 2. All 1,354 players who entered the major leagues in 2016 were assessed as to whether they started in a given league and stayed, were promoted from the minor leagues to the majors, or were demoted from the majors to the minors. 3. Based on prior baseball knowledge and offensive performance measures only, players' abilities were evaluated, and only those who were called up or sent down were included in the overall analysis. 4. The statistical software StataSE was used to further analyze whether any of the four major regression assumptions were violated; it was determined that logistic regression models would produce better results than a standard linear OLS model.
After testing multiple models and slightly refining the hypothesis, the adjustments produced a more accurate analysis of whether organizations make efficient moves when sending one player down to promote another. After producing the model, I investigated the level at which a player was deemed no longer able to perform at the Major League Baseball level.
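As a rough sketch of the kind of model the steps above arrive at, the fragment below fits a logistic regression by Newton-Raphson to hypothetical promotion data (the predictors, coefficients, and sample are invented stand-ins, and numpy is used here in place of StataSE):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400

# Hypothetical standardized offensive stats for call-up candidates
# (e.g. on-base percentage, slugging) and a 0/1 promotion outcome.
X = rng.normal(size=(n, 2))
logit = -0.5 + 0.8 * X[:, 0] + 1.2 * X[:, 1]
promoted = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Logistic regression fit by Newton-Raphson iterations.
A = np.column_stack([np.ones(n), X])
beta = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-A @ beta))
    W = p * (1.0 - p)                      # working weights
    grad = A.T @ (promoted - p)            # score vector
    hess = (A * W[:, None]).T @ A          # observed information
    beta = beta + np.linalg.solve(hess, grad)

# In-sample classification accuracy at the 0.5 threshold.
accuracy = float(np.mean((1.0 / (1.0 + np.exp(-A @ beta)) > 0.5) == promoted))
```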
ContributorsHayes, Andrew Joseph (Author) / Goegan, Brian (Thesis director) / Marburger, Daniel (Committee member) / Department of Economics (Contributor) / Barrett, The Honors College (Contributor)
Created2017-05
155855-Thumbnail Image.png
Description
Time-to-event analysis, or equivalently survival analysis, deals with two variables simultaneously: when an event occurs (time information) and whether the event occurrence is observed during the observation period (censoring information). In behavioral and social sciences, the event of interest usually does not lead to a terminal state such as death. Other outcomes after the event can be collected, and thus the survival variable can be considered a predictor as well as an outcome in a study. One example of a case where the survival variable serves as both is the survival-mediator model. In a single survival-mediator model, an independent variable X predicts a survival variable M, which in turn predicts a continuous outcome Y. The survival-mediator model consists of two regression equations: X predicting M (M-regression), and M and X simultaneously predicting Y (Y-regression). To estimate the regression coefficients of the survival-mediator model, Cox regression is used for the M-regression. Ordinary least squares regression is used for the Y-regression with complete-case analysis, assuming censored data in M are missing completely at random so that the Y-regression is unbiased. In this dissertation research, different measures of the indirect effect were proposed, and a simulation study was conducted to compare the performance of different indirect effect test methods. Bias-corrected bootstrapping produced high Type I error rates as well as low parameter coverage rates in some conditions. In contrast, the Sobel test produced low Type I error rates as well as high parameter coverage rates in some conditions. The bootstrap of the natural indirect effect produced low Type I error and low statistical power when the censoring proportion was non-zero. Percentile bootstrapping, the distribution of the product, and the joint-significance test showed the best performance. Statistical analysis of the survival-mediator model is discussed.
Two indirect effect measures, the ab-product and the natural indirect effect are compared and discussed. Limitations and future directions of the simulation study are discussed. Last, interpretation of the survival-mediator model for a made-up empirical data set is provided to clarify the meaning of the quantities in the survival-mediator model.
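A percentile-bootstrap test of the indirect effect, one of the better-performing methods above, can be sketched as follows. To keep the sketch self-contained it uses two OLS regressions on a fully observed mediator rather than the Cox regression of the actual survival-mediator model, and all data are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300

# Simplified single-mediator data with no censoring: X -> M -> Y.
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)             # a-path: X predicts M
y = 0.4 * m + 0.1 * x + rng.normal(size=n)   # b-path plus direct effect

def ab_estimate(x, m, y):
    # a from regressing M on X; b from regressing Y on X and M.
    a = np.linalg.lstsq(np.column_stack([np.ones(len(x)), x]),
                        m, rcond=None)[0][1]
    b = np.linalg.lstsq(np.column_stack([np.ones(len(x)), x, m]),
                        y, rcond=None)[0][2]
    return a * b

# Percentile bootstrap: resample cases with replacement, re-estimate
# ab, and take the 2.5th and 97.5th percentiles as the CI.
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    boot.append(ab_estimate(x[idx], m[idx], y[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
```

The indirect effect is judged significant when the interval (lo, hi) excludes zero; with the true ab = 0.5 * 0.4 = 0.2 here, it should.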
ContributorsKim, Han Joe (Author) / Mackinnon, David P. (Thesis advisor) / Tein, Jenn-Yun (Thesis advisor) / West, Stephen G. (Committee member) / Grimm, Kevin J. (Committee member) / Arizona State University (Publisher)
Created2017