Matching Items (17)

Description

The Experimental Data Processing (EDP) software is a C++ GUI-based application that streamlines the process of creating a model for structural systems based on experimental data. EDP is designed to process raw data, filter the data for noise and outliers, create a fitted model to describe that data, complete a probabilistic analysis to describe the variation between replicates of the experimental process, and analyze the reliability of a structural system based on that model. To help design the EDP software to perform the full analysis, the probabilistic and regression modeling aspects of this analysis have been explored. The focus has been on creating and analyzing probabilistic models for the data, adding multivariate and nonparametric fits to raw data, and developing computational techniques that allow these methods to be properly implemented within EDP. For creating a probabilistic model of replicate data, the normal, lognormal, gamma, Weibull, and generalized exponential distributions have been explored. Goodness-of-fit tests, including the chi-squared, Anderson-Darling, and Kolmogorov-Smirnov tests, have been used to analyze the effectiveness of each of these probabilistic models in describing the variation of parameters between replicates of an experimental test. An example using Young's modulus data for a Kevlar-49 swath stress-strain test was used to demonstrate how this analysis is performed within EDP. To implement the distributions, numerical solutions for the gamma, beta, and hypergeometric functions were implemented, along with an arbitrary-precision library to store numbers that exceed the maximum size of double-precision floating-point numbers. To create a multivariate fit, the multilinear solution was created as the simplest solution to the multivariate regression problem. This solution was then extended to solve nonlinear problems that can be linearized into multiple separable terms. These problems were solved analytically with the closed-form solution for the multilinear regression, and then numerically by using a QR decomposition, which avoids the numerical instabilities associated with matrix inversion. For nonparametric regression, or smoothing, the loess method was developed as a robust technique for filtering noise while maintaining the general structure of the data points. The loess solution was created by addressing concerns associated with simpler smoothing methods, including the running-mean, running-line, and kernel smoothing techniques, and combining the strengths of each of these methods to resolve those issues. The loess smoothing method involves weighting each point in a partition of the data set and then fitting either a line or a polynomial within that partition. Both linear and quadratic methods were applied to a carbon fiber compression test, showing that the quadratic model was more accurate but the linear model had a shape that was more effective for analyzing the experimental data. Finally, the EDP program itself was explored to consider its current functionalities for processing data, as described by shear tests on carbon fiber data, and the future functionalities to be developed. The probabilistic and raw data processing capabilities were demonstrated within EDP, and the multivariate and loess analyses were demonstrated using R.
As the functionality and relevant considerations for these methods have been developed, the immediate goal is to finish implementing and integrating these additional features into a version of EDP that performs a full, streamlined structural analysis on experimental data.
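As a rough illustration of the QR-based multilinear regression step mentioned in this abstract, the sketch below (in Python with NumPy rather than EDP's C++; the function name and data are hypothetical) fits the least-squares coefficients without forming a matrix inverse.

```python
# Illustrative sketch (not the EDP source): solving the multilinear regression
# y = X b by QR decomposition instead of inverting X^T X, which avoids the
# numerical instability mentioned in the abstract.
import numpy as np

def multilinear_fit_qr(X, y):
    """Least-squares coefficients via QR; X is n x p, y has length n."""
    X = np.column_stack([np.ones(len(X)), X])   # add an intercept column
    Q, R = np.linalg.qr(X)                      # X = Q R, R upper triangular
    return np.linalg.solve(R, Q.T @ y)          # solve R b = Q^T y, no inverse

# Hypothetical replicate data: two predictors, one response
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.1, size=50)
print(multilinear_fit_qr(X, y))                 # approximately [1.5, 2.0, -0.7]
```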
Contributors: Markov, Elan Richard (Author) / Rajan, Subramaniam (Thesis director) / Khaled, Bilal (Committee member) / Chemical Engineering Program (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Ira A. Fulton School of Engineering (Contributor) / Barrett, The Honors College (Contributor)
Created: 2016-05
Description

The NFL is one of the largest and most influential industries in the world. In America there are few companies that have a stronger hold on American culture and create such a phenomenon from year to year. This project aimed to develop a strategy that helps an NFL team be as successful as possible by defining which positions are most important to a team's success. Data from fifteen years of NFL games was collected, and information on every player in the league was analyzed. First, a benchmark describing an average team was established, and then every player in the NFL was compared to that average. Based on the properties of linear regression using ordinary least squares, this project defines a model that shows each position's importance. Finally, once such a model had been established, the focus turned to the NFL draft, where the goal was to find a strategy for where each position should be drafted so that it is most likely to give the best payoff based on the results of the regression in part one.
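A minimal sketch of the kind of ordinary least squares fit described above; the position groupings, numbers, and column names are hypothetical placeholders, not the study's actual data.

```python
# Hypothetical table: one row per team-season, one aggregate performance
# measure per position group, and the team's win total; fit by OLS.
import numpy as np
import pandas as pd

teams = pd.DataFrame({
    "qb": [0.8, 0.3, 0.6, 0.9, 0.2],
    "ol": [0.5, 0.4, 0.7, 0.6, 0.3],
    "db": [0.6, 0.2, 0.5, 0.8, 0.4],
    "wins": [11, 5, 9, 13, 4],
})

X = np.column_stack([np.ones(len(teams)), teams[["qb", "ol", "db"]]])
beta, *_ = np.linalg.lstsq(X, teams["wins"].to_numpy(), rcond=None)
print(dict(zip(["intercept", "qb", "ol", "db"], beta.round(2))))
```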
Contributors: Balzer, Kevin Ryan (Author) / Goegan, Brian (Thesis director) / Dassanayake, Maduranga (Committee member) / Barrett, The Honors College (Contributor) / Economics Program in CLAS (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2015-05
Description

We model communication among social insects as an interacting particle system in which individuals perform one of two tasks and neighboring sites anti-mimic one another. Parameters of our model are a probability of defection in (0, 1) and a relative cost c_i > 0 to the individual performing task i. We examine this process on complete graphs, bipartite graphs, and the integers, answering questions about the relationship between communication, defection rates, and the division of labor. Assuming the division of labor is ideal when exactly half of the colony is performing each task, we find that on some bipartite graphs and the integers it can eventually be made arbitrarily close to optimal if defection rates are sufficiently small. On complete graphs the fraction of individuals performing each task is also closest to one half when there is no defection, but is bounded by a constant dependent on the relative costs of each task.
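For intuition only, here is a toy Python simulation on the complete graph under one plausible reading of the dynamics; the update rule is assumed rather than taken from the thesis, and the cost parameters c_i are ignored.

```python
# Toy simulation: at each step a uniformly chosen individual either switches
# to a random task with the defection probability eps, or performs the task
# opposite to that of a randomly chosen neighbor (anti-mimicry).
import random

def simulate(n=200, eps=0.05, steps=100_000, seed=1):
    random.seed(seed)
    tasks = [random.randint(0, 1) for _ in range(n)]
    for _ in range(steps):
        i = random.randrange(n)
        if random.random() < eps:          # defection: pick a task at random
            tasks[i] = random.randint(0, 1)
        else:                              # anti-mimic another individual
            j = random.randrange(n - 1)
            j = j if j < i else j + 1      # any individual other than i
            tasks[i] = 1 - tasks[j]
    return sum(tasks) / n                  # fraction performing task 1

print(simulate(eps=0.0), simulate(eps=0.2))   # both hover near one half here
```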
Contributors: Arcuri, Alesandro Antonio (Author) / Lanchier, Nicolas (Thesis director) / Kang, Yun (Committee member) / Fewell, Jennifer (Committee member) / Barrett, The Honors College (Contributor) / School of International Letters and Cultures (Contributor) / Economics Program in CLAS (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2015-05
Description

Exchange-traded funds (ETFs) are in many ways similar to more traditional closed-end mutual funds, although they differ in a crucial way. ETFs rely on a creation and redemption feature to achieve their functionality, and this mechanism is designed to minimize the deviations that occur between the ETF's listed price and the net asset value of the ETF's underlying assets. However, while this does cause ETF deviations to be generally lower than those of their mutual fund counterparts, as our paper explores, this process does not eliminate these deviations completely. This article builds off an earlier paper by Engle and Sarkar (2006) that investigates these properties of premiums (discounts) of ETFs from their fair market value, and looks to see whether these premia have changed in the last 10 years. Our paper then diverges from the original and takes a deeper look specifically into the standard deviations of these premia.

Our findings show that over 70% of an ETF's standard deviation of premia can be explained through a linear combination of two variables: a categorical variable (Domestic [US], Developed, Emerging) and a discrete variable (time difference from the US). This paper also finds that more traditional metrics such as market cap, ETF price volatility, and even third-party market indicators such as the economic freedom index and the investment freedom index are insignificant predictors of an ETF's standard deviation of premia. These findings differ somewhat from the existing literature, which indicates that these factors should have a significant impact on predicting an ETF's standard deviation of premia.
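A small sketch of the two-variable linear model described above, with made-up ETF rows; the region labels and time-difference values are illustrative placeholders.

```python
# Regress the standard deviation of premia on a region category (Domestic as
# the baseline) and the trading-hours offset from the US, using synthetic rows.
import numpy as np
import pandas as pd

etfs = pd.DataFrame({
    "region": ["Domestic", "Developed", "Emerging", "Developed", "Emerging"],
    "time_diff_hours": [0, 6, 9, 8, 12],           # offset from US trading hours
    "sd_premium": [0.05, 0.35, 0.80, 0.45, 0.95],  # std. dev. of premia (%)
})

region = pd.Categorical(etfs["region"],
                        categories=["Domestic", "Developed", "Emerging"])
X = pd.get_dummies(region, drop_first=True).astype(float)  # Domestic is baseline
X["time_diff_hours"] = etfs["time_diff_hours"]
X.insert(0, "intercept", 1.0)

beta, *_ = np.linalg.lstsq(X.to_numpy(), etfs["sd_premium"].to_numpy(), rcond=None)
print(dict(zip(X.columns, beta.round(3))))
```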
Contributors: Henning, Thomas Louis (Co-author) / Zhang, Jingbo (Co-author) / Simonson, Mark (Thesis director) / Wendell, Licon (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Department of Finance (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-05
Description

In the last decade, the population of honey bees across the globe has declined sharply, leaving scientists and beekeepers to wonder why. Amongst all nations, the United States has seen some of the greatest declines in the last 10-plus years. Without a definite explanation, the term Colony Collapse Disorder (CCD) was coined to describe the sudden and sharp decline of the honey bee colonies that beekeepers were experiencing. Colony collapses have been rising above expected averages over the years, and during the winter season losses are even more severe than what is normally considered acceptable. There are some possible explanations pointing towards meteorological variables, diseases, and even pesticide usage. Despite the cause of CCD being unknown, thousands of beekeepers have reported their losses, and even the numbers of infected colonies and colonies under certain stressors in the most recent years. Using the data that was reported to the United States Department of Agriculture (USDA), as well as weather data collected by the National Oceanic and Atmospheric Administration (NOAA) and the National Centers for Environmental Information (NCEI), regression analysis was used to find relationships between stressors in honey bee colonies, meteorological variables, and colony collapses during the winter months. The regression analysis focused on the winter season, or quarter 4 of the year, which includes the months of October, November, and December. In the model, the response variable was the percentage of colonies lost in quarter 4. Through the model, it was concluded that certain weather thresholds and the percentage increase of colonies under certain stressors were related to colony loss.
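An illustrative sketch of the quarter-4 regression described above, using hypothetical rows; the column names are placeholders, not the actual USDA or NOAA field names.

```python
# Regress quarter-4 colony loss on stressor percentages and a weather variable
# with made-up data, using ordinary least squares from statsmodels.
import pandas as pd
import statsmodels.api as sm

q4 = pd.DataFrame({
    "pct_varroa":    [25.0, 40.0, 15.0, 55.0, 30.0, 45.0],  # % colonies under a stressor
    "pct_pesticide": [10.0, 12.0,  5.0, 20.0,  8.0, 15.0],
    "mean_temp_c":   [ 5.0, -2.0, 10.0, -6.0,  3.0, -1.0],  # Oct-Dec mean temperature
    "pct_lost":      [12.0, 22.0,  8.0, 30.0, 14.0, 25.0],  # % colonies lost in Q4
})

X = sm.add_constant(q4[["pct_varroa", "pct_pesticide", "mean_temp_c"]])
model = sm.OLS(q4["pct_lost"], X).fit()
print(model.params.round(3))
```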
Contributors: Vasquez, Henry Antony (Author) / Zheng, Yi (Thesis director) / Saffell, Erinanne (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created: 2018-05
Description

Serge Galam's voting systems and public debate models are used to model voting behaviors of two competing opinions in democratic societies. Galam assumes that individuals in the population are independently in favor of one opinion with a fixed probability p, making the initial number of that type of opinion a binomial random variable. This analysis revisits Galam's models from the point of view of the hypergeometric random variable by assuming the initial number of individuals in favor of an opinion is a fixed deterministic number. This assumption is more realistic, especially when analyzing small populations. Evolution of the models is based on majority rules, with a bias introduced when there is a tie. For the hierarchical voting system model, in order to derive the probability that opinion +1 would win, the analysis was done by reversing time and assuming that an individual in favor of opinion +1 wins. Then, working backwards, we counted the number of configurations at the next lowest level that could induce each possible configuration at the level above, and continued this process until reaching the bottom level, i.e., the initial population. Using this method, we were able to derive an explicit formula for the probability that an individual in favor of opinion +1 wins given any initial count of that opinion, for any group size greater than or equal to three. For the public debate model, we counted the total number of individuals in favor of opinion +1 at each time step and used this variable to define a random walk. Then, we used first-step analysis to derive an explicit formula for the probability that an individual in favor of opinion +1 wins given any initial count of that opinion for group sizes of three. The spatial public debate model evolves based on the proportional rule. For the spatial model, the most natural graphical representation of the process results in a model that is not mathematically tractable. Thus, we defined a different graphical representation that is mathematically equivalent to the first, but in which it is possible to define a dual process that is mathematically tractable. Using this graphical representation, we prove clustering in 1D and 2D and coexistence in higher dimensions, following the same approach as for the voter model interacting particle system.
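As a small illustration of the hypergeometric setup described above, the sketch below computes the probability that a single group, drawn without replacement from a finite population with a fixed number of +1 supporters, elects opinion +1 under majority rule with a tie bias. It illustrates the setup only, not the thesis's explicit formula.

```python
# One bottom-level step of a hierarchical majority-rule model: the number of
# +1 voters in a group of size s drawn from N individuals (k of them +1) is
# hypergeometric; ties are broken toward +1 with probability tie_bias.
from scipy.stats import hypergeom

def prob_group_elects_plus(N, k, s, tie_bias=1.0):
    """P(a random group of size s elects opinion +1)."""
    p = 0.0
    for x in range(s + 1):                 # x = number of +1 voters in the group
        px = hypergeom.pmf(x, N, k, s)
        if 2 * x > s:                      # strict majority for +1
            p += px
        elif 2 * x == s:                   # tie: +1 wins with probability tie_bias
            p += tie_bias * px
    return p

print(prob_group_elects_plus(N=100, k=55, s=3))   # odd group size, no ties possible
print(prob_group_elects_plus(N=100, k=50, s=4))   # even group, tie broken toward +1
```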
Contributors: Taylor, Nicole Robyn (Co-author) / Lanchier, Nicolas (Co-author, Thesis director) / Smith, Hal (Committee member) / Hurlbert, Glenn (Committee member) / Barrett, The Honors College (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2013-05
Description

This paper begins by discussing the potential uses and challenges of efficient and accurate traffic forecasting. The data we used includes traffic volume from seven locations on a busy Athens street in April and May of 2000. This data was used as part of a traffic forecasting competition. Our initial observation was that, due to the volatility and oscillating nature of daily traffic volume, simple linear regression models will not perform well in predicting the time-series data. To address this, we present the harmonic time series model. Such a model (assuming all predictors are significant) will include a sinusoidal term for each time index within a period of data. Our assumption is that traffic volumes have a period of one week (which is evidenced by the graphs reproduced in our paper). This leads to a model that has 6,720 sine and cosine terms. This is clearly too many coefficients, so in an effort to avoid over-fitting and to obtain an efficient model, we apply the subset selection algorithm known as the adaptive lasso.
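A compact sketch of a harmonic design matrix with a weekly period, fitted with a lasso that prunes most of the sine and cosine terms; the data are synthetic and a plain lasso stands in for the adaptive lasso used in the paper.

```python
# Build weekly sine/cosine regressors for synthetic traffic counts, then let
# the lasso zero out most of the harmonic coefficients.
import numpy as np
from sklearn.linear_model import Lasso

period = 7 * 24 * 4                 # one week of 15-minute counts (illustrative)
t = np.arange(4 * period)           # four weeks of synthetic observations
rng = np.random.default_rng(0)
y = 100 + 30 * np.sin(2 * np.pi * t / period) + rng.normal(0, 5, t.size)

n_harmonics = 50                    # far fewer terms than the full weekly model above
k = np.arange(1, n_harmonics + 1)
X = np.column_stack([np.sin(2 * np.pi * np.outer(t, k) / period),
                     np.cos(2 * np.pi * np.outer(t, k) / period)])

fit = Lasso(alpha=1.0, max_iter=10_000).fit(X, y)
print("nonzero coefficients:", np.count_nonzero(fit.coef_))
```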
Contributors: Mora, Juan (Author) / Kamarianakis, Ioannis (Thesis director) / Yu, Wanchunzi (Committee member) / W. P. Carey School of Business (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created: 2017-05
Description

This paper attempts to introduce analytics and regression techniques into the National Hockey League. Hockey as a sport has been a slow adopter of analytics, and this can be attributed to poor data collection methods. Using data collected from hockeyreference.com and R statistical software, the number of wins a team experiences is predicted using Goals For and Goals Against statistics from 2005-2017. The model showed statistical significance and strong normality throughout the data. The number of wins each team was expected to experience in 2016-2017 was predicted using the model and then compared to the actual number of games each team won. To further analyze the validity of the model, the expected playoff outcome for 2016-2017 was compared to the observed playoff outcome. The discussion focused on teams that did not fit the model or traditional analytics and expected forecasts. The possible discrepancies were analyzed using the Las Vegas Golden Knights as a case study. Possible next steps for data analysis are presented, and the role of future technology and innovation in hockey analytics is discussed and predicted.
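A minimal sketch of the wins-on-goals regression described above, with a few made-up team-season rows standing in for the 2005-2017 data.

```python
# Fit wins on Goals For and Goals Against with synthetic team-season rows,
# then predict wins for a hypothetical 2016-17 team for comparison.
import pandas as pd
from sklearn.linear_model import LinearRegression

seasons = pd.DataFrame({
    "goals_for":     [266, 240, 218, 201, 230],
    "goals_against": [209, 225, 230, 250, 215],
    "wins":          [51, 44, 38, 31, 46],
})

model = LinearRegression().fit(seasons[["goals_for", "goals_against"]], seasons["wins"])
print("intercept:", round(model.intercept_, 2), "coef (GF, GA):", model.coef_.round(3))

new_team = pd.DataFrame({"goals_for": [250], "goals_against": [220]})
print("predicted wins:", model.predict(new_team).round(1))
```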
Contributors: Vermeer, Brandon Elliot (Author) / Goegan, Brian (Thesis director) / Eaton, John (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Department of Finance (Contributor) / Barrett, The Honors College (Contributor)
Created: 2018-05
Description

Cryptocurrencies have become one of the most fascinating forms of currency and economics due to their fluctuating values and lack of centralization. This project attempts to use machine learning methods to effectively model in-sample data for Bitcoin and Ethereum using rule induction methods. The dataset is cleaned by removing entries with missing data. A new column is created to measure the price difference, allowing a more accurate analysis of the change in price. Eight relevant variables are selected using cross-validation: the total number of bitcoins, the total size of the blockchain, the hash rate, mining difficulty, revenue from mining, transaction fees, the cost of transactions, and the estimated transaction volume. The in-sample data is modeled using a simple tree fit, first with one variable and then with eight. Using all eight variables, the in-sample model and data have a correlation of 0.6822657. The in-sample model is improved by first applying bootstrap aggregation (also known as bagging) to fit 400 decision trees to the in-sample data using one variable. Then the random forests technique is applied to the data using all eight variables. This results in a correlation between the model and data of 0.9443413. The random forests technique is then applied to an Ethereum dataset, resulting in a correlation of 0.6904798. Finally, an out-of-sample model is created for Bitcoin and Ethereum using random forests, with a benchmark correlation of 0.03 for financial data. The correlation between the training model and the testing data for Bitcoin was 0.06957639, while for Ethereum the correlation was -0.171125. In conclusion, it is confirmed that cryptocurrencies can have accurate in-sample models by applying the random forests method to a dataset. Out-of-sample modeling is more difficult, but in some cases the results are better than those for typical forms of financial data. It should also be noted that cryptocurrency data has properties similar to other related financial datasets, suggesting future potential for modeling cryptocurrency systems within the financial world.
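A sketch of the random-forests step described above, using synthetic stand-ins for the eight blockchain variables; the column names and data are illustrative only.

```python
# Fit a random forest to a synthetic price-difference target built from eight
# placeholder blockchain features, then report the in-sample correlation.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
features = ["total_btc", "blockchain_size", "hash_rate", "difficulty",
            "miner_revenue", "tx_fees", "cost_per_tx", "est_tx_volume"]
X = pd.DataFrame(rng.normal(size=(n, 8)), columns=features)
price_diff = 2 * X["hash_rate"] + X["est_tx_volume"] + rng.normal(scale=0.5, size=n)

forest = RandomForestRegressor(n_estimators=400, random_state=0).fit(X, price_diff)
in_sample_corr = np.corrcoef(forest.predict(X), price_diff)[0, 1]
print(round(in_sample_corr, 4))   # in-sample correlations are typically very high
```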
Contributors: Browning, Jacob Christian (Author) / Meuth, Ryan (Thesis director) / Jones, Donald (Committee member) / McCulloch, Robert (Committee member) / Computer Science and Engineering Program (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)
Created: 2018-05
Description

We study two models of a competitive game in which players continuously receive points and wager them in one-on-one battles. In each model the loser of a battle has their points reset, while what the winner receives is what sets the two models apart. In the knockout model the winner receives no new points, while in the winner-takes-all model the points that the loser had are added to the winner's total. Recurrence properties are assessed for both models: the knockout model is recurrent except for the all-zero state, and the winner-takes-all model is transient but retains some aspects of recurrence. In addition, we study the population-level allocation of points; for the winner-takes-all model we show explicitly that the proportion of individuals having any number j of points, j = 0, 1, ..., approaches a stationary distribution that can be computed recursively. Graphs of numerical simulations are included to exemplify the results proved.
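For intuition, here is a toy simulation of the two wager models under one plausible reading of the dynamics (points accrue each round and battles are decided by a fair coin); this is not the thesis's exact process.

```python
# Toy comparison of the knockout and winner-takes-all models: each round every
# player gains a point, then two random players battle and the loser is reset.
import random

def simulate(n_players=200, rounds=5000, winner_takes_all=True, seed=2):
    random.seed(seed)
    points = [0] * n_players
    for _ in range(rounds):
        for i in range(n_players):
            points[i] += 1                     # everyone keeps receiving points
        a, b = random.sample(range(n_players), 2)
        winner, loser = (a, b) if random.random() < 0.5 else (b, a)
        if winner_takes_all:
            points[winner] += points[loser]    # winner absorbs the loser's points
        points[loser] = 0                      # loser's points reset in both models
    return points

for wta in (False, True):
    pts = simulate(winner_takes_all=wta)
    label = "winner-takes-all" if wta else "knockout"
    print(label, "max points:", max(pts), "players at zero:", sum(p == 0 for p in pts))
```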
Contributors: VanKirk, Maxwell Joshua (Author) / Lanchier, Nicolas (Thesis director) / Foxall, Eric (Committee member) / Barrett, The Honors College (Contributor) / School of Mathematical and Statistical Sciences (Contributor)
Created: 2016-12