Analysis of BoostOR: A Twitter Bot Detection Classification Algorithm

Description

The prevalence of bots, or automated accounts, on social media is a well-known problem. Bots harm social media users by, among other things, spreading misinformation, influencing topic discussions, and dispersing harmful links. Bots have affected disaster relief on social media as well: they prevent rescuers from determining which calls for help are credible, spread fake news and other malicious content, and generate large amounts of content that burdens rescuers attempting to provide aid in the aftermath of disasters. To address these problems, this research seeks to detect bots participating in disaster-related discussions and to increase the recall of Twitter bot detection methods, that is, the share of bots in the network that are detected and removed. Removing these bots will also prevent human users from accidentally interacting with them and being manipulated. To accomplish this goal, an existing bot detection classification algorithm known as BoostOR was employed. BoostOR is an ensemble learning algorithm, first introduced as an adjustment to existing ensemble classifiers to increase bot detection recall, and it has the potential to address the social media bot dilemma in which the data may contain several different types of bots. However, when BoostOR was tested on unobserved datasets, it did not perform as expected. This study attempts to improve the BoostOR algorithm by comparing it with a baseline classification algorithm, AdaBoost, and then discussing the intentional differences between the two. Additionally, this study presents the main factors that contribute to the shortcomings of the BoostOR algorithm and proposes a solution to improve it. These recommendations should ensure that the BoostOR algorithm can be applied to new and unobserved datasets in the future.
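
Since the abstract centers on recall, a minimal sketch of how recall is commonly computed for a bot detector may help. The labels below are hypothetical and are not taken from the thesis; the function name `recall` is likewise only illustrative.

```python
def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives (bots) that were predicted positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred)
                   if t == positive and p == positive)
    false_neg = sum(1 for t, p in zip(y_true, y_pred)
                    if t == positive and p != positive)
    actual_pos = true_pos + false_neg
    # With no actual positives, recall is undefined; 0.0 is returned by convention here.
    return true_pos / actual_pos if actual_pos else 0.0


# Hypothetical labels: 1 = bot, 0 = human.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
bot_recall = recall(y_true, y_pred)  # 3 of the 4 real bots are caught
```

A detector tuned for recall accepts more false alarms (humans flagged as bots) in exchange for missing fewer bots, which is the trade-off the abstract describes.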

Contributors: Davis, Matthew William (Author) / Liu, Huan (Thesis director) / Nazer, Tahora H. (Committee member) / Computer Science and Engineering Program (Contributor) / Department of Information Systems (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-12

Using Machine Learning to Predict the NBA

Description

Machine learning is one of the fastest growing fields, and it has applications in almost any industry. Predicting sports games is an obvious use case for machine learning: data is relatively easy to collect, generally complete data is available, and outcomes are easily measurable. Predicting the outcomes of sports events may also be profitable, since predictions can be taken to a sportsbook and wagered on. The goal of this project was to build a model using machine learning to predict the outcomes of NBA games.
In order to train the model, data was collected from the NBA statistics website. The model was trained on games dating from the 2010 NBA season through the 2017 NBA season. Three separate models were built: one predicting the winner, one predicting the total points, and one predicting the margin of victory for a team. These models learned on 80 percent of the data and were validated on the other 20 percent. They were trained for 40 epochs with a batch size of 15.
The model for predicting the winner achieved an accuracy of 65.61 percent, slightly below the accuracy of other experts in the field of predicting the NBA. The model for predicting total points performed decently as well, beating Las Vegas' prediction 50.04 percent of the time. The model for predicting margin of victory also did well, beating Las Vegas 50.58 percent of the time.
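
The 80/20 split and the batch size of 15 mentioned above can be illustrated with a short sketch. The row count, the random shuffle, and the seed are assumptions made for illustration; the abstract does not say how the games were ordered or split.

```python
import math
import random


def split_train_validation(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and split it into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def batches_per_epoch(n_rows, batch_size=15):
    """Number of mini-batches needed to see every training row once."""
    return math.ceil(n_rows / batch_size)


# Hypothetical stand-in for game records; the real dataset size is not given.
games = list(range(1000))
train_rows, val_rows = split_train_validation(games)
steps = batches_per_epoch(len(train_rows))
```

For game data ordered by date, a chronological split (train on earlier seasons, validate on later ones) is a common alternative to shuffling, because it avoids validating on games that precede the training games.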

Contributors: Garretson, Samuel Burkett (Author) / Meuth, Ryan (Thesis director) / Nakamura, Mutsumi (Committee member) / Computer Science and Engineering Program (Contributor) / Dean, W.P. Carey School of Business (Contributor) / Department of Management and Entrepreneurship (Contributor) / Barrett, The Honors College (Contributor)

Created: 2019-05

Using Natural Language Processing to Identify Questions and Answers Written by People Addicted to Opioids

Description

Background: Natural Language Processing (NLP) models have been trained to locate questions and answers in forum settings before, but on topics such as cancer and diabetes. Studies have also used filtering methods to understand themes in forum settings regarding opioid use. However, no studies have trained an NLP model to locate the questions people addicted to opioids are asking their peers and the answers they are receiving in forums. A variety of annotation tools are available to aid the data collection needed to train NLP models; for academic purposes, brat is the best of these tools. This study will inform clinical practice by showing clinicians what their patients who are addicted to opioids are thinking, so that clinicians can have more meaningful conversations during appointments, including ones that the patient may be too afraid to start.

Methods: The standard NLP process was used for this study in which a gold standard was reached through matched paired annotations of the forum text in brat and a neural network was trained on the content. Following the annotation process, adjudication occurred to increase the inter-annotator agreement. Categories were developed by local physicians to describe the questions and three pilots were run to test the best way to categorize the questions.

Results: The inter-annotator agreement, calculated via F-score, before adjudication for a 0.7 threshold was 0.378 for the annotation activity. After adjudication at a threshold of 0.7, the inter-annotator agreement increased to 0.560. Pilots 1, 2, and 3 of the categorization activity had an inter-annotator agreement of 0.375, 0.5, and 0.966 respectively.
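
As an illustration of agreement measured via F-score, the sketch below treats one annotator's spans as the reference and the other's as predictions. Exact span matching is a simplifying assumption: the 0.7 threshold reported above is not modeled here, and the span offsets are hypothetical.

```python
def f_score_agreement(spans_a, spans_b):
    """F1 agreement between two annotators' span sets (exact matches only)."""
    a, b = set(spans_a), set(spans_b)
    if not a or not b:
        return 0.0
    matched = len(a & b)
    precision = matched / len(b)  # share of B's spans that A also marked
    recall = matched / len(a)     # share of A's spans that B also marked
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (start, end) character offsets marked as questions.
annotator_a = [(0, 12), (20, 35), (40, 58), (70, 90)]
annotator_b = [(0, 12), (20, 35), (41, 58)]
agreement = f_score_agreement(annotator_a, annotator_b)
```

Because F1 is symmetric in precision and recall, swapping which annotator is the reference gives the same score.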

Discussion: The inter-annotator agreement of the annotation activity may have been low initially because the annotators were students who may not have been as invested in the project as necessary to annotate the text accurately. Also, everyone interprets the text slightly differently, which may have contributed to the differences in the matched pairs' annotations. The F-score variation for the categorization activity had to do partially with different delivery systems for the instructions and partially with the participants' area of study. The first pilot did not mandate the use of the original context located in brat, and the instructions were provided in the form of a downloadable document. The participants were computer science graduate students. The second pilot also had the instructions delivered via a document, but it was strongly suggested that the context be used to gain an understanding of the questions' meanings. The participants were also computer science graduate students who, upon discussing their results after the pilot, expressed that they did not have a good understanding of the medical jargon in the posts. The final pilot used a combination of students with and without a medical background, required use of the context, and included verbal instructions in combination with the written ones. The combination of these factors increased the F-score significantly. For a full-scale experiment, students with a medical background should be used to categorize the questions.

Contributors: Pawlik, Katie (Author) / Devarakonda, Murthy (Thesis director) / Murcko, Anita (Committee member) / Green, Ellen (Committee member) / College of Health Solutions (Contributor) / Barrett, The Honors College (Contributor)

Created: 2019-12

Predicting Outcome of a Pitch Given the Type of Pitch for any Baseball Scenario

Description

This thesis serves as a baseline for the potential for prediction through machine learning (ML) in baseball. Hopefully, it will also serve as motivation for future work to expand and reach the potential of sabermetrics, advanced Statcast data, and machine learning. The problem this thesis attempts to solve is predicting the outcome of a pitch. Given proper pitch data and situational data, is it possible to predict the result or outcome of a pitch? The outcome here goes beyond ball or strike: if the hitter puts the ball in play for a double, for example, this thesis shows how I attempted to predict that type of outcome. Before diving into my methods, I take a deep look into sabermetrics, advanced statistics, and the history of the two in Major League Baseball. After this, I describe my implemented machine learning experiment. First, I found a dataset suitable for training a pitch prediction model; I then analyzed the features and used some feature engineering to select a set of 16 features; and finally, I trained and tested a pair of ML models on the data. I used a decision tree classifier and a random forest classifier, each of which performed at around 60% accuracy. I also experimented with a neural network approach using a long short-term memory (LSTM) model to improve my score, but this approach came up short and requires more feature engineering to beat the simpler classifiers. In this thesis, I show examples of five hitters that I test the models on and the accuracy for each hitter. This work shows promise that advanced classification models (likely requiring more feature engineering) can provide even better prediction outcomes, perhaps with 70% accuracy or higher.
There is much potential for future work and to improve on this thesis, mainly through the proper construction of a neural network, more in-depth feature analysis/selection/extraction, and data visualization.

Contributors: Goodman, Avi (Author) / Bryan, Chris (Thesis director) / Hsiao, Sharon (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created: 2020-05

Machine Learning: A Sentiment Analysis of Customer Reviews

Description

Machine learning is the process of training a computer with algorithms to learn from data and make informed predictions. In a world where large amounts of data are constantly collected, machine learning is an important tool to analyze this data to find patterns and learn useful information from it. Machine learning applications expand to numerous fields; however, I chose to focus on machine learning with a business perspective for this thesis, specifically e-commerce.

The e-commerce market utilizes information to target customers and drive business. More and more online services have become available, allowing consumers to make purchases and interact with an online system. For example, Amazon is one of the largest Internet-based retail companies. As people shop through this website, Amazon gathers huge amounts of data on its customers from personal information to shopping history to viewing history. After purchasing a product, the customer may leave reviews and give a rating based on their experience. Performing analytics on all of this data can provide insights into making more informed business and marketing decisions that can lead to business growth and also improve the customer experience.
For this thesis, I have trained binary classification models on a publicly available product review dataset from Amazon to predict whether a review has a positive or negative sentiment. The sentiment analysis process includes analyzing and encoding the human language, then extracting the sentiment from the resulting values. In the business world, sentiment analysis provides value by revealing insights into customer opinions and their behaviors. In this thesis, I will explain how to perform a sentiment analysis and analyze several different machine learning models. The algorithms for which I compared the results are KNN, Logistic Regression, Decision Trees, Random Forest, Naïve Bayes, Linear Support Vector Machines, and Support Vector Machines with an RBF kernel.
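
The step of encoding human language into numeric values can be illustrated with a bag-of-words sketch. This is one common encoding and is assumed here for illustration; the abstract does not state which encoding the thesis used, and the reviews below are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token to a column index, in alphabetical order."""
    vocab = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(vocab)}


def encode(text, vocabulary):
    """Turn a text into a vector of token counts over the vocabulary."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # tokens unseen at vocabulary-building time are dropped
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical reviews, not drawn from the Amazon dataset.
reviews = ["Great product, great price", "Terrible product"]
vocabulary = build_vocabulary(reviews)
vectors = [encode(r, vocabulary) for r in reviews]
```

Vectors like these are what a binary classifier such as logistic regression or Naïve Bayes would take as input, paired with a positive or negative label per review.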

Contributors: Madaan, Shreya (Author) / Meuth, Ryan (Thesis director) / Nakamura, Mutsumi (Committee member) / Computer Science and Engineering Program (Contributor) / Dean, W.P. Carey School of Business (Contributor) / Barrett, The Honors College (Contributor)

Created: 2020-05

Wireless Machine-learning Enabled Reconfigurable "Button-type" Pressure Sensors for Gait Analysis

Description

This paper introduces a wireless, reconfigurable "button-type" pressure sensor system, enabled by machine learning, for gait analysis applications. The pressure sensor system consists of an array of independent button-type pressure sensing units interfaced with a remote computer. Each pressure sensing unit contains pressure-sensitive resistors, readout electronics, and a wireless Bluetooth module, which are assembled within a footprint of 40 × 25 × 6 mm³. The small-footprint, low-profile sensors are populated onto a shoe insole, like buttons, to collect temporal pressure data. The pressure sensing unit measures pressures up to 2,000 kPa while maintaining an error under 10%. The reconfigurable pressure sensor array reduces the total power consumption of the system by 50%, allowing an extended period of operation of up to 82.5 hrs. A robust machine learning program identifies the optimal pressure sensing units in any given configuration at an accuracy of up to 98%.

Contributors: Booth, Jayden Charles (Author) / Chae, Junseok (Thesis director) / Chen, Ang (Committee member) / Electrical Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-12

Utilizing Machine Learning Methods to Model Cryptocurrency

Description

Cryptocurrencies have become one of the most fascinating forms of currency and economics due to their fluctuating values and lack of centralization. This project attempts to use machine learning methods to effectively model in-sample data for Bitcoin and Ethereum using rule induction methods. The dataset is cleaned by removing entries with missing data. A new column is created to measure price difference, allowing a more accurate analysis of the change in price. Eight relevant variables are selected using cross validation: the total number of bitcoins, the total size of the blockchains, the hash rate, mining difficulty, revenue from mining, transaction fees, the cost of transactions, and the estimated transaction volume. The in-sample data is modeled using a simple tree fit, first with one variable and then with eight. Using all eight variables, the in-sample model and data have a correlation of 0.6822657. The in-sample model is improved by first applying bootstrap aggregation (also known as bagging) to fit 400 decision trees to the in-sample data using one variable. Then the random forests technique is applied to the data using all eight variables. This results in a correlation between the model and data of 9.9443413. The random forests technique is then applied to an Ethereum dataset, resulting in a correlation of 9.6904798. Finally, an out-of-sample model is created for Bitcoin and Ethereum using random forests, with a benchmark correlation of 0.03 for financial data. The correlation between the training model and the testing data for Bitcoin was 0.06957639, while for Ethereum the correlation was -0.171125. In conclusion, it is confirmed that cryptocurrencies can have accurate in-sample models by applying the random forests method to a dataset. Out-of-sample modeling is more difficult, though in some cases it does better than is typical for other forms of financial data.
It should also be noted that cryptocurrency data has similar properties to other related financial datasets, realizing future potential for system modeling for cryptocurrency within the financial world.
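
The price-difference column and the model-versus-data correlation described above can be sketched as follows. The sketch assumes Pearson's correlation coefficient, which always lies between -1 and 1; whether the thesis used exactly this measure is an assumption, and the prices are hypothetical.

```python
import math


def price_differences(prices):
    """Change in price between each pair of consecutive observations."""
    return [later - earlier for earlier, later in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily prices and hypothetical model predictions of the changes.
prices = [100.0, 105.0, 103.0, 110.0]
actual_changes = price_differences(prices)
predicted_changes = [4.0, -1.0, 6.0]
fit = pearson(actual_changes, predicted_changes)
```

Computing the correlation on data the model was fitted to measures in-sample fit; computing it on held-out data measures out-of-sample fit, which is the harder case the abstract reports.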

Contributors: Browning, Jacob Christian (Author) / Meuth, Ryan (Thesis director) / Jones, Donald (Committee member) / McCulloch, Robert (Committee member) / Computer Science and Engineering Program (Contributor) / School of Mathematical and Statistical Sciences (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-05

Comparison of sentiment analysis systems and an application in signed link prediction

Description

Social media sites are platforms on which individuals discuss a wide range of topics and share a huge amount of information about themselves and their interests. Much of this information is encoded in the unstructured text that users post on these types of sites. A considerable amount of work has been done on sentiment analysis of these sites to infer users' opinions and preferences. However, there is a gap: it may be difficult to infer how a user feels about particular pages or topics for which they have not conveyed their sentiment in an observable form. Collaborative filtering is a common method used to solve this problem with user data, but it has only infrequently been used with sentiment information to make inferences about users' preferences. In this paper we extend previous work on leveraging sentiment in collaborative filtering, specifically to approximate user sentiment and subsequently their vote for candidates in an online election. Sentiment is shown to be an effective tool for making these types of predictions in the absence of other, more explicit user preference information. In addition, we present an evaluation of sentiment analysis methods and tools that are used in state-of-the-art sentiment analysis systems in order to understand which of these methods to leverage in our experiments.

Contributors: Baird, James Daniel (Author) / Liu, Huan (Thesis director) / Wang, Suhang (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-05

Predicting the Outcome of UFC Fights Using Machine Learning Models

Description

Abstract: The Ultimate Fighting Championship, or UFC as it is commonly known, was founded in 1993 and has quickly built itself into the world's foremost authority on all things related to MMA (mixed martial arts). With pay-per-view and cable television deals in hand, the UFC has become a huge competitor in the sports market, rivaling the popularity of boxing for almost a decade. As with most other sports, the UFC has seen an influx of analytics and data science over the past five to seven years. We see this revolution in football with the broadcast first down markers, in basketball with tracking player movement, and in baseball with locating pitches for strikes and balls; now the UFC has partnered with the statistics company Fightmetric to provide in-depth statistical analysis of its fights. ESPN has its win probability metrics, and statistical predictive modeling has begun to spread throughout sports. All these stats were made to showcase information about a fighter that one wouldn't typically know, giving insight into how the fight might go. But can these fights be predicted? Building on prior research and on the thought processes of relevant research into other sports leagues, I sought to use the arsenal of statistical analyses done by Fightmetric, along with the official UFC fighter database, to answer the question of whether UFC fights could be predicted. Specifically, by using only data that would be known about a fighter prior to stepping into the cage, could I predict with any degree of certainty who was going to win the fight?

Contributors: Moorman, Taylor D. (Author) / Simon, Alan (Thesis director) / Simon, Phil (Committee member) / W.P. Carey School of Business (Contributor) / Department of Information Systems (Contributor) / Department of Management and Entrepreneurship (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-05

Data Management Behind Machine Learning

Description

This thesis dives into the world of artificial intelligence by exploring the functionality of a single layer artificial neural network through a simple housing price classification example while simultaneously considering its impact from a data management perspective on both the software and hardware level. To begin this study, the universally accepted model of an artificial neuron is broken down into its key components and then analyzed for functionality by relating back to its biological counterpart. The role of a neuron is then described in the context of a neural network, with equal emphasis placed on how it individually undergoes training and then for an entire network. Using the technique of supervised learning, the neural network is trained with three main factors for housing price classification, including its total number of rooms, bathrooms, and square footage. Once trained with most of the generated data set, it is tested for accuracy by introducing the remainder of the data-set and observing how closely its computed output for each set of inputs compares to the target value. From a programming perspective, the artificial neuron is implemented in C so that it would be more closely tied to the operating system and therefore make the collected profiler data more precise during the program's execution. The program is designed to break down each stage of the neuron's training process into distinct functions. In addition to utilizing more functional code, the struct data type is used as the underlying data structure for this project to not only represent the neuron but for implementing the neuron's training and test data. Once fully trained, the neuron's test results are then graphed to visually depict how well the neuron learned from its sample training set. Finally, the profiler data is analyzed to describe how the program operated from a data management perspective on the software and hardware level.
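
The supervised training of a single neuron on three housing features can be sketched as below. The thesis implemented its neuron in C; this sketch uses Python, and the step activation, the perceptron update rule, the feature scaling, and every data value are assumptions made for illustration rather than details from the thesis.

```python
def predict(weights, bias, features):
    """Step-activation neuron: fires (1) when the weighted sum is positive."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if z > 0 else 0


def train(samples, targets, learning_rate=0.1, max_epochs=1000):
    """Perceptron learning rule: nudge the weights after each misclassification."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for features, target in zip(samples, targets):
            delta = target - predict(weights, bias, features)
            if delta != 0:
                errors += 1
                weights = [w + learning_rate * delta * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * delta
        if errors == 0:  # every training sample is classified correctly
            break
    return weights, bias


def scale(rooms, bathrooms, square_feet):
    """Bring the three features to comparable ranges (hypothetical divisors)."""
    return [rooms / 10, bathrooms / 5, square_feet / 5000]


# Hypothetical houses: (rooms, bathrooms, square feet); 1 = higher price class.
houses = [(3, 1, 900), (4, 2, 1400), (5, 2, 1800),
          (7, 3, 2600), (8, 3, 3200), (9, 4, 4000)]
targets = [0, 0, 0, 1, 1, 1]
samples = [scale(*h) for h in houses]
weights, bias = train(samples, targets)
predictions = [predict(weights, bias, s) for s in samples]
```

This sketch only checks the neuron against its own training samples. The thesis instead holds back part of the data set for testing, which is the more meaningful measure of how well the neuron learned.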

Contributors: Richards, Nicholas Giovanni (Author) / Miller, Phillip (Thesis director) / Meuth, Ryan (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created: 2018-05
