Matching Items (63)
- All Subjects: Machine Learning
- Creators: Computer Science and Engineering Program
- Member of: Barrett, The Honors College Thesis/Creative Project Collection
Bots tamper with social media networks by artificially inflating the popularity of certain topics. In this paper, we define what a bot is, detail the different motivations for bots, describe previous work in bot detection and observation, and then perform bot detection of our own. For our bot detection, we are interested in bots on Twitter that tweet Arabic extremist-like phrases. A testing dataset is collected using the honeypot method, and five different heuristics are measured for their effectiveness in detecting bots. The model underperformed, but we have laid the groundwork for a largely untapped focus within bot detection: the diffusion of extremist ideals through bots.
With the development of technology, there has been a dramatic increase in the number of machine learning programs. These complex programs draw conclusions and can predict or perform actions based on models from previous runs or on input information. However, such programs require storing a very large amount of data. Queries allow users to extract only the information that is relevant to their investigation. The purpose of this thesis was to create a system with two important components: querying and visualization. Metadata was stored in Sedna as XML, and time series data was stored in OpenTSDB as JSON. To connect the two databases, the time series ID was stored as a metric in the XML metadata. Queries should be simple, flexible, and return all data that fits the query parameters. The query language used was an extension of XQuery FLWOR that added time series parameters. Visualizations should be easy to understand and organized so that important information and details are easy to find. Because a query may return a large amount of data, a multivariate heat map was used to visualize the time series results. The two programs that the system performed queries on were Energy Plus and Epidemic Simulation Data Management System. Such a system would make it easier for people in the projects' fields to find the relationships between metadata that lead to the desired results over time. Over the course of the thesis project, the overall software was completed; however, the software must still be optimized to handle the enormous amount of data expected from the system.
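The link between the two stores described above can be illustrated with a minimal sketch. It assumes a hypothetical XML layout (the `simulation`, `parameter`, and `metric` elements and the `tsid` attribute are invented for illustration) and uses an in-memory dictionary as a stand-in for the time series database; it does not use Sedna, OpenTSDB, or the thesis's extended XQuery syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata document; element and attribute names are assumptions.
METADATA_XML = """
<simulations>
  <simulation name="run-a">
    <parameter name="climate">hot-dry</parameter>
    <metric name="zone_temperature" tsid="ts-001"/>
  </simulation>
  <simulation name="run-b">
    <parameter name="climate">cold</parameter>
    <metric name="zone_temperature" tsid="ts-002"/>
  </simulation>
</simulations>
"""

# Stand-in for the time series store: ID -> list of (timestamp, value) points.
TIME_SERIES = {
    "ts-001": [(0, 21.5), (3600, 24.0), (7200, 26.5)],
    "ts-002": [(0, 18.0), (3600, 17.5), (7200, 17.0)],
}


def query(xml_text, store, climate, start, end):
    """Select simulations by a metadata value, then follow each metric's
    time series ID into the store and keep points within [start, end]."""
    root = ET.fromstring(xml_text)
    results = {}
    for sim in root.findall("simulation"):
        param = sim.find("parameter[@name='climate']")
        if param is None or param.text != climate:
            continue
        for metric in sim.findall("metric"):
            points = store.get(metric.get("tsid"), [])
            key = (sim.get("name"), metric.get("name"))
            results[key] = [(t, v) for t, v in points if start <= t <= end]
    return results


print(query(METADATA_XML, TIME_SERIES, "hot-dry", 0, 3600))
```

The metadata filter and the time window are applied in separate steps, which mirrors the idea of a metadata query extended with time series parameters.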
Twitter, the microblogging platform, has grown in prominence to the point that the topics that trend on the network are often the subject of the news and other traditional media. By predicting trends on Twitter, it could be possible to predict the next major topic of interest to the public. With this motivation, this paper develops a model for trends, leveraging previous work with k-nearest-neighbors and dynamic time warping. The development of this model provides insight into the length and features of trends, and the model successfully generalizes to identify 74.3% of trends in the time period of interest. The model developed in this work helps explain why particular words trend on Twitter.
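For readers unfamiliar with the combination of k-nearest-neighbors and dynamic time warping mentioned above, the following is a minimal sketch of the general technique. It is not the paper's model: the toy mention counts, the labels, and the function names are assumptions made for illustration.

```python
from collections import Counter


def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences,
    using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # repeat b[j-1]
                cost[i][j - 1],      # repeat a[i-1]
                cost[i - 1][j - 1],  # advance both
            )
    return cost[n][m]


def knn_predict(series, labelled, k=3):
    """Majority vote among the k labelled series closest under DTW."""
    nearest = sorted(labelled, key=lambda item: dtw_distance(series, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: mention counts per time step for a word.
train = [
    ([1, 2, 8, 20, 9, 3], "trend"),
    ([0, 1, 3, 9, 21, 8], "trend"),
    ([2, 2, 3, 2, 2, 3], "no_trend"),
    ([1, 1, 1, 2, 1, 1], "no_trend"),
    ([3, 2, 2, 3, 3, 2], "no_trend"),
]
print(knn_predict([1, 3, 10, 19, 7, 2], train, k=3))
```

Dynamic time warping lets two bursts with the same shape match even when one starts slightly later, which is why it is a natural distance for comparing activity curves.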
Food safety is vital to the well-being of society; therefore, it is important to inspect food products to ensure minimal health risks are present. A crucial phase of food inspection is the identification of foreign particles found in the sample, such as insect body parts. The presence of certain species of insects, especially storage beetles, is a reliable indicator of possible contamination during storage and food processing. However, the current approach to identifying species is visual examination by human analysts; this method is rather subjective and time-consuming. Furthermore, confident identification requires extensive experience and training. To aid this inspection process, we have developed, in collaboration with FDA analysts, image-analysis-based machine intelligence that achieves species identification with up to 90% accuracy. The current project is a continuation of this development effort. Here we present an image analysis environment that allows practical deployment of the machine intelligence on computers with limited processing power and memory. Using this environment, users can prepare input sets by selecting images for analysis, and inspect these images through the integrated pan, zoom, and color analysis capabilities. After species analysis, the results panel allows the user to compare the analyzed images with reference images of the proposed species. Further additions to this environment should include a log of previously analyzed images, and eventually extend to interaction with a central cloud repository of images through a web-based interface. Additional issues to address include standardizing the image layout, extending the feature-extraction algorithm, and using image classification to build a central search engine for widespread usage.
Due to the popularity of the movie industry, a film's opening weekend box-office performance is of great interest not only to movie studios, but to the general public, as well. In hopes of maximizing a film's opening weekend revenue, movie studios invest heavily in pre-release advertisement. The most visible advertisement is the movie trailer, which, in no more than two minutes and thirty seconds, serves as many people's first introduction to a film. The question, however, is how can we be confident that a trailer will succeed in its promotional task, and bring about the audience a studio expects? In this thesis, we use machine learning classification techniques to determine the effectiveness of a movie trailer in the promotion of its namesake. We accomplish this by creating a predictive model that automatically analyzes the audio and visual characteristics of a movie trailer to determine whether or not a film's opening will be successful by earning at least 35% of a film's production budget during its first U.S. box office weekend. Our predictive model performed reasonably well, achieving an accuracy of 68.09% in a binary classification. Accuracy increased to 78.62% when including genre in our predictive model.
Standardization is sorely lacking in the field of musical machine learning. This thesis project endeavors to contribute to this standardization by training three machine learning models on the same dataset and comparing them using the same metrics. The music-specific metrics utilized provide more relevant information for diagnosing the shortcomings of each model.
This project aims to incorporate sentiment analysis into traditional stock analysis to enhance stock rating predictions by drawing on opinions about various stocks gathered from the Internet. Headlines from eight major news publications and conversations from Yahoo! Finance’s “Conversations” feature were parsed through the Valence Aware Dictionary and sEntiment Reasoner (VADER) natural language processing package to determine numerical polarities which represented positivity or negativity for a given stock ticker. These generated polarities were paired with stock metrics typically observed by stock analysts as the feature set for a Logistic Regression machine learning model. The model was trained on roughly 1500 major stocks to determine a binary classification between a “Buy” or “Not Buy” rating for each stock, and the results of the model were inserted into the back-end of the Agora Web UI, which emulates search engine behavior specifically for stocks found in NYSE and NASDAQ. The model reported an accuracy of 82.5%, and for most major stocks, the model’s prediction correlated with stock analysts’ ratings. Given the volatility of the stock market and the propensity for hive-mind behavior in online forums, the performance of the Logistic Regression model would benefit from incorporating historical stock data and more sources of opinion to balance any subjectivity in the model.
This thesis project focuses on the creation and assessment of the "Simple Stocks" app, a straightforward investment tool specifically developed for people who are new to investing and find it challenging to comprehend the complexities of the stock market. We identified a significant gap in the availability of easy-to-understand resources and information for beginner investors, which led us to design an app that provides clear and simple data, professional advice from financial analysts, and an advanced machine learning feature to predict stock trends. The "Simple Stocks" app also incorporates a voting feature, allowing users to see what other investors think about specific stocks. This functionality not only helps users make informed decisions but also encourages a sense of community, as users can learn from each other's experiences and opinions. By creating a supportive environment, the app promotes a more approachable and enjoyable experience for those who are new to investing. Following the successful release of the "Simple Stocks" app on the App Store, our current objectives include expanding the user base and looking into various ways to generate income. One possible approach is to collaborate with other companies and establish an advertising-based revenue model, which would benefit both parties by attracting more users and increasing profits.
Historically, the predominant strategy for evaluating baseball pitchers has been through statistics created directly from the offensive production against the pitcher, such as ERA. Such statistics are inherently relative to the abilities and competition level of the opposing offense and the field defense, which the pitcher has no control over, making it difficult to compare pitchers across leagues. In this paper, I use cutting-edge pitch-tracking data to develop a pitch evaluation model that is intrinsic to the attributes of the pitches themselves, and not influenced directly by the outcomes of each individual pitch. I train four different classifiers to predict the probability of each pitch belonging to different subsets of outcomes, then multiply the probability of each outcome by that outcome’s average run value to arrive at an expected run value for the pitch. I compare the performance of each classifier to a baseline, examine the most impactful features, and compare the top pitchers identified by the model to those identified by a different baseball statistics resource, ultimately concluding that three of the four classification models are productive and that the overall intrinsic evaluation model accurately identifies the sport’s top performers.
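The expected run value described above is a probability-weighted sum over outcomes. A minimal sketch of that arithmetic follows; the outcome names, probabilities, and run values are made-up placeholders, not figures from the paper.

```python
def expected_run_value(outcome_probs, run_values):
    """Sum of P(outcome) * average run value of that outcome.

    outcome_probs: mapping from outcome name to predicted probability.
    run_values:    mapping from outcome name to its average run value.
    """
    total = sum(outcome_probs.values())
    if abs(total - 1.0) > 1e-6:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p * run_values[name] for name, p in outcome_probs.items())


# Hypothetical classifier output for one pitch and hypothetical run values.
probs = {"ball": 0.50, "strike": 0.25, "in_play": 0.25}
values = {"ball": 0.06, "strike": -0.08, "in_play": 0.04}
print(expected_run_value(probs, values))
```

From the pitcher's perspective a lower expected run value is better, so under these placeholder numbers a pitch with a higher strike probability would score more favorably.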
In 2018, Google researchers published the BERT (Bidirectional Encoder Representations from Transformers) model, which has since served as a starting point for hundreds of NLP (Natural Language Processing) related experiments and other derivative models. BERT was trained on masked language modelling and next-sentence prediction, but its capabilities extend to more common NLP tasks, such as language inference and text classification. Naralytics is a company that seeks to use natural language to sort users who create text into multiple categories, which is a modified version of classification. However, the text that Naralytics seeks to pull from exceeds the maximum length of 512 tokens that BERT supports. This report therefore discusses research into multiple BERT derivatives that seek to address this problem, and then implements a solution that addresses the multiple concerns attached to this kind of model.
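One common workaround for the 512-token limit is to split a long token sequence into overlapping windows and classify each window separately. The sketch below illustrates only that general idea; it is not the solution the report implements, and the function name, the default overlap, and the number of reserved special tokens are assumptions.

```python
def chunk_token_ids(token_ids, max_len=512, overlap=128, reserved=2):
    """Split token IDs into overlapping windows that fit a fixed-length model.

    max_len:  maximum sequence length the model accepts (512 for BERT).
    overlap:  number of tokens shared by consecutive windows (assumed value).
    reserved: slots kept free per window for special tokens such as
              [CLS] and [SEP], which the caller adds afterwards.
    """
    window = max_len - reserved
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need 0 <= overlap < max_len - reserved")
    if not token_ids:
        return []
    step = window - overlap
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break
        start += step
    return chunks


# Hypothetical document of 1200 token IDs.
chunks = chunk_token_ids(list(range(1200)))
print([len(c) for c in chunks])
```

Per-window predictions then need to be combined, for example by averaging class probabilities, and that aggregation step is one of the design choices that long-document BERT derivatives handle differently.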