Matching Items (48)

Prescription Information Extraction from Electronic Health Records using BiLSTM-CRF and Word Embeddings

Description

Medical records are increasingly stored as electronic health records (EHRs), with a significant amount of patient data recorded as unstructured natural language text. Consequently, being able to extract and utilize the clinical data present within these records is an important step in furthering clinical care. One important aspect of these records is prescription information. Existing techniques for extracting prescription information (which includes medication names, dosages, frequencies, reasons for taking, and mode of administration) from unstructured text have focused on rule- and classifier-based methods. While state-of-the-art systems can be effective in extracting many types of information, they require significant effort to develop hand-crafted rules and conduct effective feature engineering. This paper presents a bidirectional LSTM with a CRF tagging layer, initialized with precomputed word embeddings, for extracting prescription information from sentences without requiring significant feature engineering. The experimental results, run on the i2b2 2009 dataset, achieve a macro-averaged F1 score of 0.8562, and scores above 0.9449 on four of the six categories, indicating significant potential for this model.
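To make the reported metric concrete, the following is a minimal sketch of a macro-averaged F1 computation over per-token tags. The tag names and the toy sequences are hypothetical, and the abstract does not state the exact evaluation protocol used on i2b2 2009, so this only illustrates the general idea of averaging per-category F1 scores with equal weight.

```python
from collections import Counter

def macro_f1(gold, predicted, labels):
    """Average the per-label F1 scores, weighting every label equally."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = []
    for label in labels:
        pred_total = tp[label] + fp[label]
        gold_total = tp[label] + fn[label]
        precision = tp[label] / pred_total if pred_total else 0.0
        recall = tp[label] / gold_total if gold_total else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical token-level tags; "O" marks tokens outside any field
# and is excluded from the average (an assumption of this sketch).
gold = ["medication", "dosage", "frequency", "O", "medication", "reason"]
pred = ["medication", "dosage", "O", "O", "medication", "dosage"]
fields = ["medication", "dosage", "frequency", "reason"]

score = macro_f1(gold, pred, fields)
print(round(score, 4))
```

Because every category counts equally, a few weak categories can pull the macro average well below the scores of the strongest ones, which is consistent with an overall figure lower than the per-category scores quoted for four of the six categories.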

Date Created
2018-05

Data Management Behind Machine Learning

Description

This thesis dives into the world of artificial intelligence by exploring the functionality of a single-layer artificial neural network through a simple housing price classification example, while simultaneously considering its impact from a data management perspective at both the software and hardware level. To begin this study, the universally accepted model of an artificial neuron is broken down into its key components and then analyzed for functionality by relating it back to its biological counterpart. The role of a neuron is then described in the context of a neural network, with equal emphasis placed on how an individual neuron undergoes training and how an entire network does. Using supervised learning, the neural network is trained on three main factors for housing price classification: a house's total number of rooms, number of bathrooms, and square footage. Once trained with most of the generated data set, it is tested for accuracy by introducing the remainder of the data set and observing how closely its computed output for each set of inputs compares to the target value. From a programming perspective, the artificial neuron is implemented in C so that it is more closely tied to the operating system, making the profiler data collected during the program's execution more precise. The program is designed to break down each stage of the neuron's training process into distinct functions. In addition to this function-based organization, the struct data type is used as the underlying data structure for this project, not only to represent the neuron but also to hold the neuron's training and test data. Once the neuron is fully trained, its test results are graphed to visually depict how well it learned from its sample training set. Finally, the profiler data is analyzed to describe how the program operated from a data management perspective at the software and hardware level.
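The train-then-test procedure described above can be sketched as follows. This is an illustrative Python version, not the thesis's C implementation: the sigmoid activation, the learning rate, the synthetic data generator, and the pricing rule are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, features):
    """Neuron output: weighted sum of the inputs passed through the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(samples, epochs=200, learning_rate=0.5):
    """Per-sample gradient descent on the cross-entropy loss (an assumed choice)."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, target in samples:
            error = predict(weights, bias, features) - target
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

rng = random.Random(0)

def make_house():
    """Generate one synthetic house; the labelling rule is hypothetical."""
    rooms = rng.randint(2, 10)
    baths = rng.randint(1, 4)
    sqft = rng.randint(600, 4000)
    features = [rooms / 10, baths / 4, sqft / 4000]  # scale inputs to [0, 1]
    target = 1.0 if sqft + 150 * rooms + 200 * baths > 3000 else 0.0
    return features, target

data = [make_house() for _ in range(200)]
train_set, test_set = data[:160], data[160:]  # most for training, rest for testing

weights, bias = train(train_set)
correct = sum((predict(weights, bias, f) >= 0.5) == (t == 1.0)
              for f, t in test_set)
accuracy = correct / len(test_set)
print(f"held-out accuracy: {accuracy:.2f}")
```

Keeping the prediction, training, and data generation in separate functions mirrors the stage-by-stage decomposition the abstract describes, which is also what makes per-stage profiling straightforward.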

Date Created
2018-05

Using Machine Learning to Predict the NBA

Description

Machine learning is one of the fastest growing fields, and it has applications in almost any industry. Predicting sports games is an obvious use case for machine learning: data is relatively easy to collect, generally complete data is available, and outcomes are easily measurable. Predicting the outcomes of sports events may also be profitable, since predictions can be taken to a sportsbook and wagered on. A successful prediction model could therefore turn a profit. The goal of this project was to build a model using machine learning to predict the outcomes of NBA games.
In order to train the model, data was collected from the NBA statistics website. The model was trained on games dating from the 2010 NBA season through the 2017 NBA season. Three separate models were built: one predicting the winner, one predicting the total points, and one predicting the margin of victory for a team. These models were trained on 80 percent of the data and validated on the other 20 percent, for 40 epochs with a batch size of 15.
The model for predicting the winner achieved an accuracy of 65.61 percent, slightly below the accuracy of experts in the field of predicting the NBA. The model for predicting total points performed decently as well: it beat Las Vegas’ prediction 50.04 percent of the time. The model for predicting margin of victory also did well, beating Las Vegas 50.58 percent of the time.
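The abstract does not define what it means for a model to "beat" the Las Vegas prediction. One plausible reading, sketched below, counts a game as a win for the model when its prediction lands closer to the actual result than the sportsbook line does. The function name and all of the numbers are hypothetical.

```python
def beat_rate(model_preds, vegas_lines, actuals):
    """Fraction of games where the model's absolute error is strictly smaller
    than the sportsbook line's absolute error (an assumed definition)."""
    wins = sum(
        abs(m - a) < abs(v - a)
        for m, v, a in zip(model_preds, vegas_lines, actuals)
    )
    return wins / len(actuals)

# Hypothetical total-points predictions for four games.
model_totals = [210.5, 198.0, 225.0, 204.0]
vegas_totals = [208.0, 201.5, 221.0, 206.5]
actual_totals = [214, 203, 219, 205]

rate = beat_rate(model_totals, vegas_totals, actual_totals)
print(f"model closer than the line in {rate:.0%} of games")
```

Under this reading, a rate near 50 percent means the model is roughly as accurate as the line, so figures such as 50.04 and 50.58 percent indicate near parity rather than a clear edge.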

Date Created
2019-05

Twitch Streamer-Game Recommender System

Description

Abstract
Matrix Factorization techniques have been proven to be more effective in recommender systems than standard user based or item based methods. Using this knowledge, Funk SVD and SVD++ are compared by the accuracy of their predictions of Twitch streamer data.

Introduction
As watching video games becomes more popular, more viewers are turning to Twitch.tv, an online platform where guests watch streamers play video games and interact with them. A streamer is a person who broadcasts themselves playing a video game or doing some other activity for an audience (the guests of the website). The site allows the guest to first select the game or category to view and then displays currently active streamers for the guest to select and watch. Twitch records the games that a streamer plays along with the amount of time that the streamer spends streaming each game. This is how the score is generated for a streamer’s game. These three terms form the streamer-game-score (user-item-rating) tuples that we use to train our models.
Our problem’s solution is similar in purpose to the Netflix Prize; however, instead of suggesting a movie to a user, the goal is to suggest a game to a user. We built a model to predict the score that a streamer will have for a game. The score field in our data is fundamentally different from a movie rating on Netflix because a user influences a game’s score by actively streaming it, not by giving it a score based on opinion. The dataset used is the Twitch.tv dataset provided by Isaac Jones [1]. The only data used in training the models is in the form of the streamer-game-score (user-item-rating) tuples. We test whether these data points, with their limited information, can give an accurate prediction of a streamer’s score for a game. SVD and SVD++ are the bases of the models being trained and tested. The Surprise library (scikit-surprise) for Python 3 is used to implement the models.
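To show how a model can be trained from nothing but streamer-game-score tuples, here is a minimal pure-Python sketch of Funk-style matrix factorization with bias terms, fitted by stochastic gradient descent. It is not the thesis's code, which relies on the Surprise library; the hyperparameters, the streamer and game names, and the 1-to-5 score scale are all assumptions made for illustration.

```python
import math
import random

def train_funk_svd(ratings, n_factors=2, epochs=300, lr=0.02, reg=0.02, seed=0):
    """Fit biases and latent factors to (user, item, score) tuples with SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mean = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    for _ in range(epochs):
        for u, i, r in ratings:
            pred = mean + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))
            err = r - pred
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)

    def predict(u, i):
        # Unknown streamers or games fall back to the global mean plus known biases.
        zeros = [0.0] * n_factors
        dot = sum(a * b for a, b in zip(p.get(u, zeros), q.get(i, zeros)))
        return mean + bu.get(u, 0.0) + bi.get(i, 0.0) + dot

    return predict, mean

def rmse(pairs):
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))

# Hypothetical streamer-game-score tuples.
ratings = [
    ("ann", "chess", 5.0), ("ann", "racing", 1.0),
    ("bob", "chess", 4.0), ("bob", "puzzle", 2.0),
    ("cat", "racing", 5.0), ("cat", "puzzle", 4.0),
    ("dan", "chess", 1.0), ("dan", "racing", 4.0),
]

predict, global_mean = train_funk_svd(ratings)
train_rmse = rmse([(predict(u, i), r) for u, i, r in ratings])
baseline_rmse = rmse([(global_mean, r) for _, _, r in ratings])
print(f"baseline RMSE {baseline_rmse:.3f}, factorized RMSE {train_rmse:.3f}")
```

SVD++ extends this model with an additional term built from the set of items each user has interacted with; that extension is omitted here. In practice, accuracy would be compared on held-out tuples rather than on the training data used in this toy comparison.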

Date Created
2019-05

Detecting Propaganda Bots on Twitter Using Machine Learning

Description

Propaganda bots are malicious bots on Twitter that spread divisive opinions and support political accounts. This project is based on detecting propaganda bots on Twitter using machine learning. Once I began to observe patterns within propaganda followers on Twitter, I determined that I could train algorithms to detect these bots. The paper focuses on my process of developing and training classifiers and using them to create a user-facing server that performs prediction functions automatically. The learning goals of this project centered on learning some form of machine learning architecture. I needed to learn some aspect of large data handling, as well as how to maintain these datasets for training use. I also needed to develop a server that would execute these functionalities on command. I wanted to be able to design a full-stack system that allowed me to create every aspect of a user-facing server that can execute predictions using the classifiers that I design.
Throughout this project, I set a number of learning goals that had to be met to consider it a success. I needed to learn how to use the supporting libraries that would help me to design this system. I also learned how to use the Twitter API, as well as how to create the infrastructure behind it that would allow me to collect large amounts of data for machine learning. I needed to become familiar with common machine learning libraries in Python in order to create the necessary algorithms and pipelines to make predictions based on Twitter data.
This paper details the steps and decisions needed to determine how to collect this data and apply it to machine learning algorithms. I determined how to create labelled data using pre-existing Botometer ratings, and the levels of confidence I needed to label data for training. I used the scikit-learn library to create the algorithms that best detect these bots. I used a number of pre-processing routines to refine the classifiers’ precision, including natural language processing and data analysis techniques. I eventually moved to remotely hosted versions of the system on Amazon Web Services (AWS) instances to collect larger amounts of data and train more advanced classifiers. This leads to the details of my final implementation of a user-facing server, hosted on AWS and interfacing over Gmail’s IMAP server.
The current and future development of this system is laid out. This includes more advanced classifiers, better data analysis, conversions to third-party Twitter data collection systems, and user features. I detail what I have learned from this exercise and what I hope to continue working on.
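The idea of labelling training data from Botometer ratings only when the rating is confident enough can be sketched as below. The thresholds, the assumption that ratings are normalized to the range 0 to 1, and the account names are all hypothetical; the abstract does not give the actual confidence levels used.

```python
def label_account(bot_score, bot_threshold=0.8, human_threshold=0.2):
    """Return "bot" or "human" for confident ratings, or None when the rating
    falls in the uncertain middle band (thresholds are assumed values)."""
    if bot_score >= bot_threshold:
        return "bot"
    if bot_score <= human_threshold:
        return "human"
    return None

# Hypothetical accounts with ratings assumed to be normalized to [0, 1].
scores = {"acct_a": 0.93, "acct_b": 0.10, "acct_c": 0.55, "acct_d": 0.81}

labelled = {}
for name, score in scores.items():
    label = label_account(score)
    if label is not None:  # drop ambiguous accounts from the training set
        labelled[name] = label

print(labelled)
```

Discarding the middle band trades training set size for label quality: stricter thresholds give cleaner labels but fewer examples for the classifiers to learn from.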

Date Created
2019-05

Deep Periodic Networks

Description

In the field of machine learning, reinforcement learning stands out for its ability to explore approaches to complex, high-dimensional problems that outperform even expert humans. For robotic locomotion tasks, reinforcement learning provides an approach to solving them without the need for unique controllers. In this thesis, two reinforcement learning algorithms, Deep Deterministic Policy Gradient and Group Factor Policy Search, are compared in the bipedal walking environment provided by OpenAI Gym. These algorithms are evaluated on their performance in the environment and their sample efficiency.
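Sample efficiency can be quantified in several ways, and the abstract does not say which one the thesis uses. One common choice, sketched here, is the number of environment steps an algorithm needs before its average return first reaches a target. The learning curves, the target value, and the labels `algo_a` and `algo_b` are hypothetical and do not represent results for either algorithm.

```python
def steps_to_threshold(learning_curve, threshold):
    """Return the first step count at which the average return reaches the
    threshold, or None if it never does. The curve is a list of
    (environment_steps, average_return) pairs in increasing step order."""
    for steps, avg_return in learning_curve:
        if avg_return >= threshold:
            return steps
    return None

# Hypothetical learning curves for two unnamed algorithms.
algo_a = [(10_000, -90.0), (50_000, 40.0), (100_000, 210.0), (200_000, 305.0)]
algo_b = [(10_000, -100.0), (50_000, 120.0), (100_000, 310.0), (200_000, 312.0)]

target = 300.0  # assumed target return for this illustration
for name, curve in (("algo_a", algo_a), ("algo_b", algo_b)):
    print(name, steps_to_threshold(curve, target))
```

An algorithm that reaches the target in fewer steps is the more sample-efficient one under this measure, even if both eventually reach similar final performance.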

Date Created
2018-12

Automatic Song Lyric Generation and Classification with Long Short-Term Networks

Description

Lyric classification and generation are trending topics in the machine learning community. Long Short-Term Memory networks (LSTMs) are effective tools for classifying and generating text. We explored their effectiveness in the generation and classification of lyrical data and proposed methods of evaluating their accuracy. We found that LSTM networks with dropout layers were effective at lyric classification. We also found that word-embedding LSTM networks were extremely effective at lyric generation.

Date Created
2019-05

Utilizing Machine Learning Methods to Model Cryptocurrency

Description

Cryptocurrencies have become one of the most fascinating forms of currency and economics due to their fluctuating values and lack of centralization. This project attempts to use machine learning methods to effectively model in-sample data for Bitcoin and Ethereum using rule induction methods. The dataset is cleaned by removing entries with missing data. A new column is created to measure price difference, allowing a more accurate analysis of the change in price. Eight relevant variables are selected using cross validation: the total number of bitcoins, the total size of the blockchains, the hash rate, mining difficulty, revenue from mining, transaction fees, the cost of transactions, and the estimated transaction volume. The in-sample data is modeled using a simple tree fit, first with one variable and then with eight. Using all eight variables, the in-sample model and data have a correlation of 0.6822657. The in-sample model is improved by first applying bootstrap aggregation (also known as bagging) to fit 400 decision trees to the in-sample data using one variable. Then the random forests technique is applied to the data using all eight variables. This results in a correlation between the model and data of 9.9443413. The random forests technique is then applied to an Ethereum dataset, resulting in a correlation of 9.6904798. Finally, an out-of-sample model is created for Bitcoin and Ethereum using random forests, with a benchmark correlation of 0.03 for financial data. The correlation between the training model and the testing data for Bitcoin was 0.06957639, while for Ethereum the correlation was -0.171125. In conclusion, it is confirmed that cryptocurrencies can have accurate in-sample models by applying the random forests method to a dataset. Out-of-sample modeling is more difficult, but in some cases the results exceed those typical for financial data.
It should also be noted that cryptocurrency data has similar properties to other related financial datasets, which suggests future potential for system modeling of cryptocurrency within the financial world.
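The bagging step and the correlation measure used above can be sketched in pure Python. This is a simplified illustration under stated assumptions: it fits depth-one regression trees (stumps) to a single synthetic variable instead of full decision trees on blockchain data, and it assumes the reported correlations are Pearson correlations between model predictions and observed values.

```python
import math
import random
import statistics

def fit_stump(xs, ys):
    """Fit a one-split regression tree: choose the split on x that minimizes
    squared error, and predict the mean of y on each side."""
    best = None
    for split in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < split]
        right = [y for x, y in zip(xs, ys) if x >= split]
        lm, rm = statistics.fmean(left), statistics.fmean(right)
        sse = (sum((y - lm) ** 2 for y in left)
               + sum((y - rm) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, split, lm, rm)
    if best is None:  # every x in the resample was identical
        m = statistics.fmean(ys)
        return xs[0], m, m
    return best[1:]

def bagged_predictor(xs, ys, n_trees=400, seed=0):
    """Bootstrap aggregation: fit one stump per resample, average the outputs."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))

    def predict(x):
        return statistics.fmean(lm if x < split else rm
                                for split, lm, rm in stumps)
    return predict

def pearson(a, b):
    """Pearson correlation coefficient, which always lies between -1 and 1."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Synthetic in-sample data (hypothetical): a noisy linear trend.
noise = random.Random(1)
xs = list(range(40))
ys = [0.5 * x + noise.gauss(0, 1.0) for x in xs]

predict = bagged_predictor(xs, ys)
r = pearson([predict(x) for x in xs], ys)
print(f"in-sample correlation: {r:.4f}")
```

Averaging many trees fitted to different resamples smooths out the abrupt jump of any single tree, which is why bagging typically raises the in-sample correlation relative to a single tree fit. A random forest adds one more ingredient, random selection of candidate variables at each split, which only matters when several variables are available.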

Date Created
2018-05

Predicting Sneaker Resale Prices using Machine Learning

Description

This thesis dives into the world of machine learning by attempting to create an application that will accurately predict whether or not a sneaker will resell at a profit. To begin this study, I first researched different machine learning algorithms to determine which would be best for this project. After ultimately deciding on an artificial neural network, I moved on to collecting data, using StockX and Twitter. StockX is a platform where individuals can post and resell shoes, and it also provides statistics and analytics about each pair of shoes. I used StockX to retrieve data about the actual shoe, which involved retrieving data for three of the network's feature variables: gender, brand, and retail price. Additionally, I retrieved the average deadstock price for each shoe, which is the mean price at which new, unworn shoes are selling on StockX. This data was used with the retail price data to determine whether or not a shoe has been, on average, selling for a profit. I used Twitter’s API to retrieve links to different shoes on StockX, along with the number of favorites and retweets each of those links had. These metrics were used to account for the ‘hype’ of the shoe, with shoes traditionally being more profitable the larger the hype surrounding them. After preprocessing the data, I trained the model using a randomized 80% of the data. On average, the model achieved an accuracy of about 65-70% when tested with the remaining 20% of the data. Once the model was optimized, I saved it and uploaded it to a web application that took in user input for the five feature variables, tested the datapoint using the model, and output the confidence in whether or not the shoe would generate a profit.
From a technical perspective, I used Python for the whole project, along with HTML/CSS for the front end of the application. As for key packages, I used Keras, an open-source neural network library, to build the model; data preprocessing was done using sklearn’s various subpackages. All charts and graphs were made using the data visualization libraries matplotlib and seaborn. These charts provided insight into what the final dataset looked like. They showed that the brand distribution was relatively close to what it should be, while the gender distribution was heavily skewed. Future work on this project would involve expanding the dataset, automating the entirety of the data retrieval process, and finally deploying the project on the cloud for users everywhere to use the application.
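The labelling and splitting steps described above can be sketched as follows. The rule that a shoe "sells at a profit" when its average deadstock price exceeds its retail price follows the abstract, but it ignores fees and shipping; the field names, the sample rows, and the split helper are hypothetical.

```python
import random

def sells_at_profit(retail_price, avg_deadstock_price):
    """Target label: 1 if the average resale price exceeds retail, else 0.
    Seller fees and shipping are ignored in this sketch."""
    return int(avg_deadstock_price > retail_price)

def split_80_20(rows, seed=0):
    """Shuffle a copy of the rows and split them 80/20 into train and test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows with the five feature variables plus the deadstock price.
rows = [
    {"gender": "men", "brand": "brand_x", "retail": 160.0,
     "favorites": 320, "retweets": 85, "deadstock": 240.0},
    {"gender": "women", "brand": "brand_y", "retail": 120.0,
     "favorites": 12, "retweets": 3, "deadstock": 95.0},
] * 5  # repeat to get ten rows for the split demonstration

for row in rows:
    row["profit"] = sells_at_profit(row["retail"], row["deadstock"])

train_rows, test_rows = split_80_20(rows)
print(len(train_rows), len(test_rows))
```

Categorical features such as gender and brand would still need to be encoded numerically before being fed to a neural network, which is the kind of preprocessing the abstract attributes to sklearn.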

Date Created
2019-05

Comparison of sentiment analysis systems and an application in signed link prediction

Description

Social media sites are platforms on which individuals discuss a wide range of topics and share a huge amount of information about themselves and their interests. Much of this information is encoded in unstructured text that users post on these types of sites. There has been a considerable amount of work done on sentiment analysis for these sites to infer users' opinions and preferences. However, there is a gap: it may be difficult to infer how a user feels about particular pages or topics for which they have not conveyed their sentiment in an observable form. Collaborative filtering is a common method used to solve this problem with user data, but it has only infrequently been used with sentiment information in order to make inferences about users' preferences. In this paper we extend previous work on leveraging sentiment in collaborative filtering, specifically to approximate user sentiment and subsequently their vote for candidates in an online election. Sentiment is shown to be an effective tool for making these types of predictions in the absence of other, more explicit user preference information. In addition, we present an evaluation of sentiment analysis methods and tools that are used in state-of-the-art sentiment analysis systems in order to understand which of these methods to leverage in our experiments.

Date Created
2018-05