Search Content

PyAntiPhish: A Python-Based Machine Learning Detector of Phishing Websites and An Examination of Relevant URL-Based Features

Description

Phishing is one of most common and effective attack vectors in modern cybercrime. Rather than targeting a technical vulnerability in a computer system, phishing attacks target human behavioral or emotional tendencies through manipulative emails, text messages, or phone calls. Through PyAntiPhish, I attempt to create my own version of an…

Phishing is one of most common and effective attack vectors in modern cybercrime. Rather than targeting a technical vulnerability in a computer system, phishing attacks target human behavioral or emotional tendencies through manipulative emails, text messages, or phone calls. Through PyAntiPhish, I attempt to create my own version of an anti-phishing solution, through a series of experiments testing different machine learning classifiers and URL features. With an end-goal implementation as a Chromium browser extension utilizing Python-based machine learning classifiers (those available via the scikit-learn library), my project uses a combination of Python, TypeScript, Node.js, as well as AWS Lambda and API Gateway to act as a solution capable of blocking phishing attacks from the web browser.

ContributorsYang, Branden (Author) / Osburn, Steven (Thesis director) / Malpe, Adwith (Committee member) / Ahn, Gail-Joon (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor)

Created2024-05

Applications of Machine Learning to Botanical Classification

Description

In the field of botany, it is often necessary for plants to be identified based on their phenotypical characteristics, whether in person or using previously collected image samples. This work can be tedious and challenging for a human botanist to complete, as datasets can be large and several species of…

In the field of botany, it is often necessary for plants to be identified based on their phenotypical characteristics, whether in person or using previously collected image samples. This work can be tedious and challenging for a human botanist to complete, as datasets can be large and several species of plants strongly resemble each other. Various machine learning techniques, both supervised and unsupervised, can address this task with varying degrees of accuracy and efficiency thanks to their ability to identify subtle patterns in data. The objective of this research is to both conduct a review of previous studies that measure the effectiveness of various machine learning methods for plant identification and to build and test various models to draw up a comparison of the accuracies and efficiencies of the set of techniques. A review of the existing literature found that any of the studied machine learning techniques can yield a high level of accuracy when used in the correct situations and on a suitable dataset. The results gathered from the models built from this research show that all else being equal, complex convolutional neural networks perform the best on this task, yielding an accuracy of 85.4% on the larger dataset. The other models tested in descending order of accuracy on the same dataset are k-nearest neighbors, random forest, k-means clustering, and a decision tree classifier.

ContributorsOlsen, Laela (Author) / Carter, Lynn Robert (Thesis director) / Bhargav, Vishnu (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor)

Created2024-05

Agora: Introducing the Internet's Opinion to Traditional Stock Analysis and Prediction.

Description

This project aims to incorporate the aspect of sentiment analysis into traditional stock analysis to enhance stock rating predictions by applying a reliance on the opinion of various stocks from the Internet. Headlines from eight major news publications and conversations from Yahoo! Finance’s “Conversations” feature were parsed through the Valence…

This project aims to incorporate the aspect of sentiment analysis into traditional stock analysis to enhance stock rating predictions by applying a reliance on the opinion of various stocks from the Internet. Headlines from eight major news publications and conversations from Yahoo! Finance’s “Conversations” feature were parsed through the Valence Aware Dictionary for Sentiment Reasoning (VADER) natural language processing package to determine numerical polarities which represented positivity or negativity for a given stock ticker. These generated polarities were paired with stock metrics typically observed by stock analysts as the feature set for a Logistic Regression machine learning model. The model was trained on roughly 1500 major stocks to determine a binary classification between a “Buy” or “Not Buy” rating for each stock, and the results of the model were inserted into the back-end of the Agora Web UI which emulates search engine behavior specifically for stocks found in NYSE and NASDAQ. The model reported an accuracy of 82.5% and for most major stocks, the model’s prediction correlated with stock analysts’ ratings. Given the volatility of the stock market and the propensity for hive-mind behavior in online forums, the performance of the Logistic Regression model would benefit from incorporating historical stock data and more sources of opinion to balance any subjectivity in the model.

ContributorsRamaraju, Venkat (Author) / Rao, Jayanth (Co-author) / Bansal, Ajay (Thesis director) / Smith, James (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor)

Created2021-12

Comparative Evaluation of Generative Machine Learning Models for Jazz Improvisation using Numerical Metrics

Description

Standardization is sorely lacking in the field of musical machine learning. This thesis project endeavors to contribute to this standardization by training three machine learning models on the same dataset and comparing them using the same metrics. The music-specific metrics utilized provide more relevant information for diagnosing the shortcomings of…

Standardization is sorely lacking in the field of musical machine learning. This thesis project endeavors to contribute to this standardization by training three machine learning models on the same dataset and comparing them using the same metrics. The music-specific metrics utilized provide more relevant information for diagnosing the shortcomings of each model.

ContributorsHilliker, Jacob (Author) / Li, Baoxin (Thesis director) / Libman, Jeffrey (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor)

Created2021-12

Hilliker Thesis Paper

ContributorsHilliker, Jacob (Author) / Li, Baoxin (Thesis director) / Libman, Jeffrey (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor)

Created2021-12

Missing Metrics Python Program

ContributorsHilliker, Jacob (Author) / Li, Baoxin (Thesis director) / Libman, Jeffrey (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor)

Created2021-12

Agora: Introducing the Internet’s Opinion to Traditional Stock Analysis and Prediction

Description

This project aims to incorporate the aspect of sentiment analysis into traditional stock analysis to enhance stock rating predictions by applying a reliance on the opinion of various stocks from the Internet. Headlines from eight major news publications and conversations from Yahoo! Finance’s “Conversations” feature were parsed through the Valence…

This project aims to incorporate the aspect of sentiment analysis into traditional stock analysis to enhance stock rating predictions by applying a reliance on the opinion of various stocks from the Internet. Headlines from eight major news publications and conversations from Yahoo! Finance’s “Conversations” feature were parsed through the Valence Aware Dictionary for Sentiment Reasoning (VADER) natural language processing package to determine numerical polarities which represented positivity or negativity for a given stock ticker. These generated polarities were paired with stock metrics typically observed by stock analysts as the feature set for a Logistic Regression machine learning model. The model was trained on roughly 1500 major stocks to determine a binary classification between a “Buy” or “Not Buy” rating for each stock, and the results of the model were inserted into the back-end of the Agora Web UI which emulates search engine behavior specifically for stocks found in NYSE and NASDAQ. The model reported an accuracy of 82.5% and for most major stocks, the model’s prediction correlated with stock analysts’ ratings. Given the volatility of the stock market and the propensity for hive-mind behavior in online forums, the performance of the Logistic Regression model would benefit from incorporating historical stock data and more sources of opinion to balance any subjectivity in the model.

ContributorsRao, Jayanth (Author) / Ramaraju, Venkat (Co-author) / Bansal, Ajay (Thesis director) / Smith, James (Committee member) / Barrett, The Honors College (Contributor) / Computer Science and Engineering Program (Contributor) / School of Mathematical and Statistical Sciences (Contributor)

Created2021-12

An Empirical Study of View Construction for Multi-View Learning

Description

Multi-view learning, a subfield of machine learning that aims to improve model performance by training on multiple views of the data, has been studied extensively in the past decades. It is typically applied in contexts where the input features naturally form multiple groups or views. An example of a naturally…

Multi-view learning, a subfield of machine learning that aims to improve model performance by training on multiple views of the data, has been studied extensively in the past decades. It is typically applied in contexts where the input features naturally form multiple groups or views. An example of a naturally multi-view context is a data set of websites, where each website is described not only by the text on the page, but also by the text of hyperlinks pointing to the page. More recently, various studies have demonstrated the initial success of applying multi-view learning on single-view data with multiple artificially constructed views. However, there lacks a systematic study regarding the effectiveness of such artificially constructed views. To bridge this gap, this thesis begins by providing a high-level overview of multi-view learning with the co-training algorithm. Co-training is a classic semi-supervised learning algorithm that takes advantage of both labelled and unlabelled examples in the data set for training. Then, the thesis presents a web-based tool developed in Python allowing users to experiment with and compare the performance of multiple view construction approaches on various data sets. The supported view construction approaches in the web-based tool include subsampling, Optimal Feature Set Partitioning, and the genetic algorithm. Finally, the thesis presents an empirical comparison of the performance of these approaches, not only against one another, but also against traditional single-view models. The findings show that a simple subsampling approach combined with co-training often outperforms both the other view construction approaches, as well as traditional single-view methods.

ContributorsAksoy, Kaan (Author) / Maciejewski, Ross (Thesis director) / He, Jingrui (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2019-12

Automated Vulnerability/Adversary Testing Using AI/ML Algorithms

Description

Vulnerability testing/evaluation is a regular task for cyber-security groups. Conducting tasks like this can take up a great amount of time and may not be perfect. Automating these tasks helps speed up the rate at which experts can test systems. However, script based or static programs that run automatically often…

Vulnerability testing/evaluation is a regular task for cyber-security groups. Conducting tasks like this can take up a great amount of time and may not be perfect. Automating these tasks helps speed up the rate at which experts can test systems. However, script based or static programs that run automatically often do not have the versatility required to properly replace human analysis. With the advances in Artificial Intelligence and Machine Learning, a utility can be developed that would allow for the creation of penetration testing plans rather than manually testing vulnerabilities. A variety of existing cyber-security programs and utilities provide an API layer that commonly interacts with the Python environment. With the commonality of AI/ML tools within the Python ecosystem, a plugin like interface can be developed to feed any AI/ML program real world data and receive a response/report in return. Using Python 2.7+, Python 3.6+, pymdptoolbox, and POMDPy, a program was developed that ingests real-world data from scanning tools and returned a suggested course of action to be used by analysts in order to perform a practical validation of the algorithms in a real world setting. This program was able to successfully navigate a test network and produce results that were expected to be found on the target machines without needing human analysis of the network. Using POMDP based systems for more cyber-security type tasks may be a valuable use case for future developments and help ease the burden faced in a rapid paced world.

ContributorsBelanger, Connor Lawrence (Author) / Huang, Dijiang (Thesis director) / Chowdhary, Ankur (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2020-05

Improving upon the State-of-the-Art in Multimodal Emotional Recognition in Dialogue

Description

Emotion recognition in conversation has applications within numerous domains such as affective computing and medicine. Recent methods for emotion recognition jointly utilize conversational data over several modalities including audio, video, and text. However, state-of-the-art frameworks for this task do not focus on the feature extraction and feature fusion steps of…

Emotion recognition in conversation has applications within numerous domains such as affective computing and medicine. Recent methods for emotion recognition jointly utilize conversational data over several modalities including audio, video, and text. However, state-of-the-art frameworks for this task do not focus on the feature extraction and feature fusion steps of this process. This thesis aims to improve the state-of-the-art method by incorporating two components to better accomplish these steps. By doing so, we are able to produce improved representations for the text modality and better model the relationships between all modalities. This paper proposes two methods which focus on these concepts and provide improved accuracy over the state-of-the-art framework for multimodal emotion recognition in dialogue.

ContributorsRawal, Siddharth (Author) / Baral, Chitta (Thesis director) / Shah, Shrikant (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2020-05

Filtering by