Search Content

Matching Items (2)

Filtering by

Creators: Barrett, The Honors College

Malware Analysis and Classification Framework: Detecting Financial Malware Using Machine Learning Techniques

Description

Malware that perform identity theft or steal bank credentials are becoming increasingly common and can cause millions of dollars of damage annually. A large area of research focus is the automated detection and removal of such malware, due to their large impact on millions of people each year. Such a detector will be beneficial to any industry that is regularly the target of malware, such as the financial sector. Typical detection approaches such as those found in commercial anti-malware software include signature-based scanning, in which malware executables are identified based on a unique signature or fingerprint developed for that malware. However, as malware authors continue to modify and obfuscate their malware, heuristic detection is increasingly popular, in which the behaviors of the malware are identified and patterns recognized. We explore a malware analysis and classification framework using machine learning to train classifiers to distinguish between malware and benign programs based upon their features and behaviors. Using both decision tree learning and support vector machines as classifier models, we obtained overall classification accuracies of around 80%. Due to limitations primarily including the usage of a small data set, our approach may not be suitable for practical classification of malware and benign programs, as evident by a high error rate.

ContributorsAnwar, Sajid (Co-author) / Chan, Tsz (Co-author) / Ahn, Gail-Joon (Thesis director) / Zhao, Ziming (Committee member) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)

Created2016-05

A model fusion based framework for imbalanced classification problem with noisy dataset

Description

Data imbalance and data noise often coexist in real world datasets. Data imbalance affects the learning classifier by degrading the recognition power of the classifier on the minority class, while data noise affects the learning classifier by providing inaccurate information and thus misleads the classifier. Because of these differences, data imbalance and data noise have been treated separately in the data mining field. Yet, such approach ignores the mutual effects and as a result may lead to new problems. A desirable solution is to tackle these two issues jointly. Noting the complementary nature of generative and discriminative models, this research proposes a unified model fusion based framework to handle the imbalanced classification with noisy dataset.

The phase I study focuses on the imbalanced classification problem. A generative classifier, Gaussian Mixture Model (GMM) is studied which can learn the distribution of the imbalance data to improve the discrimination power on imbalanced classes. By fusing this knowledge into cost SVM (cSVM), a CSG method is proposed. Experimental results show the effectiveness of CSG in dealing with imbalanced classification problems.

The phase II study expands the research scope to include the noisy dataset into the imbalanced classification problem. A model fusion based framework, K Nearest Gaussian (KNG) is proposed. KNG employs a generative modeling method, GMM, to model the training data as Gaussian mixtures and form adjustable confidence regions which are less sensitive to data imbalance and noise. Motivated by the K-nearest neighbor algorithm, the neighboring Gaussians are used to classify the testing instances. Experimental results show KNG method greatly outperforms traditional classification methods in dealing with imbalanced classification problems with noisy dataset.

The phase III study addresses the issues of feature selection and parameter tuning of KNG algorithm. To further improve the performance of KNG algorithm, a Particle Swarm Optimization based method (PSO-KNG) is proposed. PSO-KNG formulates model parameters and data features into the same particle vector and thus can search the best feature and parameter combination jointly. The experimental results show that PSO can greatly improve the performance of KNG with better accuracy and much lower computational cost.

ContributorsHe, Miao (Author) / Wu, Teresa (Thesis advisor) / Li, Jing (Committee member) / Silva, Alvin (Committee member) / Borror, Connie (Committee member) / Arizona State University (Publisher)

Created2014