Matching Items (11)

Malware Analysis and Classification Framework: Detecting Financial Malware Using Machine Learning Techniques

Description

Malware that performs identity theft or steals bank credentials is becoming increasingly common and can cause millions of dollars of damage annually. A large area of research focuses on the automated detection and removal of such malware, due to its large impact on millions of people each year. Such a detector would benefit any industry that is regularly the target of malware, such as the financial sector. Typical detection approaches, such as those found in commercial anti-malware software, include signature-based scanning, in which malware executables are identified based on a unique signature or fingerprint developed for that malware. However, as malware authors continue to modify and obfuscate their malware, heuristic detection, in which the behaviors of the malware are identified and patterns recognized, is increasingly popular. We explore a malware analysis and classification framework that uses machine learning to train classifiers to distinguish between malware and benign programs based upon their features and behaviors. Using both decision tree learning and support vector machines as classifier models, we obtained overall classification accuracies of around 80%. Due to limitations, primarily the use of a small data set, our approach may not be suitable for practical classification of malware and benign programs, as evidenced by a high error rate.

Contributors

Agent

Created

Date Created
  • 2016-05

Predicting demographic and financial attributes in a bank marketing dataset

Description

Banking institutions employ several marketing strategies to maximize new customer acquisition as well as current customer retention. Telemarketing is one such approach, in which individual customers are contacted by bank representatives with offers. These telemarketing strategies can be improved in combination with data mining techniques that allow prediction of customer information and interests. In this thesis, bank telemarketing data from a Portuguese banking institution were analyzed to determine the predictability of several client demographic and financial attributes and to find the most contributing factors in each. Data were preprocessed to ensure quality, and then data mining models were generated for the attributes with logistic regression, support vector machine (SVM) and random forest, using Orange as the data mining tool. Results were analyzed using precision, recall and F1 score.
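
The three evaluation measures named in this abstract can be illustrated with a minimal sketch. It is not taken from the thesis (which used Orange); the function name `precision_recall_f1` and the choice of which label counts as "positive" are assumptions made for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the class labeled `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    # Guard against empty denominators by reporting 0.0 in those cases.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Precision is the share of predicted positives that are correct, recall is the share of actual positives that are found, and F1 is their harmonic mean.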

Contributors

Agent

Created

Date Created
  • 2016

A model fusion based framework for imbalanced classification problem with noisy dataset

Description

Data imbalance and data noise often coexist in real-world datasets. Data imbalance affects the learning classifier by degrading the recognition power of the classifier on the minority class, while data noise affects the learning classifier by providing inaccurate information and thus misleading the classifier. Because of these differences, data imbalance and data noise have been treated separately in the data mining field. Yet such an approach ignores the mutual effects and as a result may lead to new problems. A desirable solution is to tackle these two issues jointly. Noting the complementary nature of generative and discriminative models, this research proposes a unified model fusion based framework to handle imbalanced classification with noisy datasets.

The phase I study focuses on the imbalanced classification problem. A generative classifier, the Gaussian Mixture Model (GMM), is studied; it can learn the distribution of the imbalanced data to improve the discrimination power on imbalanced classes. By fusing this knowledge into cost SVM (cSVM), a CSG method is proposed. Experimental results show the effectiveness of CSG in dealing with imbalanced classification problems.

The phase II study expands the research scope to include noisy datasets in the imbalanced classification problem. A model fusion based framework, K Nearest Gaussian (KNG), is proposed. KNG employs a generative modeling method, GMM, to model the training data as Gaussian mixtures and form adjustable confidence regions which are less sensitive to data imbalance and noise. Motivated by the K-nearest neighbor algorithm, the neighboring Gaussians are used to classify the testing instances. Experimental results show that the KNG method greatly outperforms traditional classification methods in dealing with imbalanced classification problems with noisy datasets.

The phase III study addresses the issues of feature selection and parameter tuning of the KNG algorithm. To further improve the performance of the KNG algorithm, a Particle Swarm Optimization based method (PSO-KNG) is proposed. PSO-KNG formulates model parameters and data features into the same particle vector and thus can search for the best feature and parameter combination jointly. The experimental results show that PSO can greatly improve the performance of KNG, with better accuracy and much lower computational cost.

Contributors

Agent

Created

Date Created
  • 2014

Comparison of feature selection methods for robust dexterous decoding of finger movements from the primary motor cortex of a non-human primate using support vector machine

Description

Robust and stable decoding of neural signals is imperative for implementing a useful neuroprosthesis capable of carrying out dexterous tasks. A nonhuman primate (NHP) was trained to perform combined flexions of the thumb, index and middle fingers in addition to individual flexions and extensions of the same digits. An array of microelectrodes was implanted in the hand area of the motor cortex of the NHP and used to record action potentials during finger movements. A Support Vector Machine (SVM) was used to classify which finger movement the NHP was making based upon action potential firing rates. The effect of four feature selection techniques (Wilcoxon signed-rank test, Relative Importance, Principal Component Analysis, and Mutual Information Maximization) was compared based on SVM classification performance. SVM classification was used to examine the functional parameters of (i) efficacy, (ii) endurance to simulated failure, and (iii) longevity of classification. The effect of using isolated-neuron and multi-unit firing rates was compared as the feature vector supplied to the SVM. The best classification performance was on post-implantation day 36. On this day, when using multi-unit firing rates, the worst classification accuracy resulted from features selected with the Wilcoxon signed-rank test (51.12 ± 0.65%) and the best classification accuracy resulted from Mutual Information Maximization (93.74 ± 0.32%). When using single-unit firing rates, the classification accuracy from the Wilcoxon signed-rank test was 88.85 ± 0.61% and from Mutual Information Maximization was 95.60 ± 0.52% (degrees of freedom = 10, level of chance = 10%).
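
As a rough illustration of mutual-information-based feature selection, the sketch below scores each feature by its mutual information with the class label and ranks features accordingly. The abstract does not describe the exact Mutual Information Maximization procedure, so this is only a generic ranking under stated assumptions: firing rates are assumed to have been binned into discrete levels beforehand, and the names `mutual_information` and `rank_features` are hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    pxy = Counter(zip(xs, ys))  # joint counts of (X, Y)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

def rank_features(columns, labels):
    """Return feature indices sorted by decreasing mutual information with labels."""
    scores = [mutual_information(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
```

A feature that perfectly tracks a balanced binary label scores 1 bit, while a feature independent of the label scores 0, so the ranking places the former first.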

Contributors

Agent

Created

Date Created
  • 2015

Analyzing the effects of Bollinger bands on the probability of stock options using support vector machines

Description

The purpose of this research is to efficiently analyze the data provided and to see whether a useful trend can be observed as a result. This trend can be used to analyze certain probabilities. Three main pieces of data are analyzed in this research: the value of δ for the call and put option, the %B value of the stock, and the amount of time until expiration of the stock option. The %B value is the most important. The purpose of analyzing the data is to see the relationship between the variables and, given certain values, the probability that the trade makes money. This result will be used in finding the probability that certain trades make money over a period of time.
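
The abstract does not define %B, so the sketch below uses the conventional Bollinger Bands definition: the position of the latest price between a lower and an upper band placed a fixed number of standard deviations around a moving average. The 20-period window, the 2-standard-deviation width, the use of the population standard deviation, and the name `percent_b` are all assumptions rather than details from the thesis.

```python
from statistics import mean, pstdev

def percent_b(prices, window=20, num_std=2.0):
    """%B of the latest price relative to Bollinger Bands over the last `window` prices."""
    if len(prices) < window:
        raise ValueError("need at least `window` prices")
    recent = prices[-window:]
    middle = mean(recent)                # middle band: simple moving average
    width = num_std * pstdev(recent)     # band half-width (population std, an assumption)
    upper, lower = middle + width, middle - width
    if upper == lower:
        raise ValueError("zero-width bands: prices are constant")
    # 0.0 means the price sits on the lower band, 1.0 on the upper band.
    return (prices[-1] - lower) / (upper - lower)
```

Values above 1 or below 0 indicate a price outside the bands, and 0.5 corresponds to a price at the moving average.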

Since options are so dependent on probability, this research specifically analyzes stock options rather than stocks themselves. Stock options have value like stocks, except that options are leveraged. The most common model used to calculate the value of an option is the Black-Scholes Model [1]. There are five main variables the Black-Scholes Model uses to calculate the overall value of an option. These variables are θ, δ, γ, v, and ρ. The variable θ is the rate of change in the price of the option due to time decay, δ is the rate of change of the option’s price due to the stock’s changing value, γ is the rate of change of δ, v represents the rate of change of the value of the option in relation to the stock’s volatility, and ρ represents the rate of change in value of the option in relation to the interest rate [2]. In this research, the %B value of the stock is analyzed along with the time until expiration of the option. All options have the same δ. This is because all the options analyzed in this experiment are less than two months from expiration and the value of δ reveals how far in or out of the money an option is.
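
To make the role of δ concrete, the following sketch computes the Black-Scholes δ of a European option on a non-dividend-paying stock. It is a textbook formula rather than code from the thesis, and the function names and the parameter names (`spot`, `strike`, `rate`, `vol`, `years`) are assumptions chosen for illustration.

```python
from math import erf, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_delta(spot, strike, rate, vol, years, kind="call"):
    """Black-Scholes delta for a European option with no dividends."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (vol * sqrt(years))
    call_delta = norm_cdf(d1)
    # Put delta follows from put-call parity: delta_put = delta_call - 1.
    return call_delta if kind == "call" else call_delta - 1.0
```

A call δ near 1 indicates an option deep in the money and a δ near 0 one far out of the money, which is consistent with the abstract's remark that δ reveals how far in or out of the money an option is.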

The machine learning technique used to analyze the data and the probability is support vector machines. Support vector machines analyze data that can be classified into one of two or more groups and attempt to find a pattern in the data to develop a model that reliably classifies similar, future data into the correct group. This is used to analyze the outcome of stock options.

Contributors

Agent

Created

Date Created
  • 2015

A new machine learning based approach to NASA's propulsion engine diagnostic benchmark problem

Description

The gas turbine engine for aircraft propulsion represents one of the most physics-complex and safety-critical systems in the world. Its failure diagnostic is challenging due to the complexity of the model system, the difficulty involved in practical testing, and the infeasibility of creating homogeneous diagnostic performance evaluation criteria for the diverse engine makes.

NASA has designed and publicized a standard benchmark problem for propulsion engine gas path diagnostics that enables comparisons among different engine diagnostic approaches. Some traditional model-based approaches and novel purely data-driven approaches, such as machine learning, have been applied to this problem.

This study focuses on a different machine learning approach to the diagnostic problem. Some of the most common machine learning techniques, such as support vector machine, multi-layer perceptron, and self-organizing map, are used to help gain insight into the different engine failure modes from the perspective of big data. They are organically integrated to achieve good performance based on a good understanding of the complex dataset.

The study presents a new hierarchical machine learning structure to enhance classification accuracy in NASA's engine diagnostic benchmark problem. The designed hierarchical structure produces an average diagnostic accuracy of 73.6%, which outperforms comparable studies that were most recently published.

Contributors

Agent

Created

Date Created
  • 2015

A semantic triplet based story classifier

Description

Text classification, in the artificial intelligence domain, is an activity in which text documents are automatically classified into predefined categories using machine learning techniques. An example of this is classifying uncategorized news articles into different predefined categories such as "Business", "Politics", "Education", "Technology", etc. In this thesis, a supervised machine learning approach is followed, in which a module is first trained with pre-classified training data and then the class of test data is predicted. Good feature extraction is an important step in the machine learning approach, and hence the main component of this text classifier is semantic triplet based features, in addition to traditional features like standard keyword based features and statistical features based on shallow parsing (such as density of POS tags and named entities). A triplet {Subject, Verb, Object} in a sentence is defined as a relation between subject and object, the relation being the predicate (verb). The triplet extraction process is a 5-step process that takes as its input corpus web text documents, each consisting of one or many paragraphs, from RSS feeds to lists of extremist websites. The input corpus feeds into the "Pronoun Resolution" step, which uses a heuristic approach to identify the noun phrases referenced by the pronouns. The next step, "SRL Parser", is a shallow semantic parser that converts the incoming pronoun-resolved paragraphs into annotated predicate-argument format. The output of the SRL parser is processed by the "Triplet Extractor" algorithm, which forms triplets of the form {Subject, Verb, Object}. Generalization and reduction of triplet features is the next step. Reduced feature representation reduces computing time, yields better discriminatory behavior and handles the curse of dimensionality. For training and testing, a ten-fold cross validation approach is followed. In each round, the SVM classifier is trained with 90% of the labeled (training) data and, in the testing phase, the classes of the remaining 10% unlabeled (testing) data are predicted. In conclusion, this paper proposes a model with semantic triplet based features for story classification. The effectiveness of the model is demonstrated against other traditional features used in the literature for text classification tasks.
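
The ten-fold protocol described above (train on 90% of the data, test on the remaining 10%, repeated so that every sample is tested once) can be sketched as follows. The helper name `k_fold_indices` is hypothetical, and the sketch splits samples in their given order; whether the thesis shuffled or stratified the data is not stated.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size
```

In each round a classifier would be fitted on `train_idx` and evaluated on `test_idx`, and the ten per-fold scores are then typically averaged.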

Contributors

Agent

Created

Date Created
  • 2013

Exploration of neural coding in rat's agranular medial and agranular lateral cortices during learning of a directional choice task

Description

Animals learn to choose a proper action among alternatives according to the circumstances. Through trial and error, animals improve their odds by making correct associations between their behavioral choices and external stimuli. While there is an extensive literature on the theory of learning, it is still unclear how individual neurons and a neural network adapt as learning progresses. In this dissertation, single units in the medial and lateral agranular (AGm and AGl) cortices were recorded as rats learned a directional choice task. The task required the rat to make a left/right side lever press if a light cue appeared on the left/right side of the interface panel. Behavior analysis showed that the rat's movement parameters during performance of directional choices became stereotyped very quickly (2-3 days), while learning to solve the directional choice problem took weeks to occur. The entire learning process was further broken down into 3 stages, each having a similar number of recording sessions (days). Single unit based firing rate analysis revealed that 1) directional rate modulation was observed in both cortices; 2) the averaged mean rate between left and right trials in the neural ensemble each day did not change significantly among the three learning stages; 3) the rate difference between left and right trials of the ensemble did not change significantly either. In addition, for either left or right trials, the trial-to-trial firing variability of single neurons did not change significantly over the three stages. To explore the spatiotemporal neural pattern of the recorded ensemble, support vector machines (SVMs) were constructed each day to decode the direction of choice in single trials. Improved classification accuracy indicated enhanced discriminability between neural patterns of left and right choices as learning progressed. When a restricted Boltzmann machine (RBM) model was used to extract features from neural activity patterns, the results further supported the idea that neural firing patterns adapted during the three learning stages to facilitate the neural codes of directional choices. Put together, these findings suggest a spatiotemporal neural coding scheme in a rat AGl and AGm neural ensemble that may be responsible for and contributing to learning the directional choice task.

Contributors

Agent

Created

Date Created
  • 2014

Learning from asymmetric models and matched pairs

Description

With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also study variable selection for matched data sets and propose a solution when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exist in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables.

Contributors

Agent

Created

Date Created
  • 2013

The detection of reliability prediction cues in manufacturing data from statistically controlled processes

Description

Many products undergo several stages of testing ranging from tests on individual components to end-item tests. Additionally, these products may be further "tested" via customer or field use. The later failure of a delivered product may in some cases be due to circumstances that have no correlation with the product's inherent quality. However, at times, there may be cues in the upstream test data that, if detected, could serve to predict the likelihood of downstream failure or performance degradation induced by product use or environmental stresses. This study explores the use of downstream factory test data or product field reliability data to infer data mining or pattern recognition criteria onto manufacturing process or upstream test data by means of support vector machines (SVM) in order to provide reliability prediction models. In concert with a risk/benefit analysis, these models can be utilized to drive improvement of the product or, at least, via screening to improve the reliability of the product delivered to the customer. Such models can be used to aid in reliability risk assessment based on detectable correlations between the product test performance and the sources of supply, test stands, or other factors related to product manufacture. As an enhancement to the usefulness of the SVM or hyperplane classifier within this context, L-moments and the Western Electric Company (WECO) Rules are used to augment or replace the native process or test data used as inputs to the classifier. As part of this research, a generalizable binary classification methodology was developed that can be used to design and implement predictors of end-item field failure or downstream product performance based on upstream test data that may be composed of single-parameter, time-series, or multivariate real-valued data. Additionally, the methodology provides input parameter weighting factors that have proved useful in failure analysis and root cause investigations as indicators of which of several upstream product parameters have the greater influence on the downstream failure outcomes.
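
As a hedged illustration of how WECO Rules might turn raw test measurements into classifier inputs, the sketch below flags the four basic Western Electric zone rules on a series of values. The abstract does not say how the rules were encoded as features, so the function name `weco_violations`, the rule numbering, and the returned structure are assumptions; the process center line and standard deviation are taken as known inputs.

```python
def weco_violations(values, center, sigma):
    """Return {rule_number: [indices where the rule fires]} for four basic WECO rules."""
    z = [(v - center) / sigma for v in values]  # distance from center in sigma units
    # (rule, points required, window length, zone threshold in sigmas):
    # 1: one point beyond 3 sigma; 2: two of three beyond 2 sigma;
    # 3: four of five beyond 1 sigma; 4: eight in a row on one side of center.
    specs = [(1, 1, 1, 3.0), (2, 2, 3, 2.0), (3, 4, 5, 1.0), (4, 8, 8, 0.0)]
    hits = {rule: [] for rule, _, _, _ in specs}
    for rule, need, window, limit in specs:
        for end in range(window - 1, len(z)):
            recent = z[end - window + 1:end + 1]
            # Points must fall on the same side of the center line to count together.
            above = sum(1 for s in recent if s > limit)
            below = sum(1 for s in recent if s < -limit)
            if above >= need or below >= need:
                hits[rule].append(end)
    return hits
```

Counts or indicator flags derived from such a dictionary could then be supplied to a classifier alongside, or in place of, the raw measurements.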

Contributors

Agent

Created

Date Created
  • 2011