Matching Items (18)

On the Application of Malware Clustering for Threat Intelligence Synthesis

Description

Malware forensics is a time-consuming process that involves a significant amount of data collection. To ease the load on security analysts, many attempts have been made to automate the intelligence gathering process and provide a centralized search interface. Some of these solutions map existing relations between threats and can discover new intelligence by identifying correlations in the data. However, such systems generally treat each unique malware sample as its own distinct threat. This fails to model the real malware landscape, in which many "new" samples are actually variants of samples that have already been discovered. Were there some way to reliably determine whether two malware samples belong to the same family, intelligence for one sample could be applied to any sample in the family, greatly reducing the complexity of intelligence synthesis. Clustering is a common big-data approach for grouping samples that share common features, and several recent papers have applied it to identify related malware. It therefore has the potential to simplify the intelligence synthesis process as described. However, existing threat intelligence systems do not use malware clustering. In this paper, we attempt to design a highly accurate malware clustering system, with the ultimate goal of integrating it into a threat intelligence platform. Toward this end, we explore the main considerations in designing such a system: how to extract features for comparing malware, and how to use those features for accurate clustering. We then create an experimental clustering system and evaluate its effectiveness using two different clustering algorithms.
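
As a concrete illustration of the family-grouping step (not the system built in this paper), the sketch below clusters hypothetical malware feature vectors with an off-the-shelf density-based clusterer; the features, data, and thresholds are assumptions for demonstration only.

```python
# A minimal sketch of the clustering step described above, not the
# paper's system: we assume numeric feature vectors have already been
# extracted from malware samples (e.g., counts of imported APIs).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Hypothetical per-sample feature vectors (rows = samples).
features = np.array([
    [12, 3, 0, 7],   # sample A
    [11, 3, 0, 8],   # sample A' (likely a variant of A)
    [0, 25, 9, 1],   # sample B
])

# Scale features so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(features)

# DBSCAN groups nearby samples into families and marks isolated
# samples as noise (label -1), so intelligence gathered for one
# member can be applied to the whole family.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)
print(labels)  # samples sharing a label are treated as one family
```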

Date Created
  • 2017-05

Embedded Feature Selection for Model-based Clustering

Description

Model-based clustering is a sub-field of statistical modeling and machine learning. Mixture models use probabilities to describe the degree to which a data point belongs to each cluster, and these probabilities are updated iteratively during clustering. While mixture models have demonstrated superior performance in handling noisy data in many fields, challenges remain for high-dimensional datasets. Among a large number of features, some may not actually contribute to delineating the cluster profiles; including these “noisy” features confuses the model's identification of the real cluster structure and costs more computational time. Recognizing this issue, in this dissertation I first propose a new feature selection algorithm for continuous datasets and then extend it to mixed data types. Finally, as the third topic, I conduct uncertainty quantification for the feature selection results.

The first topic is an embedded feature selection algorithm termed the Expectation-Selection-Maximization (ESM) model, which can automatically select features while optimizing the parameters of a Gaussian mixture model. To assist feature selection, I introduce a relevancy index (RI) revealing each feature's contribution to the clustering process. I demonstrate the efficacy of ESM by studying two synthetic datasets, four benchmark datasets, and an Alzheimer's Disease dataset.
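
The sketch below conveys the underlying intuition using a plain Gaussian mixture and a simple per-feature relevance score (between-component spread of the means over average within-component variance); it is an analogue for illustration only, not the published ESM algorithm or its relevancy index.

```python
# A minimal sketch of the idea behind ESM, not the authors' algorithm:
# fit a Gaussian mixture, then score each feature by how much the
# component means spread relative to within-component variance.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Two informative features plus one pure-noise feature.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)
X = np.hstack([X, np.random.default_rng(0).normal(size=(300, 1))])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Between-component spread of the means, per feature ...
between = gmm.means_.var(axis=0)
# ... divided by the average within-component variance, per feature.
within = gmm.covariances_.diagonal(axis1=1, axis2=2).mean(axis=0)
relevance = between / within
print(relevance)  # the noise feature should score near zero
```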

The second topic extends the ESM algorithm to handle mixed data types. The Gaussian mixture model is generalized to the Generalized Model of Mixture (GMoM), which can handle not only continuous features but also binary and nominal features.

The last topic is uncertainty quantification (UQ) of the feature selection. A new algorithm termed ESOM is proposed, which takes variance information into consideration while conducting feature selection. A set of outliers is also generated during the feature selection process to infer the uncertainty in the input data. Finally, the selected features and detected outlier instances are evaluated by visual comparison.

Date Created
  • 2020

Transportation Techniques for Geometric Clustering

Description

This thesis introduces new techniques for clustering distributional data according to their geometric similarities. The work builds upon the optimal transportation (OT) problem, which seeks the global minimum cost for matching distributional data, and leverages the connection between OT and power diagrams to solve different clustering problems. The OT formulation is based on a variational principle for differentiating hard cluster assignments, which was missing in the literature. This thesis presents multiple techniques to regularize and generalize OT to cope with various tasks, including clustering, aligning, and interpolating distributional data. It also discusses the connections of the new formulation to other OT and clustering formulations, to better understand their gaps and the means to close them. Finally, the thesis demonstrates the advantages of the proposed OT techniques in solving machine learning problems and their downstream applications in computer graphics, computer vision, and image processing.
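
For reference, the basic machinery this line of work builds on is entropy-regularized OT, which can be solved with the textbook Sinkhorn iterations sketched below; this is the standard algorithm, not the thesis's regularized or generalized formulations.

```python
# A minimal numpy sketch of entropy-regularized optimal transport
# solved with Sinkhorn iterations (the textbook algorithm).
import numpy as np

def sinkhorn(a, b, cost, eps=0.1, n_iter=200):
    """Transport plan between histograms a and b for a cost matrix."""
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):          # alternate the two scaling updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # plan with marginals (a, b)

# Two small distributions on the real line.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5])
cost = (x[:, None] - y[None, :]) ** 2     # squared-distance ground cost
plan = sinkhorn(np.full(3, 1 / 3), np.full(2, 1 / 2), cost)
print(plan.round(3))                      # rows sum to 1/3, columns to 1/2
```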

Date Created
  • 2020

Luminosity function of Lyman-alpha emitters at the reionization epoch: observations & theory

Description

Galaxies with a strong Lyman-alpha (Lya) emission line (also called Lya galaxies or emitters) offer a unique probe of the epoch of reionization, one of the important phases when most of the neutral hydrogen in the universe was ionized. In addition, Lya galaxies at high redshifts are a powerful tool to study low-mass galaxy formation. Since current observations suggest that reionization was complete by redshift z ~ 6, it is necessary to discover galaxies at z > 6 to use their luminosity function (LF) as a probe of reionization. I found five z = 7.7 candidate Lya galaxies with line fluxes > 7x10^-18 erg/s/cm^2, from three different deep near-infrared (IR) narrowband (NB) imaging surveys in a volume > 4x10^4 Mpc^3. From the spectroscopic followup of four candidate galaxies, and with the current spectroscopic sensitivity, the detection of only the brightest candidate galaxy can be ruled out at the 5 sigma level. Moreover, these observations successfully demonstrate that the sensitivity necessary for both the NB imaging and the spectroscopic followup of z ~ 8 Lya galaxies can be reached with current instrumentation. While future, more sensitive spectroscopic observations are necessary, the observed Lya LF at z = 7.7 is consistent with the z = 6.6 LF, suggesting that the intergalactic medium (IGM) is relatively ionized even at z = 7.7, with a neutral fraction x_HI <= 30%. On the theoretical front, while several models of Lya emitters have been developed, the physical nature of Lya emitters is not yet completely known. Moreover, the complexity of multi-parameter models necessitates a simpler model. I have developed a simple, single-parameter model to populate dark matter halos with Lya emitters. The central tenet of this model, different from many earlier models, is that the star-formation rate (SFR), and hence the Lya luminosity, is proportional to the mass accretion rate rather than the total halo mass. This simple model successfully reproduces many observables, including LFs, stellar masses, SFRs, and the clustering of Lya emitters from z ~ 3 to z ~ 7. Finally, using this model, I find that the mass accretion, and hence the star formation, in > 30% of Lya emitters at z ~ 3 occurs through major mergers, and this fraction increases to ~ 50% at z ~ 7.
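
Stated as a formula, one way to write the single-parameter model described above is the chain of proportionalities below; the symbols are illustrative notation, not necessarily the dissertation's.

```latex
% A sketch of the single-parameter model: star formation tracks the
% halo mass accretion rate, and Lya luminosity tracks star formation,
% so a single proportionality constant c remains as the free parameter.
\mathrm{SFR} \propto \dot{M}_h ,
\qquad
L_{\mathrm{Ly}\alpha} \propto \mathrm{SFR}
\quad\Longrightarrow\quad
L_{\mathrm{Ly}\alpha} = c \, \dot{M}_h
```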

Date Created
  • 2011

On feature selection stability: a data perspective

Description

The rapid growth of high-throughput technologies over the last few decades has made manual processing of the generated data impracticable. Even worse, machine learning and data mining techniques can seem paralyzed by these massive datasets. High dimensionality is one of the most common challenges for machine learning and data mining tasks. Feature selection aims to reduce dimensionality by selecting a small subset of the features that performs at least as well as the full feature set. Generally, learning performance (e.g., classification accuracy) and algorithm complexity are used to measure the quality of a selection algorithm. Recently, the stability of feature selection algorithms has gained increasing attention as a new indicator, due to the need to select similar subsets of features each time the algorithm is run on the same dataset, even in the presence of a small amount of perturbation. To cure the selection stability issue, we must first understand the causes of instability. In this dissertation, we investigate the causes of instability in high-dimensional datasets using well-known feature selection algorithms. We find that stability is mostly data-dependent. Based on these findings, we propose a framework to improve selection stability by addressing its main causes. In particular, we found that data noise greatly impacts both stability and learning performance, so we propose to reduce it in order to improve both. However, current noise reduction approaches cannot distinguish between data noise and natural variation among samples from different classes. We overcome this limitation with Supervised noise reduction via Low Rank Matrix Approximation (SLRMA). The proposed framework has proved successful on different types of high-dimensional datasets, such as microarray and image datasets. However, this framework cannot handle unlabeled data; hence, we propose Local SVD to overcome this limitation.
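
One common way to quantify the selection stability discussed above (a sketch for intuition, not the dissertation's framework) is to rerun a selector on slightly perturbed subsamples and compare the selected subsets with the Jaccard index:

```python
# A minimal stability measurement: run a feature selector on perturbed
# subsamples and compare the selected feature subsets pairwise.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
rng = np.random.default_rng(0)

subsets = []
for _ in range(5):
    idx = rng.choice(len(X), size=180, replace=False)   # small perturbation
    sel = SelectKBest(f_classif, k=10).fit(X[idx], y[idx])
    subsets.append(set(np.flatnonzero(sel.get_support())))

# Average pairwise Jaccard similarity: 1.0 means perfectly stable selection.
pairs = [(s, t) for i, s in enumerate(subsets) for t in subsets[i + 1:]]
stability = np.mean([len(s & t) / len(s | t) for s, t in pairs])
print(round(stability, 3))
```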

Date Created
  • 2013

Spatio-temporal data mining to detect changes and clusters in trajectories

Description

With the rapid development of mobile sensing technologies like GPS, RFID, and smartphone sensors, capturing position data in the form of trajectories has become easy. Moving-object trajectory analysis is a growing area of interest owing to its applications in domains such as marketing, security, and traffic monitoring and management. To better understand movement behaviors from raw mobility data, this doctoral work provides analytic models for analyzing trajectory data. As a first contribution, a model is developed to detect changes in trajectories over time. If the taxis moving in a city are viewed as sensors that provide real-time information about the city's traffic, a change in these trajectories over time can reveal that the road network has changed. To detect changes, trajectories are modeled with a Hidden Markov Model (HMM). A modified training algorithm for parameter estimation in HMMs, called m-BaumWelch, is used to develop likelihood estimates under assumed changes and to detect changes in trajectory data over time. Data from vehicles are used to test the method for change detection. Secondly, sequential pattern mining is used to develop a model that detects changes in frequent patterns occurring in trajectory data. The aim is to answer two questions: are the frequent patterns still frequent in the new data, and if so, has the time-interval distribution in the pattern changed? Two approaches are considered for change detection: a frequency-based approach and a distribution-based approach. The methods are illustrated with vehicle trajectory data. Finally, a model is developed for clustering and outlier detection in semantic trajectories. One challenge with clustering semantic trajectories is that both numeric and categorical attributes are present; another is that trajectories can be of different lengths and can have missing values. A tree-based ensemble is used to address these problems. The approach is extended to outlier detection in semantic trajectories.
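
As a rough illustration of the likelihood-based change detection idea in the first contribution, the sketch below fits a standard HMM (hmmlearn's ordinary Baum-Welch, not the dissertation's m-BaumWelch variant) on a reference window and flags a new window whose per-point log-likelihood drops sharply; the data and threshold are assumptions.

```python
# Likelihood-based change detection with a standard HMM: a drop in the
# per-point log-likelihood of new data under the old model signals change.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, size=(200, 2))    # old (lat, lon) offsets
new_window = rng.normal(loc=3.0, size=(50, 2))    # shifted traffic pattern

model = GaussianHMM(n_components=2, random_state=0).fit(reference)

ref_ll = model.score(reference) / len(reference)  # per-point baseline
new_ll = model.score(new_window) / len(new_window)
if new_ll < ref_ll - 2.0:      # threshold is a tunable assumption
    print("change detected in trajectory behaviour")
```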

Date Created
  • 2012

Leveraging collective wisdom in a multilabeled blog categorization environment

Description

One of the most remarkable outcomes of the web's evolution into Web 2.0 has been the propelling of blogging into a widely adopted and globally accepted phenomenon. While the unprecedented growth of the Blogosphere has added diversity and enriched the media, it has also added complexity. To cope with the relentless expansion, many enthusiastic bloggers have embarked on voluntarily writing, tagging, labeling, and cataloguing their posts in hopes of reaching the widest possible audience. Unbeknownst to them, this reaching-for-others process triggers the generation of a new kind of collective wisdom, a result of shared collaboration and the exchange of ideas, purposes, and objectives through the formation of associations, links, and relations. Understanding the Blogosphere can greatly help meet the needs of the ever-growing number of these users, as well as producers, service providers, and advertisers, by facilitating the categorization and navigation of this vast environment. This work explores a novel method to leverage the collective wisdom in the infused label space for blog search and discovery. The work demonstrates that the wisdom space provides a unique and desirable framework in which to discover the highly sought-after background information that aids in building classifiers. This insight is incorporated into the construction of a better clustering of blogs, which boosts the performance of classifiers for identifying more relevant labels for blogs, and offers a mechanism for replacing spurious labels and mislabels in a multi-labeled space.
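
A minimal sketch of the label-space idea, not the thesis's method: represent each blog by its user-assigned tags, cluster blogs in that label space, and pass cluster membership to downstream classifiers as background knowledge. The blogs and tags below are hypothetical.

```python
# Cluster blogs by their tag (label) space; the resulting cluster ids
# can serve as extra background features for multi-label classifiers.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

blogs = [
    "python machine-learning tutorials",
    "data-science python statistics",
    "travel photography hiking",
    "hiking gear travel",
]
label_space = TfidfVectorizer().fit_transform(blogs)  # tags as features
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(label_space)
print(clusters)  # blogs 0-1 and 2-3 should land in separate clusters
```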

Date Created
  • 2015

TensorDB and tensor-relational model (TRM) for efficient tensor-relational operations

Description

Multidimensional data have various representations. Thanks to their simplicity in modeling multidimensional data and the availability of various mathematical tools (such as tensor decompositions) that support multi-aspect analysis of such data, tensors are increasingly being used in many application domains, including scientific data management, sensor data management, and social network data analysis. The relational model, on the other hand, enables semantic manipulation of data using relational operators such as projection, selection, Cartesian product, and set operators. For many multidimensional data applications, tensor operations as well as relational operations need to be supported throughout the data life cycle. In this thesis, we introduce a tensor-based relational data model (TRM), which enables both tensor-based data analysis and relational manipulation of multidimensional data, and define tensor-relational operations on this model. We then introduce a tensor-relational data management system, called TensorDB, built on TRM, which brings together relational algebraic operations (for data manipulation and integration) and tensor algebraic operations (for data analysis). We develop optimization strategies for tensor-relational operations in both in-memory and in-database TensorDB. The goal of TRM and TensorDB is to serve as a single environment that supports the entire life cycle of data; that is, data can be manipulated, integrated, processed, and analyzed.
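
The sketch below illustrates the spirit of mixing relational and tensor operations (it is not TensorDB's implementation or API): a relational selection followed by a tensor analysis step on the result. The schema and data are hypothetical.

```python
# Relational selection, then a tensor operation on the selected data:
# here, a truncated SVD of one unfolding (a Tucker-1 style analysis).
import numpy as np
import pandas as pd

# Hypothetical (user, item, time, value) measurements.
df = pd.DataFrame({
    "user": np.repeat(np.arange(4), 6),
    "item": np.tile(np.repeat(np.arange(3), 2), 4),
    "time": np.tile(np.arange(2), 12),
    "value": np.random.default_rng(0).normal(size=24),
})

# Relational selection, then pivot the result into a dense 4x3x2 tensor.
selected = df[df["value"] > -1.0]
tensor = np.zeros((4, 3, 2))
tensor[selected["user"], selected["item"], selected["time"]] = selected["value"]

# Tensor analysis: truncated SVD of the mode-1 (user) unfolding.
unfolding = tensor.reshape(4, -1)
U, s, _ = np.linalg.svd(unfolding, full_matrices=False)
print(U[:, :2] * s[:2])   # 2-component embedding of the users
```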

Date Created
  • 2014

Revealing microbial responses to environmental dynamics: developing methods for analysis and visualization of complex sequence datasets

Description

The greatest barrier to understanding how life interacts with its environment is the complexity in which biology operates. In this work, I present experimental designs, analysis methods, and visualization techniques to overcome the challenges of deciphering complex biological datasets. First, I examine an iron limitation transcriptome of Synechocystis sp. PCC 6803 using a new methodology. Until now, iron limitation in gene expression experiments on Synechocystis sp. PCC 6803 has been achieved through media chelation. Notably, chelation also reduces the bioavailability of other metals, whereas naturally occurring low-iron settings likely result from a lack of iron influx rather than from chelation. The overall metabolic trends of previous studies are well characterized, but within those trends is significant variability in single-gene expression responses. I compare previous transcriptomics analyses with a protocol that limits the addition of bioavailable iron to the growth media, in order to identify consistent gene expression signals resulting from iron limitation. Second, I describe a novel method for improving the reliability of centroid-linkage clustering results. The size and complexity of modern sequencing datasets often prohibit constructing distance matrices, which prevents the use of many common clustering algorithms. Centroid linkage circumvents the need for a distance matrix, but has the adverse effect of producing input-order-dependent results. In this chapter, I describe a method of counting cluster edges across iterated centroid-linkage results and reconstructing aggregate clusters from a ranked edge list, without a distance matrix or input-order dependence. Finally, I introduce dendritic heat maps, a new figure type that visualizes heat map responses across expanding and contracting sequence clustering specificities. Heat maps are useful for comparing data across a range of possible states. However, data binning is sensitive to clustering cutoffs, which are often arbitrarily introduced by researchers and can substantially change the heat map response of any single data point. With an understanding of how the architectural elements of dendrograms and heat maps affect data visualization, I have integrated their salient features to create a figure type aimed at viewing multiple levels of clustering cutoffs, allowing researchers to better understand the effects of environment on metabolism or phylogenetic lineages.
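
A minimal sketch of the edge-counting idea above (not the chapter's algorithm): run an order- or seed-sensitive clusterer several times, count how often each pair of points co-clusters, and rebuild aggregate clusters from the ranked edge list with a union-find; the consensus threshold below is an assumption.

```python
# Consensus clustering via co-clustering edge counts: the aggregate
# result no longer depends on any single run's input order or seed.
import itertools
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

counts = {}
for seed in range(10):                       # iterate the clusterer
    labels = KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    for i, j in itertools.combinations(range(len(X)), 2):
        if labels[i] == labels[j]:
            counts[(i, j)] = counts.get((i, j), 0) + 1

parent = list(range(len(X)))                 # union-find over points
def find(i):
    while parent[i] != i:
        i = parent[i]
    return i

# Merge pairs in order of decreasing co-clustering support.
for (i, j), c in sorted(counts.items(), key=lambda kv: -kv[1]):
    if c >= 8:                               # consensus threshold (assumed)
        parent[find(i)] = find(j)

aggregate = [find(i) for i in range(len(X))] # input-order-independent labels
```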

Date Created
  • 2017

Machine Learning Models for High-dimensional Biomedical Data

Description

Recent technological advances enable the collection of complex, heterogeneous, and high-dimensional data in biomedical domains. The increasing availability of high-dimensional biomedical data creates the need for new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover patterns, and improve decision making. All the proposed methods can generalize to other industrial fields.

The first topic of this dissertation focuses on data clustering, often the first step in analyzing a dataset without label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.
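
A sketch of tree-ensemble clustering in general (not CRAFTER itself): embed rows via the leaves of random trees, then cluster the embedding. Because trees split on raw feature values, mixed-type data can be handled without hand-tuned distance metrics.

```python
# Tree-ensemble clustering: rows that fall into the same leaves across
# many random trees are similar, so cluster the one-hot leaf embedding.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

leaves = RandomTreesEmbedding(n_estimators=50, random_state=0).fit_transform(X)
labels = AgglomerativeClustering(n_clusters=3).fit_predict(leaves.toarray())
print(np.bincount(labels))   # cluster sizes
```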

The second part of this dissertation develops data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed representation, Bag-of-Segments, summarizes the key characteristics of a genome sequence into a small number of features with good interpretability.
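
The bag-of-words analogue below conveys the general idea (the data and codebook size are assumptions, and this is not the dissertation's exact method): cut a signal into segments, quantize each segment against a learned codebook, and represent the whole sequence as a histogram of codeword counts.

```python
# A bag-of-segments style representation: segment, quantize, histogram.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)                  # stand-in for sequence data

segments = signal.reshape(-1, 20)               # 50 segments of length 20
codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(segments)

words = codebook.predict(segments)              # segment -> codeword id
bag = np.bincount(words, minlength=8) / len(words)
print(bag)   # a compact, interpretable feature vector for the sequence
```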

The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification, with emphasis on both accuracy and interpretability. GCRNN contains a convolutional network component to extract high-level features and a recurrent network component to enhance the modeling of temporal characteristics. A feed-forward fully connected network with sparse group lasso regularization generates the final classification and provides good interpretability.
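
A minimal PyTorch sketch in the spirit of a convolutional-recurrent classifier (not the published GCRNN architecture): convolutions extract local features, a GRU models temporal structure, and a linear head classifies. The sparse group lasso term would be added to the training loss and is omitted here.

```python
# Generic conv + recurrent time series classifier (illustrative only).
import torch
import torch.nn as nn

class ConvRNNClassifier(nn.Module):
    def __init__(self, in_channels=1, n_classes=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.rnn = nn.GRU(input_size=16, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                      # x: (batch, channels, time)
        h = self.conv(x)                       # (batch, 16, time)
        _, last = self.rnn(h.transpose(1, 2))  # GRU over the time steps
        return self.head(last[-1])             # logits from final hidden state

logits = ConvRNNClassifier()(torch.randn(8, 1, 100))
print(logits.shape)  # torch.Size([8, 3])
```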

The last topic centers on dimensionality reduction methods for time series data. A good dimensionality reduction method is important for the storage, decision making, and pattern visualization of time series data. The CRNN autoencoder is proposed to not only achieve low reconstruction error but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.
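
As a linear stand-in for this idea (PCA is equivalent to a linear autoencoder), the sketch below compresses time-series windows, reconstructs them, and uses reconstruction error as an anomaly score; it illustrates the application, not the CRNN autoencoder itself.

```python
# Reconstruction-error anomaly scoring with a linear "autoencoder" (PCA):
# patterns unlike the training data reconstruct poorly and score high.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
normal = np.sin(np.linspace(0, 6, 50)) + rng.normal(0, 0.1, size=(100, 50))
anomaly = rng.normal(0, 1.0, size=(1, 50))        # pattern never seen before

pca = PCA(n_components=5).fit(normal)             # low-dimensional code

def score(x):
    return np.mean((x - pca.inverse_transform(pca.transform(x))) ** 2, axis=1)

print(score(normal).mean(), score(anomaly)[0])    # anomaly scores far higher
```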

Date Created
  • 2018