Matching Items (18)

On the Application of Malware Clustering for Threat Intelligence Synthesis

Description

Malware forensics is a time-consuming process that involves a significant amount of data collection. To ease the load on security analysts, many attempts have been made to automate the intelligence-gathering process and provide a centralized search interface. Some of these solutions map existing relations between threats and can discover new intelligence by identifying correlations in the data. However, such systems generally treat each unique malware sample as its own distinct threat. This fails to model the real malware landscape, in which many "new" samples are actually variants of samples that have already been discovered. If there were a reliable way to determine whether two malware samples belong to the same family, intelligence for one sample could be applied to any sample in the family, greatly reducing the complexity of intelligence synthesis. Clustering is a common big-data approach for grouping data samples that share common features, and it has been applied in several recent papers to identify related malware. It therefore has the potential to simplify the intelligence synthesis process as described; however, existing threat intelligence systems do not use malware clustering. In this paper, we attempt to design a highly accurate malware clustering system, with the ultimate goal of integrating it into a threat intelligence platform. Toward this end, we explore the main considerations in designing such a system: how to extract features for comparing malware, and how to use those features for accurate clustering. We then create an experimental clustering system and evaluate its effectiveness using two different clustering algorithms.
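The family-clustering idea described above can be sketched in a few lines. The sample names, the API-call feature sets, and the similarity threshold below are all invented for illustration; the thesis's actual system uses richer features and evaluates two different clustering algorithms.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

def cluster_families(samples, threshold=0.5):
    """Greedy single-link clustering: a sample joins the first cluster
    containing any member whose similarity meets the threshold."""
    clusters = []
    for name, feats in samples.items():
        for cluster in clusters:
            if any(jaccard(feats, samples[m]) >= threshold for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Hypothetical API-call feature sets for three samples.
samples = {
    "a.exe": {"CreateRemoteThread", "WriteProcessMemory", "RegSetValue"},
    "a_variant.exe": {"CreateRemoteThread", "WriteProcessMemory", "Sleep"},
    "other.exe": {"InternetOpen", "HttpSendRequest"},
}
print(cluster_families(samples))  # → [['a.exe', 'a_variant.exe'], ['other.exe']]
```

With family membership established this way, intelligence attached to `a.exe` could be propagated to `a_variant.exe` without further analysis.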

Date Created
2017-05

Leveraging collective wisdom in a multilabeled blog categorization environment

Description

One of the most remarkable outcomes of the evolution of the web into Web 2.0 has been the propelling of blogging into a widely adopted and globally accepted phenomenon. While the unprecedented growth of the Blogosphere has added diversity and enriched the media, it has also added complexity. To cope with the relentless expansion, many enthusiastic bloggers have embarked on voluntarily writing, tagging, labeling, and cataloguing their posts in hopes of reaching the widest possible audience. Unbeknownst to them, this reaching-for-others process triggers the generation of a new kind of collective wisdom: a result of shared collaboration and the exchange of ideas, purpose, and objectives through the formation of associations, links, and relations. An understanding of the Blogosphere can greatly serve the needs of its ever-growing number of users, as well as producers, service providers, and advertisers, by facilitating the categorization and navigation of this vast environment. This work explores a novel method to leverage the collective wisdom of the infused label space for blog search and discovery. The work demonstrates that the wisdom space provides a uniquely desirable framework in which to discover the highly sought-after background information that can aid in the building of classifiers. This work incorporates this insight into the construction of a better clustering of blogs, which boosts the performance of classifiers in identifying more relevant labels for blogs, and offers a mechanism for replacing spurious labels and mislabels in a multi-labeled space.
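One way to picture the "collective wisdom" of an infused label space is through label co-occurrence across blogs. The following minimal sketch suggests a replacement for a suspect label by finding the label that most often co-occurs with it; the blog data and labels are invented, and the thesis's actual mechanism works at the level of blog clustering and classification.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(blogs):
    """Count how often each ordered pair of labels appears together."""
    co = Counter()
    for labels in blogs.values():
        for a, b in combinations(sorted(labels), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

def suggest_replacement(label, blogs, co):
    """Suggest the label that most often co-occurs with `label` — a simple
    stand-in for correcting spurious labels in a multi-labeled space."""
    others = {l for ls in blogs.values() for l in ls} - {label}
    return max(others, key=lambda other: co[(label, other)], default=None)

# Invented multi-labeled blogs.
blogs = {
    "blog1": {"python", "coding"},
    "blog2": {"python", "coding", "ml"},
    "blog3": {"python", "ml"},
}
co = cooccurrence(blogs)
print(suggest_replacement("ml", blogs, co))  # → python
```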

Date Created
2015

Visual analytics for spatiotemporal cluster analysis

Description

Traditionally, visualization is one of the most important and commonly used methods of generating insight into large-scale data. Particularly for spatiotemporal data, the translation of such data into a visual form allows users to quickly see patterns, explore summaries, and relate domain knowledge about underlying geographical phenomena that would not be apparent in tabular form. However, several critical challenges arise when visualizing and exploring these large spatiotemporal datasets. While the underlying geographical component of the data lends itself well to univariate visualization in the form of traditional cartographic representations (e.g., choropleth, isopleth, dasymetric maps), as the data becomes multivariate, cartographic representations become more complex. To simplify the visual representations, analytical methods such as clustering and feature extraction are often applied as part of the classification phase. The automatic classification can then be rendered onto a map; however, one common issue in data classification is that items near a classification boundary are often mislabeled.

This thesis explores methods to augment the automated spatial classification by utilizing interactive machine learning as part of the cluster creation step. First, this thesis explores the design space for spatiotemporal analysis through the development of a comprehensive data wrangling and exploratory data analysis platform. Second, this system is augmented with a novel method for evaluating the visual impact of edge cases for multivariate geographic projections. Finally, system features and functionality are demonstrated through a series of case studies, with key features including similarity analysis, multivariate clustering, and novel visual support for cluster comparison.
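A simple way to surface the boundary-mislabeling problem for interactive review is to flag items whose two nearest cluster centroids are nearly equidistant. The points, centroids, and margin below are invented; the thesis's edge-case evaluation for multivariate geographic projections is considerably richer.

```python
import math

def boundary_cases(points, centroids, margin=0.2):
    """Flag points whose nearest and second-nearest centroids are nearly
    equidistant (distance ratio above 1 - margin) — candidates for
    interactive relabeling."""
    flagged = []
    for p in points:
        dists = sorted(math.dist(p, c) for c in centroids)
        if dists[1] > 0 and dists[0] / dists[1] > 1 - margin:
            flagged.append(p)
    return flagged

# Two invented cluster centroids and three points.
centroids = [(0.0, 0.0), (10.0, 0.0)]
points = [(1.0, 0.0), (5.2, 0.0), (9.0, 1.0)]
print(boundary_cases(points, centroids))  # → [(5.2, 0.0)]
```

Only the point sitting near the midpoint between the two centroids is flagged; the clearly assigned points are left alone.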

Date Created
2016

Study of collocated sources of air pollution and the potential for circumventing regulatory major source permitting requirements near Sun City, Arizona

Description

The following research is a regulatory and emissions analysis of collocated sources of air pollution as they relate to the definition of "major stationary sources" if their emissions were amalgamated. The emitting sources chosen for this study are seven facilities located in a single aggregate mining pit along the Agua Fria riverbed in Sun City, Arizona. The sources in question consist of Rock Crushing and Screening plants, Hot Mix Asphalt plants, and Concrete Batch plants. Generally, individual facilities with emissions of a criteria air pollutant over 100 tons per year, or 70 tons per year for PM10 in the Maricopa County non-attainment area, would be required to operate under a different permitting regime than those with lower emissions. In addition, facilities that emit over 25 tons per year or 150 pounds per hour of NOx would trigger Maricopa County Best Available Control Technology (BACT) requirements and would be required to install more stringent pollution controls. However, in order to circumvent the more stringent permitting requirements, some facilities have "collocated" to escape having their emissions calculated as a single source, while operating as a single production entity. The results of this study indicate that the sources analyzed do not collectively emit major-source levels of emissions; however, they do trigger the annual and hourly BACT thresholds for NOx. It was also discovered that lack of grid power contributes to the use of generators, which are the main source of emissions. Therefore, if grid electricity were introduced in outlying areas of Maricopa County, facilities could significantly reduce their use of generator power, thereby reducing the pollutants associated with generator use.
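The amalgamation test at the heart of the study is simple arithmetic: sum emissions across the collocated facilities and compare the totals against the regulatory triggers. The annual thresholds below come from the text; the per-facility emission figures are invented for illustration.

```python
MAJOR_TPY = 100        # tons/year, criteria-pollutant major-source level
MAJOR_PM10_TPY = 70    # tons/year for PM10, Maricopa County non-attainment area
BACT_NOX_TPY = 25      # tons/year NOx trigger for Maricopa County BACT

def amalgamated_status(facilities):
    """Sum emissions across collocated facilities and test the triggers."""
    total = {}
    for f in facilities:
        for pollutant, tpy in f.items():
            total[pollutant] = total.get(pollutant, 0) + tpy
    return {
        "total": total,
        "major_source": (total.get("NOx", 0) >= MAJOR_TPY
                         or total.get("PM10", 0) >= MAJOR_PM10_TPY),
        "bact_nox": total.get("NOx", 0) >= BACT_NOX_TPY,
    }

# Invented emissions for three collocated facilities (tons/year).
facilities = [{"NOx": 12, "PM10": 20}, {"NOx": 10, "PM10": 15}, {"NOx": 8, "PM10": 10}]
status = amalgamated_status(facilities)
print(status["major_source"], status["bact_nox"])  # → False True
```

This mirrors the shape of the study's finding: the combined sources fall short of major-source levels but still trip the NOx BACT trigger.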

Date Created
2011

Query expansion for handling exploratory and ambiguous keyword queries

Description

Query expansion is a functionality of search engines that suggests a set of related queries for a user-issued keyword query. In the case of exploratory or ambiguous keyword queries, the main goal of the user is to identify and select a specific category of query results among the different categorical options, in order to narrow down the search and reach the desired result. Typical corpus-driven keyword query expansion approaches return popular words in the results as expanded queries. These empirical methods fail to cover all semantics of the categories present in the query results. More importantly, these methods do not consider the semantic relationship between the keywords featured in an expanded query. Contrary to a normal keyword search setting, these factors are non-trivial in an exploratory and ambiguous query setting, where the user's precise discernment of the different categories present in the query results is more important for making subsequent search decisions. In this thesis, I propose a new framework for keyword query expansion: generating a set of queries that correspond to a categorization of the original query results, referred to as categorizing query expansion. Two categories of algorithms are proposed: one performs clustering as a pre-processing step and then generates categorizing expanded queries based on the clusters; the other generates quality expanded queries in the presence of imperfect clusters.
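The first approach can be pictured as follows: given results already grouped into clusters, each cluster contributes one expanded query built from its most distinctive terms. The example documents and the simple count-difference scoring are invented for illustration; the thesis's algorithms are more sophisticated.

```python
from collections import Counter

def categorizing_expansion(query, clustered_results, k=2):
    """One expanded query per result cluster: the original query plus the
    k terms most frequent in this cluster and rarest in the others."""
    expansions = []
    for i, docs in enumerate(clustered_results):
        here = Counter(w for d in docs for w in d.split())
        elsewhere = Counter(w for j, ds in enumerate(clustered_results)
                            if j != i for d in ds for w in d.split())
        # score terms frequent in this cluster but rare elsewhere
        scored = {w: c - elsewhere[w] for w, c in here.items() if w != query}
        top = sorted(scored, key=scored.get, reverse=True)[:k]
        expansions.append(query + " " + " ".join(top))
    return expansions

# Two invented result clusters for the ambiguous query "jaguar".
clusters = [
    ["jaguar car speed", "jaguar car engine"],
    ["jaguar cat jungle", "jaguar cat habitat"],
]
print(categorizing_expansion("jaguar", clusters))
# → ['jaguar car speed', 'jaguar cat jungle']
```

Each expanded query names one category of the ambiguous results, letting the user pick a direction rather than just a popular word.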

Date Created
2011

Luminosity function of Lyman-alpha emitters at the reionization epoch: observations & theory

Description

Galaxies with strong Lyman-alpha (Lya) emission lines (also called Lya galaxies or emitters) offer a unique probe of the epoch of reionization, one of the important phases when most of the neutral hydrogen in the universe was ionized. In addition, Lya galaxies at high redshifts are a powerful tool for studying low-mass galaxy formation. Since current observations suggest that reionization is complete by redshift z ~ 6, it is necessary to discover galaxies at z > 6 in order to use their luminosity function (LF) as a probe of reionization. I found five z = 7.7 candidate Lya galaxies with line fluxes > 7x10^-18 erg/s/cm^2, from three different deep near-infrared (IR) narrowband (NB) imaging surveys covering a volume > 4x10^4 Mpc^3. From the spectroscopic followup of four candidate galaxies, and with the current spectroscopic sensitivity, only the detection of the brightest candidate galaxy can be ruled out at the 5-sigma level. Moreover, these observations successfully demonstrate that the sensitivity necessary for both the NB imaging and the spectroscopic followup of z ~ 8 Lya galaxies can be reached with current instrumentation. While future, more sensitive spectroscopic observations are necessary, the observed Lya LF at z = 7.7 is consistent with the z = 6.6 LF, suggesting that the intergalactic medium (IGM) is relatively ionized even at z = 7.7, with neutral fraction x_HI <= 30%. On the theoretical front, while several models of Lya emitters have been developed, the physical nature of Lya emitters is not yet completely understood. Moreover, the complexity of multi-parameter models necessitates a simpler model. I have developed a simple, single-parameter model to populate dark matter halos with Lya emitters. The central tenet of this model, different from many earlier models, is that the star-formation rate (SFR), and hence the Lya luminosity, is proportional to the mass accretion rate rather than the total halo mass. This simple model successfully reproduces many observables, including LFs, stellar masses, SFRs, and clustering of Lya emitters from z ~ 3 to z ~ 7. Finally, using this model, I find that the mass accretion, and hence the star formation, in > 30% of Lya emitters at z ~ 3 occurs through major mergers, and this fraction increases to ~ 50% at z ~ 7.
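The model's central tenet can be written as a two-line toy calculation: luminosity traces the halo mass accretion rate, not the total halo mass. The case-B conversion factor between SFR and Lya luminosity is a standard value; the star-formation efficiency used here is illustrative and is not the thesis's fitted parameter.

```python
LYA_PER_SFR = 1.1e42  # erg/s per (Msun/yr), standard case-B recombination value

def lya_luminosity(accretion_rate, f_star=0.05):
    """accretion_rate in Msun/yr; f_star stands in for the model's single
    free parameter, the fraction of accreted mass turned into stars."""
    sfr = f_star * accretion_rate
    return LYA_PER_SFR * sfr
```

Populating halos this way makes a rapidly accreting low-mass halo a brighter emitter than a massive but quiescent one, which is the qualitative difference from total-halo-mass models.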

Date Created
2011

An analytical approach to lean six sigma deployment strategies: project identification and prioritization

Description

The ever-changing economic landscape has forced many companies to re-examine their supply chains. Global resourcing and outsourcing of processes has been a strategy many organizations have adopted to reduce cost and to increase their global footprint. This has, however, resulted in increased process complexity and reduced customer satisfaction. In order to meet and exceed customer expectations, many companies are forced to improve quality and on-time delivery, and have looked towards Lean Six Sigma as an approach to enable process improvement. The Lean Six Sigma literature is rich in deployment strategies; however, there is a general lack of a mathematical approach to deploying Lean Six Sigma in a global enterprise, including both project identification and prioritization. The research presented here is two-fold. Firstly, a process characterization framework is presented to evaluate processes based on eight characteristics. An unsupervised learning technique, using clustering algorithms, is then utilized to group processes that are Lean Six Sigma conducive. The approach helps Lean Six Sigma deployment champions identify key areas within the business on which to focus a Lean Six Sigma deployment. A case study is presented, in which 33% of the processes were found to be Lean Six Sigma conducive. Secondly, having identified the parts of the business that are Lean Six Sigma conducive, the next steps are to formulate and prioritize a portfolio of projects. Very often the deployment champion is faced with the decision of selecting a portfolio of Lean Six Sigma projects that meets multiple objectives, which could include maximizing productivity, customer satisfaction, or return on investment, while meeting certain budgetary constraints. A multi-period 0-1 knapsack problem is presented that maximizes the expected net savings of the Lean Six Sigma portfolio over the life cycle of the deployment.
Finally, a case study is presented that demonstrates the application of the model in a large multinational company. Traditionally, Lean Six Sigma found its roots in manufacturing. The research presented in this dissertation also emphasizes the applicability of the methodology to the non-manufacturing space. Additionally, a comparison is conducted between manufacturing and non-manufacturing processes to highlight the challenges in deploying the methodology in both spaces.
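A single-period version of the portfolio model is an ordinary 0-1 knapsack, solvable by dynamic programming over integer budget units; the thesis extends this to multiple periods. The project names, costs, and expected savings below are invented.

```python
def select_portfolio(projects, budget):
    """projects: (name, cost, expected_savings) tuples; budget in the same
    integer cost units. Returns (best expected savings, chosen names)."""
    # best[b] = best (savings, chosen-names) achievable with budget b
    best = [(0.0, ())] * (budget + 1)
    for name, cost, savings in projects:
        # iterate the budget downward so each project is used at most once
        for b in range(budget, cost - 1, -1):
            prev_savings, prev_names = best[b - cost]
            if prev_savings + savings > best[b][0]:
                best[b] = (prev_savings + savings, prev_names + (name,))
    return best[budget]

projects = [("A", 3, 10.0), ("B", 4, 12.0), ("C", 5, 15.0)]
print(select_portfolio(projects, 8))  # → (25.0, ('A', 'C'))
```

With a budget of 8 units the solver skips the mid-value project B in favor of the A+C combination, which is exactly the kind of non-obvious trade-off a deployment champion faces.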

Date Created
2011

Clustering of stars in nearby galaxies: probing the range of stellar structures

Description

Most stars form in groups, and these clusters are themselves nestled within larger associations and stellar complexes. It is not yet clear, however, whether stars cluster on preferred size scales within galaxies, or if stellar groupings have a continuous size distribution. I have developed two methods to select stellar groupings across a wide range of size scales in order to assess trends in the size distribution and other basic properties of stellar groupings. The first method uses visual inspection of color-magnitude and color-color diagrams of clustered stars to assess whether the compact sources within a potential association are coeval, and thus likely to be born from the same parent molecular cloud. This method was developed using the stellar associations in the M51/NGC 5195 interacting galaxy system. This process is highly effective at selecting single-aged stellar associations, but in order to assess properties of stellar clustering in a larger sample of nearby galaxies, an automated method for selecting stellar groupings is needed. I have developed an automated stellar grouping selection method that is sensitive to stellar clustering on all size scales. Using the Source Extractor software package on Gaussian-blurred images, and the annular surface brightness to determine the characteristic size of each cluster/association, I eliminate much of the size and density bias intrinsic to other methods. This automated method was tested in the nearby dwarf irregular galaxy NGC 4214, and can detect stellar groupings with sizes ranging from compact clusters to stellar complexes. In future work, the automated selection method developed in this dissertation will be used to identify stellar groupings in a set of nearby galaxies to determine whether the size scales for stellar clustering are uniform in the nearby universe or dependent on the local galactic environment.
Once the stellar clusters and associations have been identified and age-dated, this information can be used to deduce disruption times from the age distribution as a function of the position of the stellar grouping within the galaxy, the size of the cluster or association, and the morphological type of the galaxy. The implications of these results for galaxy formation and evolution are discussed.
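The multi-scale detection idea can be illustrated in one dimension: Gaussian-smooth a histogram of star positions and pick local maxima. This is only a toy stand-in for blurring images and running Source Extractor, and the star positions are invented.

```python
import math

def smooth(counts, sigma):
    """Gaussian-smooth a 1-D histogram (kernel truncated at 3 sigma)."""
    r = int(3 * sigma)
    weights = [math.exp(-(k * k) / (2 * sigma * sigma)) for k in range(-r, r + 1)]
    norm = sum(weights)
    out = []
    for x in range(len(counts)):
        acc = sum(w * counts[x + k]
                  for k, w in zip(range(-r, r + 1), weights)
                  if 0 <= x + k < len(counts))
        out.append(acc / norm)
    return out

def peaks(values, threshold):
    """Local maxima above a threshold; ties on the left are allowed so a
    flat two-bin plateau still yields a single peak."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > threshold
            and values[i] >= values[i - 1] and values[i] > values[i + 1]]

# Two tight pairs of star positions: a toy field with two compact clusters.
counts = [0] * 30
for i in (5, 6, 12, 13):
    counts[i] = 5
```

With a small blur (sigma = 1) the pairs at positions 5-6 and 12-13 are detected as two separate compact groupings; with a large blur (sigma = 4) they merge into one peak, i.e., a single larger complex — which is how varying the blur scale probes the range of stellar structures.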

Date Created
2011

TensorDB and tensor-relational model (TRM) for efficient tensor-relational operations

Description

Multidimensional data have various representations. Thanks to their simplicity in modeling multidimensional data and the availability of various mathematical tools (such as tensor decompositions) that support multi-aspect analysis of such data, tensors are increasingly being used in many application domains, including scientific data management, sensor data management, and social network data analysis. The relational model, on the other hand, enables semantic manipulation of data using relational operators, such as projection, selection, Cartesian product, and set operators. For many multidimensional data applications, tensor operations as well as relational operations need to be supported throughout the data life cycle. In this thesis, we introduce a tensor-relational data model (TRM), which enables both tensor-based data analysis and relational manipulation of multidimensional data, and define tensor-relational operations on this model. We then introduce a tensor-relational data management system, called TensorDB, based on TRM, which brings together relational algebraic operations (for data manipulation and integration) and tensor algebraic operations (for data analysis). We develop optimization strategies for tensor-relational operations in both in-memory and in-database TensorDB. The goal of TRM and TensorDB is to serve as a single environment that supports the entire life cycle of data; that is, data can be manipulated, integrated, processed, and analyzed.
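The flavor of mixing the two algebras can be sketched with a sparse tensor stored as a coordinate map: a relational-style selection restricts one mode, then a tensor-style aggregation marginalizes another. The operation names and the sensor example are illustrative, not the TensorDB API.

```python
# Toy sparse tensor: {(sensor, hour, day): reading}
def select(tensor, mode, predicate):
    """Relational-style selection: keep entries whose index along `mode`
    satisfies the predicate."""
    return {idx: v for idx, v in tensor.items() if predicate(idx[mode])}

def mode_sum(tensor, mode):
    """Tensor-style aggregation: marginalize out one mode."""
    out = {}
    for idx, v in tensor.items():
        key = idx[:mode] + idx[mode + 1:]
        out[key] = out.get(key, 0) + v
    return out

t = {(0, 0, 0): 1, (0, 1, 0): 2, (1, 0, 0): 3, (0, 0, 1): 4, (0, 0, 2): 5}
# relational selection (day < 2), then tensor aggregation over hours
daily = mode_sum(select(t, 2, lambda day: day < 2), 1)
print(daily)  # → {(0, 0): 3, (1, 0): 3, (0, 1): 4}
```

Composing the two operator families in one pipeline like this, and choosing where each step executes, is the kind of plan a tensor-relational optimizer would reason about.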

Date Created
2014

Controversy analysis: clustering and ranking polarized networks with visualizations

Description

The US Senate is the venue of political debate where federal bills are formed and voted on. Senators show their support for or opposition to bills with their votes. This information makes it possible to extract the polarity of the senators. Similarly, the blogosphere plays an increasingly important role as a forum for public debate. Authors display sentiment toward issues, organizations, or people using natural language.

In this research, given a mixed set of senators/blogs debating a set of political issues from opposing camps, I use signed bipartite graphs to model debates, and I propose an algorithm for partitioning both the opinion holders (senators or blogs) and the issues (bills or topics) comprising the debate into binary opposing camps. Simultaneously, my algorithm scales the entities on a univariate scale. Using this scale, a researcher can identify moderate and extreme senators/blogs within each camp, and polarizing versus unifying issues. Through performance evaluations, I show that my proposed algorithm provides an effective solution to the problem and performs much better than existing baseline algorithms adapted to solve this new problem. In my experiments, I used both real data from the political blogosphere and US Congress records, as well as synthetic data obtained by varying the polarization and degree distribution of the vertices of the graph, to show the robustness of my algorithm.
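A toy version of the camp-partitioning step on a signed bipartite graph might look like the following label-propagation sweep. The senators, bills, and votes are invented, and this sketch omits the simultaneous univariate scaling that the actual algorithm produces.

```python
def sign(x):
    return 1 if x > 0 else -1 if x < 0 else 0

def partition_camps(votes, sweeps=10):
    """Assign each senator and bill to camp +1 or -1 so that yea (+1) votes
    link same-camp pairs and nay (-1) votes link opposite camps."""
    senators = sorted({s for s, _ in votes})
    bills = sorted({b for _, b in votes})
    camp = {x: 0 for x in senators + bills}
    camp[senators[0]] = 1  # anchor one senator to fix the labeling
    for _ in range(sweeps):
        for s in senators:
            t = sum(v * camp[b] for (x, b), v in votes.items() if x == s)
            camp[s] = sign(t) or camp[s]  # keep old camp on a tie
        for b in bills:
            t = sum(v * camp[x] for (x, y), v in votes.items() if y == b)
            camp[b] = sign(t) or camp[b]
    return camp

# Invented roll call: A and B vote together, C votes the opposite way.
votes = {("A", "b1"): 1, ("A", "b2"): -1, ("B", "b1"): 1,
         ("B", "b2"): -1, ("C", "b1"): -1, ("C", "b2"): 1}
camp = partition_camps(votes)
```

On this tiny example the sweep places A and B in one camp and C in the other, and splits the two bills across the camps accordingly.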

I also applied my algorithm to all terms of the US Senate to date for a longitudinal analysis, and developed a web-based interactive user interface, www.PartisanScale.com, to visualize the analysis.

US politics is most often polarized with respect to the left/right alignment of the entities. However, certain issues do not reflect the polarization of the political parties, but instead show a split correlating with the demographics of the senators, or simply receive consensus. I propose a hierarchical clustering algorithm that identifies groups of bills that share the same polarization characteristics. I developed a web-based interactive user interface, www.ControversyAnalysis.com, to visualize the clusters while providing a synopsis through distribution charts, word clouds, and heat maps.

Date Created
2015