Description
Query expansion is a search-engine feature that suggests a set of related queries for a user-issued keyword query. For exploratory or ambiguous keyword queries, the user's main goal is to identify and select a specific category of query results among different categorical options, in order to narrow the search and reach the desired result. Typical corpus-driven keyword query expansion approaches return popular words from the results as expanded queries. These empirical methods fail to cover all the category semantics present in the query results; more importantly, they do not consider the semantic relationships between the keywords featured in an expanded query. Unlike a normal keyword search setting, these factors are non-trivial in an exploratory and ambiguous query setting, where the user's precise discernment of the different categories present in the query results is more important for making subsequent search decisions. In this thesis, I propose a new framework for keyword query expansion: generating a set of queries that correspond to a categorization of the original query results, referred to as categorizing query expansion. Two classes of algorithms are proposed: one performs clustering as a pre-processing step and then generates categorizing expanded queries from the clusters; the other generates quality expanded queries in the presence of imperfect clusters.
Contributors: Natarajan, Sivaramakrishnan (Author) / Chen, Yi (Thesis advisor) / Candan, Selcuk (Committee member) / Sen, Arunabha (Committee member) / Arizona State University (Publisher)
Created: 2011
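
The abstract above describes categorizing query expansion driven by clustering of the result documents. A minimal illustrative sketch of the clustering-as-pre-processing variant, not the thesis's own algorithms, might look like the following; the TF-IDF features, k-means clustering, and top-terms-per-cluster heuristic are assumptions chosen for brevity.

```python
# Minimal sketch: cluster result documents, then emit one expanded query per cluster.
# TF-IDF + k-means and the "top terms per cluster" heuristic are illustrative choices,
# not the algorithms proposed in the thesis.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def categorizing_expansion(original_query, result_docs, n_categories=3, terms_per_query=2):
    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(result_docs)
    km = KMeans(n_clusters=n_categories, n_init=10, random_state=0).fit(X)
    terms = np.array(vectorizer.get_feature_names_out())
    expanded = []
    for c in range(n_categories):
        # Rank terms by their weight in the cluster centroid and append the best ones.
        top = terms[np.argsort(km.cluster_centers_[c])[::-1][:terms_per_query]]
        expanded.append(f"{original_query} " + " ".join(top))
    return expanded

docs = ["jaguar speed in the wild", "jaguar car engine specs", "jaguar animal habitat",
        "jaguar xf price review", "big cat jaguar diet", "jaguar sedan road test"]
print(categorizing_expansion("jaguar", docs))
```
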
Description
Galaxies with a strong Lyman-alpha (Lya) emission line (also called Lya galaxies or emitters) offer a unique probe of the epoch of reionization, one of the important phases when most of the neutral hydrogen in the universe was ionized. In addition, Lya galaxies at high redshifts are a powerful tool for studying low-mass galaxy formation. Since current observations suggest that reionization was complete by redshift z ~ 6, it is necessary to discover galaxies at z > 6 in order to use their luminosity function (LF) as a probe of reionization. I found five z = 7.7 candidate Lya galaxies with line fluxes > 7x10^-18 erg/s/cm^2, from three different deep near-infrared (IR) narrowband (NB) imaging surveys covering a volume > 4x10^4 Mpc^3. From the spectroscopic follow-up of four candidate galaxies, and with the current spectroscopic sensitivity, only the detection of the brightest candidate galaxy can be ruled out at the 5-sigma level. Moreover, these observations successfully demonstrate that the sensitivity necessary for both the NB imaging and the spectroscopic follow-up of z ~ 8 Lya galaxies can be reached with current instrumentation. While future, more sensitive spectroscopic observations are necessary, the observed Lya LF at z = 7.7 is consistent with the z = 6.6 LF, suggesting that the intergalactic medium (IGM) is relatively ionized even at z = 7.7, with neutral fraction x_HI ≤ 30%. On the theoretical front, while several models of Lya emitters have been developed, the physical nature of Lya emitters is not yet completely understood; moreover, the complexity of multi-parameter models motivates a simpler one. I have developed a simple, single-parameter model to populate dark matter halos with Lya emitters. The central tenet of this model, different from many earlier models, is that the star-formation rate (SFR), and hence the Lya luminosity, is proportional to the mass accretion rate rather than the total halo mass. This simple model successfully reproduces many observables, including the LFs, stellar masses, SFRs, and clustering of Lya emitters from z ~ 3 to z ~ 7. Finally, using this model, I find that the mass accretion, and hence the star formation, in > 30% of Lya emitters at z ~ 3 occurs through major mergers, and this fraction increases to ~ 50% at z ~ 7.
Contributors: Shet Tilvi, Vithal (Author) / Malhotra, Sangeeta (Thesis advisor) / Rhoads, James (Committee member) / Scannapieco, Evan (Committee member) / Young, Patrick (Committee member) / Jansen, Rolf (Committee member) / Arizona State University (Publisher)
Created: 2011
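
A toy version of the single-parameter prescription sketched in the preceding abstract, Lya luminosity proportional to the halo mass accretion rate with one efficiency constant, might look like this; the accretion-rate distribution and the proportionality constant are placeholders, not the thesis's calibrated model.

```python
# Toy sketch of a one-parameter halo model: Lya luminosity traces the halo
# mass accretion rate, not the total halo mass. Numbers are placeholders.
import numpy as np

def lya_luminosity(accretion_rate_msun_yr, efficiency=1e42):
    """Map a halo mass accretion rate (Msun/yr) to a Lya luminosity (erg/s).

    `efficiency` is the single free parameter; its value here is illustrative only.
    """
    return efficiency * accretion_rate_msun_yr

def luminosity_function(luminosities, volume_mpc3, bin_edges):
    """Number density of emitters per luminosity bin (a crude binned LF)."""
    counts, _ = np.histogram(luminosities, bins=bin_edges)
    return counts / volume_mpc3

rng = np.random.default_rng(0)
accretion_rates = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)   # Msun/yr, mock halos
lum = lya_luminosity(accretion_rates)
edges = np.logspace(41.5, 44, 11)
print(luminosity_function(lum, volume_mpc3=4e4, bin_edges=edges))
```
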
Description
The following research is a regulatory and emissions analysis of collocated sources of air pollution as they relate to the definition of "major stationary sources" if their emissions were amalgamated. The emitting sources chosen for this study are seven facilities located in a single aggregate mining pit along the Agua Fria riverbed in Sun City, Arizona. The sources in question consist of rock crushing and screening plants, hot mix asphalt plants, and concrete batch plants. Generally, an individual facility with emissions of a criteria air pollutant over 100 tons per year, or 70 tons per year for PM10 in the Maricopa County non-attainment area, would be required to operate under a different permitting regime than one with lower emissions. In addition, facilities that emit over 25 tons per year or 150 pounds per hour of NOx trigger Maricopa County Best Available Control Technology (BACT) requirements and must install more stringent pollution controls. However, in order to circumvent the more stringent permitting requirements, some facilities have "collocated" so that their emissions are not calculated as a single source, even while they operate as a single production entity. The results of this study indicate that the sources analyzed do not collectively emit major-source levels of emissions; however, they do trigger the annual and daily BACT thresholds for NOx. It was also found that the lack of grid power contributes to the use of generators, which are the main source of emissions. Therefore, if grid electricity were introduced in outlying areas of Maricopa County, facilities could significantly reduce their use of generator power, thereby reducing the pollutants associated with generator use.
Contributors: Franquist, Timothy S. (Author) / Olson, Larry (Thesis advisor) / Hild, Nicholas (Committee member) / Brown, Albert (Committee member) / Arizona State University (Publisher)
Created: 2011
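
As a quick illustration of the amalgamation question raised in the preceding abstract, the sketch below sums per-facility emissions and checks them against the thresholds the abstract quotes (100 tpy criteria pollutant, 70 tpy PM10 in the non-attainment area, 25 tpy or 150 lb/hr NOx for BACT). The facility numbers are made-up placeholders, not the study's data.

```python
# Sketch: amalgamate per-facility emissions and compare against the thresholds
# quoted in the abstract. All facility values below are made-up placeholders.
MAJOR_SOURCE_TPY = 100.0      # criteria pollutant, tons/year
MAJOR_SOURCE_PM10_TPY = 70.0  # PM10 in the Maricopa County non-attainment area
BACT_NOX_TPY = 25.0           # tons/year
BACT_NOX_LB_HR = 150.0        # pounds/hour

facilities = [  # (name, PM10 tpy, NOx tpy, NOx lb/hr) -- illustrative numbers only
    ("rock crusher A", 12.0, 6.5, 14.0),
    ("hot mix asphalt", 9.0, 11.0, 30.0),
    ("concrete batch", 7.5, 4.0, 9.0),
]

pm10 = sum(f[1] for f in facilities)
nox_tpy = sum(f[2] for f in facilities)
nox_lbhr = sum(f[3] for f in facilities)

print(f"combined PM10 {pm10:.1f} tpy -> major source: {pm10 > MAJOR_SOURCE_PM10_TPY}")
print(f"combined NOx  {nox_tpy:.1f} tpy / {nox_lbhr:.1f} lb/hr -> BACT: "
      f"{nox_tpy > BACT_NOX_TPY or nox_lbhr > BACT_NOX_LB_HR}")
```
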
Description
With the rapid development of mobile sensing technologies such as GPS, RFID, and sensors in smartphones, capturing position data in the form of trajectories has become easy. Moving-object trajectory analysis is a growing area of interest owing to its applications in domains such as marketing, security, and traffic monitoring and management. To better understand movement behaviors from raw mobility data, this doctoral work provides analytic models for analyzing trajectory data. As a first contribution, a model is developed to detect changes in trajectories over time. If the taxis moving in a city are viewed as sensors that provide real-time information about the city's traffic, a change in these trajectories over time can reveal that the road network has changed. To detect changes, trajectories are modeled with a Hidden Markov Model (HMM). A modified training algorithm for parameter estimation in HMMs, called m-BaumWelch, is used to develop likelihood estimates under assumed changes and to detect changes in trajectory data over time. Data from vehicles are used to test the change-detection method. Secondly, sequential pattern mining is used to develop a model to detect changes in frequent patterns occurring in trajectory data. The aim is to answer two questions: Are the frequent patterns still frequent in the new data? If they are, has the time-interval distribution within the pattern changed? Two approaches are considered for change detection: a frequency-based approach and a distribution-based approach. The methods are illustrated with vehicle trajectory data. Finally, a model is developed for clustering and outlier detection in semantic trajectories. A challenge with clustering semantic trajectories is that both numeric and categorical attributes are present. Another problem to be addressed while clustering is that trajectories can be of different lengths and can have missing values. A tree-based ensemble is used to address these problems, and the approach is extended to outlier detection in semantic trajectories.
Contributors: Kondaveeti, Anirudh (Author) / Runger, George C. (Thesis advisor) / Mirchandani, Pitu (Committee member) / Pan, Rong (Committee member) / Maciejewski, Ross (Committee member) / Arizona State University (Publisher)
Created: 2012
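
The first contribution above detects change by comparing likelihoods under an HMM fitted to reference trajectories. The thesis uses a modified training algorithm (m-BaumWelch); the sketch below instead uses standard Baum-Welch training from hmmlearn and a simple likelihood-drop check, so the model structure, feature layout, and threshold are all stand-ins rather than the thesis's method.

```python
# Sketch: fit an HMM to reference trajectories, then flag new trajectories whose
# per-observation log-likelihood drops well below the reference level.
# Uses standard Baum-Welch training from hmmlearn, not the thesis's m-BaumWelch.
import numpy as np
from hmmlearn.hmm import GaussianHMM

rng = np.random.default_rng(1)

def make_trajectories(n, length, drift):
    """Toy (x, y) trajectories: random walks with a fixed drift per step."""
    steps = rng.normal(loc=drift, scale=0.1, size=(n, length, 2))
    return [np.cumsum(s, axis=0) for s in steps]

reference = make_trajectories(20, 50, drift=[0.5, 0.0])   # traffic flowing "east"
model = GaussianHMM(n_components=3, covariance_type="diag", n_iter=50, random_state=1)
model.fit(np.vstack(reference), lengths=[len(t) for t in reference])

baseline = np.mean([model.score(t) / len(t) for t in reference])

def changed(trajectory, threshold=1.0):
    """Flag a change if the normalized log-likelihood falls `threshold` below baseline."""
    return model.score(trajectory) / len(trajectory) < baseline - threshold

print(changed(make_trajectories(1, 50, drift=[0.5, 0.0])[0]))   # expected: False
print(changed(make_trajectories(1, 50, drift=[0.0, 0.5])[0]))   # expected: True (rerouted)
```
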
Description
The rapid growth of high-throughput technologies over the last few decades has made manual processing of the generated data impracticable. Even worse, machine learning and data mining techniques can seem paralyzed by these massive datasets. High dimensionality is one of the most common challenges for machine learning and data mining tasks. Feature selection aims to reduce dimensionality by selecting a small subset of the features that performs at least as well as the full feature set. Generally, learning performance (e.g., classification accuracy) and algorithm complexity are used to measure the quality of an algorithm. Recently, the stability of feature selection algorithms has gained increasing attention as a new indicator, owing to the need to select similar subsets of features each time the algorithm is run on the same dataset, even in the presence of a small amount of perturbation. In order to cure the selection stability issue, we should first understand the causes of instability. In this dissertation, we investigate the causes of instability in high-dimensional datasets using well-known feature selection algorithms. We find that stability is mostly data-dependent. Based on these findings, we propose a framework to improve selection stability by addressing its main causes. In particular, we find that data noise greatly impacts both stability and learning performance, so we propose to reduce it in order to improve both. However, current noise-reduction approaches are not able to distinguish between data noise and variation among samples from different classes. We overcome this limitation with Supervised noise reduction via Low Rank Matrix Approximation (SLRMA for short). The proposed framework proves successful on different types of high-dimensional datasets, such as microarray and image datasets. However, this framework cannot handle unlabeled data; hence, we propose Local SVD to overcome this limitation.
Contributors: Alelyani, Salem (Author) / Liu, Huan (Thesis advisor) / Xue, Guoliang (Committee member) / Ye, Jieping (Committee member) / Zhao, Zheng (Committee member) / Arizona State University (Publisher)
Created: 2013
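
The abstract above treats selection stability as how similar the selected feature subsets are across runs on perturbed versions of the same data. A minimal way to quantify that, assuming bootstrap perturbation, a univariate filter, and average pairwise Jaccard similarity as the stability score (all illustrative choices, not the dissertation's measures), is sketched below.

```python
# Sketch: estimate selection stability as the average pairwise Jaccard similarity
# of the feature subsets chosen on bootstrap resamples of the same dataset.
# The univariate filter (SelectKBest) and the Jaccard score are illustrative choices.
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

def selection_stability(X, y, k=20, n_runs=10, seed=0):
    rng = np.random.default_rng(seed)
    subsets = []
    for _ in range(n_runs):
        idx = rng.integers(0, len(y), size=len(y))          # bootstrap resample
        sel = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        subsets.append(set(np.flatnonzero(sel.get_support())))
    jaccards = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
    return float(np.mean(jaccards))

X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)
print(f"stability ~ {selection_stability(X, y):.2f}")
```
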
Description
Most stars form in groups, and these clusters are themselves nestled within larger associations and stellar complexes. It is not yet clear, however, whether stars cluster on preferred size scales within galaxies, or whether stellar groupings have a continuous size distribution. I have developed two methods to select stellar groupings across a wide range of size scales in order to assess trends in the size distribution and other basic properties of stellar groupings. The first method uses visual inspection of color-magnitude and color-color diagrams of clustered stars to assess whether the compact sources within a potential association are coeval, and thus likely to have been born from the same parent molecular cloud. This method was developed using the stellar associations in the M51/NGC 5195 interacting galaxy system. The process is highly effective at selecting single-aged stellar associations, but assessing the properties of stellar clustering in a larger sample of nearby galaxies requires an automated method for selecting stellar groupings. I have therefore developed an automated stellar grouping selection method that is sensitive to stellar clustering on all size scales. Using the Source Extractor software package on Gaussian-blurred images of NGC 4214, with the annular surface brightness used to determine the characteristic size of each cluster or association, I eliminate much of the size and density bias intrinsic to other methods. This automated method was tested on the nearby dwarf irregular galaxy NGC 4214 and can detect stellar groupings with sizes ranging from compact clusters to stellar complexes. In future work, the automated selection method developed in this dissertation will be used to identify stellar groupings in a set of nearby galaxies to determine whether the size scales for stellar clustering are uniform in the nearby universe or depend on the local galactic environment. Once the stellar clusters and associations have been identified and age-dated, this information can be used to deduce disruption times from the age distribution as a function of the position of the stellar grouping within the galaxy, the size of the cluster or association, and the morphological type of the galaxy. The implications of these results for galaxy formation and evolution are discussed.
Contributors: Kaleida, Catherine (Author) / Scowen, Paul A. (Thesis advisor) / Windhorst, Rogier A. (Thesis advisor) / Jansen, Rolf A. (Committee member) / Timmes, Francis X. (Committee member) / Scannapieco, Evan (Committee member) / Arizona State University (Publisher)
Created: 2011
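
The automated method above is built on Source Extractor runs over Gaussian-blurred images. As a loose stand-in, and not the dissertation's pipeline, the sketch below blurs a toy star field at several scales with scipy and labels contiguous regions above a threshold, which is the same smooth-then-segment idea at its crudest.

```python
# Crude stand-in for the smooth-then-segment idea: blur a toy star field at several
# scales and label contiguous bright regions at each scale. The dissertation uses
# Source Extractor and annular surface brightness; this shows only the bare concept.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(2)
image = np.zeros((200, 200))
stars = rng.integers(20, 180, size=(300, 2))          # toy "stars" scattered on the frame
stars[:150] = stars[:150] // 3 + 30                   # squeeze half of them into a clump
image[stars[:, 0], stars[:, 1]] += 1.0

for sigma in (1, 4, 16):                              # compact clusters ... complexes
    blurred = ndimage.gaussian_filter(image, sigma=sigma)
    mask = blurred > blurred.mean() + 2 * blurred.std()
    labels, n_groups = ndimage.label(mask)
    sizes = ndimage.sum(mask, labels, index=range(1, n_groups + 1))
    print(f"sigma={sigma:>2}: {n_groups} groupings, median area {np.median(sizes):.0f} px"
          if n_groups else f"sigma={sigma:>2}: no groupings above threshold")
```
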
Description
The ever-changing economic landscape has forced many companies to re-examine their supply chains. Global resourcing and outsourcing of processes is a strategy many organizations have adopted to reduce cost and increase their global footprint. This has, however, resulted in increased process complexity and reduced customer satisfaction. In order to meet and exceed customer expectations, many companies are forced to improve quality and on-time delivery, and have looked to Lean Six Sigma as an approach to enable process improvement. The Lean Six Sigma literature is rich in deployment strategies; however, there is a general lack of a mathematical approach to deploying Lean Six Sigma in a global enterprise, including both project identification and prioritization. The research presented here is two-fold. First, a process characterization framework is presented to evaluate processes based on eight characteristics. An unsupervised learning technique, using clustering algorithms, is then utilized to group processes that are Lean Six Sigma conducive. The approach helps Lean Six Sigma deployment champions identify key areas within the business on which to focus a deployment. A case study is presented in which 33% of the processes were found to be Lean Six Sigma conducive. Second, having identified the parts of the business that are Lean Six Sigma conducive, the next steps are to formulate and prioritize a portfolio of projects. Very often the deployment champion is faced with selecting a portfolio of Lean Six Sigma projects that meets multiple objectives, which could include maximizing productivity, customer satisfaction, or return on investment, while meeting certain budgetary constraints. A multi-period 0-1 knapsack problem is presented that maximizes the expected net savings of the Lean Six Sigma portfolio over the life cycle of the deployment. Finally, a case study is presented that demonstrates the application of the model in a large multinational company. Traditionally, Lean Six Sigma found its roots in manufacturing; the research presented in this dissertation also emphasizes the applicability of the methodology to the non-manufacturing space. Additionally, a comparison is conducted between manufacturing and non-manufacturing processes to highlight the challenges of deploying the methodology in both spaces.
Contributors: Duarte, Brett Marc (Author) / Fowler, John W. (Thesis advisor) / Montgomery, Douglas C. (Thesis advisor) / Shunk, Dan (Committee member) / Borror, Connie (Committee member) / Konopka, John (Committee member) / Arizona State University (Publisher)
Created: 2011
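
The portfolio-selection step above is cast as a multi-period 0-1 knapsack. A bare-bones version of that formulation, with invented project data and a simple per-period budget constraint standing in for whatever constraints the dissertation actually uses, could be written with PuLP as follows.

```python
# Bare-bones multi-period 0-1 knapsack: pick projects to maximize expected net
# savings subject to a budget in each period. Project data here are invented.
from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum

periods = [0, 1, 2]
budget = {0: 100.0, 1: 80.0, 2: 80.0}                 # cost budget per period
# name: (expected net savings, cost incurred in each period)
projects = {
    "cycle-time":   (120.0, {0: 60.0, 1: 20.0, 2: 0.0}),
    "defect-rate":  (90.0,  {0: 30.0, 1: 40.0, 2: 10.0}),
    "invoice-flow": (70.0,  {0: 20.0, 1: 30.0, 2: 30.0}),
    "supplier-otd": (60.0,  {0: 50.0, 1: 10.0, 2: 40.0}),
}

x = LpVariable.dicts("select", list(projects), cat=LpBinary)
prob = LpProblem("lss_portfolio", LpMaximize)
prob += lpSum(projects[p][0] * x[p] for p in projects)            # expected net savings
for t in periods:                                                 # budget in each period
    prob += lpSum(projects[p][1][t] * x[p] for p in projects) <= budget[t]

prob.solve()
print([p for p in projects if x[p].value() == 1])
```
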
Description
Malware forensics is a time-consuming process that involves a significant amount of data collection. To ease the load on security analysts, many attempts have been made to automate the intelligence-gathering process and provide a centralized search interface. Some of these solutions map existing relations between threats and can discover new intelligence by identifying correlations in the data. However, such systems generally treat each unique malware sample as its own distinct threat. This fails to model the real malware landscape, in which so many "new" samples are actually variants of samples that have already been discovered. Were there some way to reliably determine whether two malware samples belong to the same family, intelligence for one sample could be applied to any sample in the family, greatly reducing the complexity of intelligence synthesis. Clustering is a common big-data approach for grouping data samples that share common features, and it has been applied in several recent papers to identify related malware. It therefore has the potential to simplify the intelligence synthesis process as described. However, existing threat intelligence systems do not use malware clustering. In this paper, we attempt to design a highly accurate malware clustering system, with the ultimate goal of integrating it into a threat intelligence platform. Toward this end, we explore the main considerations in designing such a system: how to extract features for comparing malware, and how to use those features for accurate clustering. We then create an experimental clustering system and evaluate its effectiveness using two different clustering algorithms.
Contributors: Smith, Joshua Michael (Author) / Ahn, Gail-Joon (Thesis director) / Zhao, Ziming (Committee member) / School of Mathematical and Statistical Sciences (Contributor) / Computer Science and Engineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created: 2017-05
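
The thesis evaluates clustering over features extracted from malware samples. As a rough illustration only, with the feature scheme, algorithms, and evaluation metric all being assumptions rather than the thesis's design, the sketch below hashes byte 4-grams into a fixed-length vector and compares two clustering algorithms against known family labels.

```python
# Rough illustration: hash byte 4-grams of each sample into a fixed-length feature
# vector, cluster with two algorithms, and score against known family labels.
# The feature scheme and algorithms are illustrative, not the thesis's design.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def byte_ngram_features(sample: bytes, n=4, dims=256):
    vec = np.zeros(dims)
    for i in range(len(sample) - n + 1):
        vec[hash(sample[i:i + n]) % dims] += 1          # hashed n-gram counts
    return vec / max(vec.sum(), 1)                      # normalize by total n-grams

# Toy corpus: two "families" built from shared byte patterns plus small mutations.
rng = np.random.default_rng(3)
family_a = [bytes([1, 2, 3, 4] * 50) + rng.bytes(20) for _ in range(10)]
family_b = [bytes([9, 9, 7, 7] * 50) + rng.bytes(20) for _ in range(10)]
X = np.array([byte_ngram_features(s) for s in family_a + family_b])
truth = [0] * 10 + [1] * 10

for algo in (KMeans(n_clusters=2, n_init=10, random_state=0),
             AgglomerativeClustering(n_clusters=2)):
    labels = algo.fit_predict(X)
    print(type(algo).__name__, "ARI =", round(adjusted_rand_score(truth, labels), 2))
```
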
Description
Recent technological advances enable the collection of complex, heterogeneous, and high-dimensional data in biomedical domains. The increasing availability of high-dimensional biomedical data creates the need for new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover patterns, and improve decision making. All of the proposed methods can generalize to other industrial fields.

The first topic of this dissertation focuses on data clustering. Data clustering is often the first step in analyzing a dataset without label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.

The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.

The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification, with emphasis on both accuracy and interpretability. GCRNN contains a convolutional network component to extract high-level features and a recurrent network component to enhance the modeling of temporal characteristics. A feed-forward fully connected network with sparse group lasso regularization is used to generate the final classification and provide good interpretability.

The last topic centers on dimensionality reduction methods for time series data. A good dimensionality reduction method is important for storage, decision making, and pattern visualization of time series data. The CRNN autoencoder is proposed to not only achieve low reconstruction error but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.
Contributors: Lin, Sangdi (Author) / Runger, George C. (Thesis advisor) / Kocher, Jean-Pierre A. (Committee member) / Pan, Rong (Committee member) / Escobedo, Adolfo R. (Committee member) / Arizona State University (Publisher)
Created: 2018
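
The GCRNN paragraph above describes a convolutional front end, a recurrent component, and a regularized fully connected classifier. The sketch below wires up that general shape in Keras; layer sizes are arbitrary and plain L1 regularization stands in for the sparse group lasso, so this is only the architectural idea, not the dissertation's model.

```python
# Sketch of the general CNN -> RNN -> regularized dense shape described above.
# Layer sizes are arbitrary and plain L1 stands in for the sparse group lasso.
import numpy as np
from tensorflow.keras import layers, models, regularizers

n_steps, n_channels, n_classes = 128, 3, 4

model = models.Sequential([
    layers.Input(shape=(n_steps, n_channels)),
    layers.Conv1D(32, kernel_size=7, activation="relu", padding="same"),  # local features
    layers.MaxPooling1D(2),
    layers.GRU(32),                                                       # temporal modeling
    layers.Dense(n_classes, activation="softmax",
                 kernel_regularizer=regularizers.l1(1e-4)),               # sparse classifier
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Tiny smoke test on random data just to show the expected shapes.
X = np.random.rand(64, n_steps, n_channels).astype("float32")
y = np.random.randint(0, n_classes, size=64)
model.fit(X, y, epochs=1, batch_size=16, verbose=0)
print(model.predict(X[:2], verbose=0).shape)   # -> (2, 4)
```
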
Description
This thesis aims to explore the language of different bodies in the field of dance by analyzing the habitual patterns of dancers from different backgrounds and vernaculars. Contextually, the term habitual patterns is defined as the postures or poses that tend to re-appear, often unintentionally, as the dancer performs improvisational dance. The focus lies in exposing the movement vocabulary of a dancer to reveal his/her unique fingerprint.

The proposed approach for uncovering these movement patterns is to use a clustering technique, mainly k-means. In addition to a static method of analysis, this paper uses an online method of clustering using a streaming variant of k-means that integrates into the flow of components that can be used in a real-time interactive dance performance. The computational system is trained by the dancer to discover identifying patterns and therefore it enables a feedback loop resulting in a rich exchange between dancer and machine. This can help break a dancer's tendency to create similar postures, explore larger kinespheric space, and invent movement beyond their current capabilities.

This paper describes a project that distinguishes itself in that it uses a custom database that is curated for the purpose of highlighting the similarities and differences between various movement forms. It puts particular emphasis on the process of choosing source movement qualitatively, before the technological capture process begins.
Contributors: Iyengar, Varsha (Author) / Xin Wei, Sha (Thesis advisor) / Turaga, Pavan (Committee member) / Coleman, Grisha (Committee member) / Arizona State University (Publisher)
Created: 2016
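
The streaming k-means idea in the preceding abstract, updating cluster centers as new pose frames arrive during a performance, can be approximated with scikit-learn's MiniBatchKMeans and partial_fit; the joint-coordinate feature layout below is an assumed pose representation, not the thesis's capture pipeline.

```python
# Sketch: online clustering of incoming pose frames with MiniBatchKMeans.partial_fit,
# a stand-in for the streaming k-means variant described in the abstract.
# The 15-joint (x, y, z) feature layout is an assumed pose representation.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

N_JOINTS, N_POSTURES = 15, 6
model = MiniBatchKMeans(n_clusters=N_POSTURES, random_state=0, n_init=3)

def pose_stream(n_batches=50, batch_size=30, seed=4):
    """Mock motion-capture stream: batches of flattened (x, y, z) joint coordinates."""
    rng = np.random.default_rng(seed)
    anchors = rng.normal(size=(N_POSTURES, N_JOINTS * 3))       # habitual postures
    for _ in range(n_batches):
        which = rng.integers(0, N_POSTURES, size=batch_size)
        yield anchors[which] + rng.normal(scale=0.05, size=(batch_size, N_JOINTS * 3))

for batch in pose_stream():
    model.partial_fit(batch)          # update cluster centers as frames arrive

# During a performance, label each new frame with its nearest habitual posture.
live_frame = next(pose_stream(n_batches=1, batch_size=1, seed=5))
print("posture id:", int(model.predict(live_frame)[0]))
```
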