Matching Items (11)

Adaptive learning and unsupervised clustering of immune responses using microarray random sequence peptides

Description

Immunosignaturing is a medical test for assessing the health status of a patient by applying microarrays of random sequence peptides to determine the patient's immune fingerprint by associating antibodies from a biological sample to immune responses. The immunosignature measurements can potentially provide pre-symptomatic diagnosis for infectious diseases or detection of biological threats. Currently, traditional bioinformatics tools, such as data mining classification algorithms, are used to process the large amount of peptide microarray data. However, these methods generally require training data and do not adapt to changing immune conditions or additional patient information. This work proposes advanced processing techniques to improve the classification and identification of single and multiple underlying immune response states embedded in immunosignatures, making it possible to detect both known and previously unknown diseases or biothreat agents. Novel adaptive learning methodologies for unsupervised and semi-supervised clustering integrated with immunosignature feature extraction approaches are proposed. The techniques are based on extracting novel stochastic features from microarray binding intensities and use Dirichlet process Gaussian mixture models to adaptively cluster the immunosignatures in the feature space. This learning-while-clustering approach allows continuous discovery of antibody activity by adaptively detecting new disease states, with limited a priori disease or patient information. A beta process factor analysis model to determine underlying patient immune responses is also proposed to further improve the adaptive clustering performance by formatting new relationships between patients and antibody activity. 
In order to extend the clustering methods for diagnosing multiple states in a patient, the adaptive hierarchical Dirichlet process is integrated with modified beta process factor analysis latent feature modeling to identify relationships between patients and infectious agents. The use of Bayesian nonparametric adaptive learning techniques allows for further clustering if additional patient data is received. Significant improvements in feature identification and immune response clustering are demonstrated using samples from patients with different diseases.
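
As a rough illustration of why a Dirichlet process mixture can keep discovering new clusters, the sketch below computes the prior assignment probabilities under the Chinese restaurant process view of a Dirichlet process. It is not the thesis's method: the function name, the concentration value `alpha`, and the cluster sizes are all hypothetical, and the likelihood terms from the Gaussian mixture and the immunosignature features are left out.

```python
def crp_assignment_probs(cluster_sizes, alpha):
    """Prior probabilities for assigning the next sample under a
    Chinese restaurant process with concentration `alpha`.

    Returns one probability per existing cluster, followed by the
    probability of opening a new cluster as the last element.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    total = sum(cluster_sizes) + alpha
    # Existing clusters attract new samples in proportion to their size.
    probs = [size / total for size in cluster_sizes]
    # A new cluster is always possible, with weight alpha.
    probs.append(alpha / total)
    return probs


# Hypothetical example: two known clusters holding 6 and 3 samples.
example_probs = crp_assignment_probs([6, 3], alpha=1.0)
```

Because the last entry is never zero, a model built on this prior can assign a new sample to a previously unseen cluster, which is the property the abstract relies on when it mentions detecting new disease states.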

Date Created
2013

Cartel Coffee Lab: Customer Analysis, Segmentation, & Marketing Impacts

Description

This paper presents a cluster analysis and data-driven marketing recommendations for Cartel Coffee Lab, a local coffee shop. Building a loyal customer base is the biggest asset to a company’s success. This is especially crucial because Cartel’s main location is near a fast-changing environment, a university, and such a hub produces a constant influx and churn of customers. Our team focused on analyzing customer buying patterns to determine the different customer segments within Cartel Coffee Lab. Our methods included creating temporary databases in Microsoft SQL Server and performing a cluster analysis in SAS Enterprise Miner. The cluster analysis results were then used to propose data-driven marketing recommendations for each customer segment in order to provide a well-designed value proposition. Additionally, a loyalty program was recommended in order to increase customer profitability and loyalty, depending on the cluster segment. By identifying the current situation, business questions, research methodology, analysis results, and plan of action, this creative project covers multiple key areas of a business analytics project.

Date Created
2020-05

Visual analytics for spatiotemporal cluster analysis

Description

Traditionally, visualization is one of the most important and commonly used methods of generating insight into large-scale data. Particularly for spatiotemporal data, the translation of such data into a visual form allows users to quickly see patterns, explore summaries, and relate domain knowledge about underlying geographical phenomena that would not be apparent in tabular form. However, several critical challenges arise when visualizing and exploring these large spatiotemporal datasets. While the underlying geographical component of the data lends itself well to univariate visualization in the form of traditional cartographic representations (e.g., choropleth, isopleth, dasymetric maps), as the data becomes multivariate, cartographic representations become more complex. To simplify the visual representations, analytical methods such as clustering and feature extraction are often applied as part of the classification phase. The automatic classification can then be rendered onto a map; however, one common issue in data classification is that items near a classification boundary are often mislabeled.

This thesis explores methods to augment the automated spatial classification by utilizing interactive machine learning as part of the cluster creation step. First, this thesis explores the design space for spatiotemporal analysis through the development of a comprehensive data wrangling and exploratory data analysis platform. Second, this system is augmented with a novel method for evaluating the visual impact of edge cases for multivariate geographic projections. Finally, system features and functionality are demonstrated through a series of case studies, with key features including similarity analysis, multivariate clustering, and novel visual support for cluster comparison.

Date Created
2016

Public health surveillance in high-dimensions with supervised learning

Description

Public health surveillance is a special case of the general problem where counts (or rates) of events are monitored for changes. Modern data complements event counts with many additional measurements (such as geographic, demographic, and others) that comprise high-dimensional covariates. This leads to an important challenge to detect a change that only occurs within a region, initially unspecified, defined by these covariates. Current methods are typically limited to spatial and/or temporal covariate information and often fail to use all the information available in modern data that can be paramount in unveiling these subtle changes. Additional complexities associated with modern health data that are often not accounted for by traditional methods include: covariates of mixed type, missing values, and high-order interactions among covariates. This work proposes a transform of public health surveillance to supervised learning, so that an appropriate learner can inherently address all the complexities described previously. At the same time, quantitative measures from the learner can be used to define signal criteria to detect changes in rates of events. A Feature Selection (FS) method is used to identify covariates that contribute to a model and to generate a signal. A measure of statistical significance is included to control false alarms. An alternative Percentile method identifies the specific cases that lead to changes using class probability estimates from tree-based ensembles. This second method is intended to be less computationally intensive and significantly simpler to implement. Finally, a third method labeled Rule-Based Feature Value Selection (RBFVS) is proposed for identifying the specific regions in high-dimensional space where the changes are occurring. Results on simulated examples are used to compare the FS method and the Percentile method. Note this work emphasizes the application of the proposed methods on public health surveillance. 
Nonetheless, these methods can easily be extended to a variety of applications where counts (or rates) of events are monitored for changes. Such problems commonly occur in domains such as manufacturing, economics, environmental systems, engineering, as well as in public health.

Date Created
2010

Controversy analysis: clustering and ranking polarized networks with visualizations

Description

The US Senate is the venue of political debates where federal bills are formed and voted on. Senators show their support for or opposition to the bills with their votes. This information makes it possible to extract the polarity of the senators. Similarly, the blogosphere plays an increasingly important role as a forum for public debate. Authors display sentiment toward issues, organizations, or people using natural language.

In this research, given a mixed set of senators/blogs debating on a set of political issues from opposing camps, I use signed bipartite graphs for modeling debates, and I propose an algorithm for partitioning both the opinion holders (senators or blogs) and the issues (bills or topics) comprising the debate into binary opposing camps. Simultaneously, my algorithm scales the entities on a univariate scale. Using this scale, a researcher can identify moderate and extreme senators/blogs within each camp, and polarizing versus unifying issues. Through performance evaluations I show that my proposed algorithm provides an effective solution to the problem, and performs much better than existing baseline algorithms adapted to solve this new problem. In my experiments, I used both real data from the political blogosphere and US Congress records, as well as synthetic data obtained by varying the polarization and degree distribution of the vertices of the graph to show the robustness of my algorithm.

I also applied my algorithm to all the terms of the US Senate to date for longitudinal analysis and developed a web-based interactive user interface, www.PartisanScale.com, to visualize the analysis.

US politics is most often polarized with respect to the left/right alignment of the entities. However, certain issues do not reflect the polarization due to political parties, but show a split correlating with the demographics of the senators, or simply receive consensus. I propose a hierarchical clustering algorithm that identifies groups of bills that share the same polarization characteristics. I developed a web-based interactive user interface, www.ControversyAnalysis.com, to visualize the clusters while providing a synopsis through distribution charts, word clouds, and heat maps.

Date Created
2015

On feature selection stability: a data perspective

Description

The rapid growth of high-throughput technologies over the last few decades has made manual processing of the generated data impracticable. Even worse, machine learning and data mining techniques have seemed paralyzed by these massive datasets. High dimensionality is one of the most common challenges for machine learning and data mining tasks. Feature selection aims to reduce dimensionality by selecting a small subset of the features that performs at least as well as the full feature set. Generally, learning performance (e.g., classification accuracy) and algorithm complexity are used to measure the quality of the algorithm. Recently, the stability of feature selection algorithms has gained increasing attention as a new indicator, because an algorithm should select similar subsets of features each time it is run on the same dataset, even in the presence of a small amount of perturbation. In order to cure the selection stability issue, we should first understand the cause of instability. In this dissertation, we investigate the causes of instability in high-dimensional datasets using well-known feature selection algorithms. We found that stability is mostly data-dependent. Based on these findings, we propose a framework to improve selection stability by addressing these main causes. In particular, we found that data noise greatly impacts both stability and learning performance, so we propose reducing it in order to improve both. However, current noise reduction approaches are not able to distinguish between data noise and variation in samples from different classes. We overcome this limitation by using supervised noise reduction via low-rank matrix approximation (SLRMA). The proposed framework has proved successful on different types of high-dimensional datasets, such as microarray and image datasets.
However, this framework cannot handle unlabeled data; hence, we propose Local SVD to overcome this limitation.

Date Created
2013

Analysis of habitual patterns in vernacular movement

Description

This thesis aims to explore the language of different bodies in the field of dance by analyzing the habitual patterns of dancers from different backgrounds and vernaculars. Contextually, the term habitual patterns is defined as the postures or poses that tend to re-appear, often unintentionally, as the dancer performs improvisational dance. The focus lies in exposing the movement vocabulary of a dancer to reveal his/her unique fingerprint.

The proposed approach for uncovering these movement patterns is to use a clustering technique, mainly k-means. In addition to a static method of analysis, this paper uses an online method of clustering using a streaming variant of k-means that integrates into the flow of components that can be used in a real-time interactive dance performance. The computational system is trained by the dancer to discover identifying patterns and therefore it enables a feedback loop resulting in a rich exchange between dancer and machine. This can help break a dancer’s tendency to create similar postures, explore larger kinespheric space and invent movement beyond their current capabilities.

This paper describes a project that distinguishes itself in that it uses a custom database that is curated for the purpose of highlighting the similarities and differences between various movement forms. It puts particular emphasis on the process of choosing source movement qualitatively, before the technological capture process begins.
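
The abstract does not say which streaming variant of k-means is used, so the following is only a minimal sketch of one common choice: the sequential (MacQueen-style) update, where each incoming point moves its nearest centroid by a step of 1/count. The class name, the seed centroids, and the 2-D "posture" points are hypothetical stand-ins for real motion-capture features.

```python
import math


class StreamingKMeans:
    """Sequential k-means: centroids are updated one point at a time."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        # Seeds are not counted as observations, so the first point
        # assigned to a cluster replaces its seed centroid entirely.
        self.counts = [0] * len(self.centroids)

    def nearest(self, point):
        """Index of the centroid closest to `point` (Euclidean distance)."""
        dists = [math.dist(point, c) for c in self.centroids]
        return dists.index(min(dists))

    def update(self, point):
        """Assign `point` to its nearest cluster and shift that centroid."""
        k = self.nearest(point)
        self.counts[k] += 1
        rate = 1.0 / self.counts[k]  # running-mean step size
        self.centroids[k] = [
            c + rate * (x - c) for c, x in zip(self.centroids[k], point)
        ]
        return k


# Hypothetical stream of 2-D posture features arriving one frame at a time.
model = StreamingKMeans([[0.0, 0.0], [10.0, 10.0]])
stream = [(1.0, 1.0), (9.0, 9.0), (0.0, 1.0), (10.0, 9.0)]
labels = [model.update(p) for p in stream]
```

Each centroid ends up as the running mean of the points assigned to it, and no past frames need to be stored, which is what makes this style of update suitable for a real-time setting.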

Date Created
2016

Target discrimination against clutter based on unsupervised clustering and sequential Monte Carlo tracking

Description

Radar performance in detecting a target and estimating its parameters can deteriorate rapidly in the presence of high clutter. This is because radar measurements due to clutter returns can be falsely detected as if originating from the actual target. Various data association methods and multiple hypothesis filtering approaches have been considered to solve this problem. Such methods, however, can be computationally intensive for real-time radar processing. This work proposes a new approach that is based on the unsupervised clustering of target and clutter detections before target tracking using particle filtering. In particular, Gaussian mixture modeling is first used to separate detections into two distinct Gaussian mixtures. Using eigenvector analysis, the eccentricity of the covariance matrix of each Gaussian mixture is computed and compared to threshold values that are obtained a priori. The thresholding allows only target detections to be used for target tracking. Simulations demonstrate the performance of the new algorithm and compare it with using k-means for clustering instead of Gaussian mixture modeling.
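
The abstract does not give its definition of eccentricity, so the sketch below assumes the usual ellipse eccentricity, sqrt(1 - lambda_min / lambda_max), computed from the eigenvalues of a 2x2 covariance matrix. The function name and the example matrices are hypothetical, and the a priori threshold values and the rule for which side of the threshold counts as a target are not reproduced here.

```python
import math


def covariance_eccentricity(cov):
    """Eccentricity of the ellipse described by a 2x2 covariance matrix.

    Uses the closed-form eigenvalues of a symmetric 2x2 matrix
    [[a, b], [b, c]] and returns sqrt(1 - lambda_min / lambda_max).
    """
    (a, b), (b2, c) = cov
    if not math.isclose(b, b2):
        raise ValueError("covariance matrix must be symmetric")
    mean = (a + c) / 2.0
    spread = math.hypot((a - c) / 2.0, b)
    lam_max = mean + spread
    lam_min = max(mean - spread, 0.0)  # guard against tiny negative round-off
    if lam_max <= 0.0:
        raise ValueError("covariance matrix must have positive variance")
    return math.sqrt(1.0 - lam_min / lam_max)


# Hypothetical cluster covariances: one elongated, one circular.
elongated = covariance_eccentricity([[4.0, 0.0], [0.0, 1.0]])
circular = covariance_eccentricity([[1.0, 0.0], [0.0, 1.0]])
```

A value near 0 indicates a roughly circular cluster and a value near 1 a strongly elongated one, so a single scalar per mixture component can be compared against a threshold.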

Date Created
2016

Automatic segmentation of single neurons recorded by wide-field imaging using frequency domain features and clustering tree

Description

Recent experiments showed that wide-field imaging at millimeter scale can record hundreds of neurons in the brains of behaving mice. Monitoring hundreds of individual neurons at a high frame rate provides a promising tool for discovering spatiotemporal features of large neural networks. However, processing the massive data sets is impossible without automated procedures. Thus, this thesis aims at developing a new tool to automatically segment and track individual neurons. The new method employs two major ideas: feature extraction based on the power spectral density (PSD) of single-neuron temporal activity, and a clustering tree to separate overlapping cells. To address issues associated with high-resolution imaging of a large recording area, in-focus and out-of-focus areas were analyzed separately. A static segmentation with a fixed PSD thresholding method is applied to the in-focus visual field. A dynamic segmentation that compares the maximum PSD with surrounding pixels is applied to the out-of-focus area. Both approaches helped remove irrelevant pixels in the background. After detection of potential single cells, some of which appeared in groups due to overlapping cells in the image, a hierarchical clustering algorithm is applied to separate them. The hierarchical clustering uses the correlation coefficient as a distance measure to group similar pixels into single cells. As such, overlapping cells can be separated. We tested the entire algorithm using two real recordings, with the respective ground truth carefully determined by manual inspection. The results show high accuracy on the tested datasets while the false positive error is kept within an acceptable range. Furthermore, results indicate robustness of the algorithm when applied to different image sequences.
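
The grouping step can be sketched as agglomerative clustering on a correlation-based distance. The abstract does not state the linkage rule or the stopping criterion, so average linkage, the distance 1 - r, and the cutoff `max_distance` below are assumptions, as are the function names and the toy pixel traces.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant trace carries no correlation information
    return cov / math.sqrt(vx * vy)


def cluster_by_correlation(traces, max_distance):
    """Average-linkage agglomerative clustering with distance 1 - r.

    Merging stops once the closest pair of clusters is farther apart
    than `max_distance`. Returns lists of trace indices.
    """
    n = len(traces)
    dist = [[1.0 - pearson(traces[i], traces[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = len(clusters[a]) * len(clusters[b])
                d = sum(dist[i][j] for i in clusters[a]
                        for j in clusters[b]) / pairs
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > max_distance:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical pixel traces: pixels 0 and 1 fire together, as do 2 and 3.
pixel_traces = [
    [0, 1, 0, 1, 0, 1],
    [0, 2, 0, 2, 0, 2],
    [1, 0, 1, 0, 1, 0],
    [2, 0, 2, 0, 2, 0],
]
cells = cluster_by_correlation(pixel_traces, max_distance=0.5)
```

Pixels whose activity rises and falls together end up in one group even when their amplitudes differ, since correlation ignores scale; this is the property that lets overlapping cells with distinct firing patterns be told apart.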

Date Created
2016

A Data Mining Approach to Modeling Customer Preference: A Case Study of Intel Corporation

Description

Understanding customer preference is crucial for new product planning and marketing decisions. This thesis explores how historical data can be leveraged to understand and predict customer preference. It presents a decision support framework that provides a holistic view of customer preference by following a two-phase procedure. Phase 1 uses cluster analysis to create product profiles, from which customer profiles are derived. Phase 2 then delves into each of the customer profiles and investigates the causality behind their preferences using Bayesian networks. The thesis illustrates the working of the framework using the case of Intel Corporation, the world’s largest semiconductor manufacturing company.

Date Created
2017