Matching Items (11)

Cartel Coffee Lab: Customer Analysis, Segmentation, & Marketing Impacts

Description

This paper presents a cluster analysis and data-driven marketing recommendations for Cartel Coffee Lab, a local coffee shop. A loyal customer base is one of the greatest assets a company can build, and it is especially crucial for Cartel, whose main location sits near a fast-changing environment: a university. Such a hub produces a constant influx of new customers alongside high customer churn. Our team focused on analyzing customer buying patterns to determine the different customer segments within Cartel Coffee Lab. Our methods included creating temporary databases in Microsoft SQL Server and performing a cluster analysis in SAS Enterprise Miner. The cluster analysis results were then used to propose data-driven marketing recommendations for each customer segment, providing a well-designed value proposition. Additionally, a loyalty program was recommended to increase customer profitability and loyalty, tailored to each cluster segment. By identifying the current situation, business questions, research methodology, analysis results, and plan of action, this creative project covers the key areas of a business analytics project.
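The segmentation step described above was performed in SAS Enterprise Miner; as an illustrative stand-in only, the same idea can be sketched in Python with k-means over synthetic RFM (recency, frequency, monetary) features. All data, feature names, and the cluster count here are hypothetical.

```python
# Hypothetical sketch of customer segmentation by RFM features with k-means.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic RFM table: recency (days), frequency (visits), monetary (spend)
rfm = np.column_stack([
    rng.integers(1, 90, 200),
    rng.integers(1, 40, 200),
    rng.uniform(5, 400, 200),
])

X = StandardScaler().fit_transform(rfm)          # put features on one scale
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

for k in range(4):                               # profile each segment
    print(k, rfm[labels == k].mean(axis=0).round(1))
```

Each printed row is a segment profile that could then drive a targeted recommendation.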

Date Created
  • 2020-05

Adaptive learning and unsupervised clustering of immune responses using microarray random sequence peptides

Description

Immunosignaturing is a medical test for assessing the health status of a patient by applying microarrays of random sequence peptides to determine the patient's immune fingerprint by associating antibodies from a biological sample to immune responses. The immunosignature measurements can potentially provide pre-symptomatic diagnosis of infectious diseases or detection of biological threats. Currently, traditional bioinformatics tools, such as data mining classification algorithms, are used to process the large amount of peptide microarray data. However, these methods generally require training data and do not adapt to changing immune conditions or additional patient information. This work proposes advanced processing techniques to improve the classification and identification of single and multiple underlying immune response states embedded in immunosignatures, making it possible to detect both known and previously unknown diseases or biothreat agents. Novel adaptive learning methodologies for unsupervised and semi-supervised clustering integrated with immunosignature feature extraction approaches are proposed. The techniques are based on extracting novel stochastic features from microarray binding intensities and use Dirichlet process Gaussian mixture models to adaptively cluster the immunosignatures in the feature space. This learning-while-clustering approach allows continuous discovery of antibody activity by adaptively detecting new disease states, with limited a priori disease or patient information. A beta process factor analysis model to determine underlying patient immune responses is also proposed to further improve the adaptive clustering performance by forming new relationships between patients and antibody activity.
In order to extend the clustering methods for diagnosing multiple states in a patient, the adaptive hierarchical Dirichlet process is integrated with modified beta process factor analysis latent feature modeling to identify relationships between patients and infectious agents. The use of Bayesian nonparametric adaptive learning techniques allows for further clustering if additional patient data is received. Significant improvements in feature identification and immune response clustering are demonstrated using samples from patients with different diseases.
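The core clustering idea — a Dirichlet process Gaussian mixture that discovers the number of states from the data — can be sketched with scikit-learn's truncated variational version. The feature vectors below are synthetic stand-ins, not immunosignature data, and this is not the dissertation's implementation.

```python
# Minimal sketch of adaptive clustering with a Dirichlet process Gaussian
# mixture: the number of latent states is not fixed in advance.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(1)
# Three latent "immune states", unknown to the model in advance
X = np.vstack([rng.normal(m, 0.3, size=(60, 2)) for m in (0.0, 2.0, 4.0)])

dpgmm = BayesianGaussianMixture(
    n_components=10,                      # truncation level, not the true K
    weight_concentration_prior_type="dirichlet_process",
    random_state=1,
).fit(X)

labels = dpgmm.predict(X)
print("states discovered:", len(set(labels)))   # effective clusters used
```

Because the Dirichlet process prior shrinks unused components toward zero weight, the model effectively uses only as many clusters as the data support.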

Date Created
  • 2013

On feature selection stability: a data perspective

Description

The rapid growth of high-throughput technologies over the last few decades has made manual processing of the generated data impracticable. Even worse, machine learning and data mining techniques seem paralyzed against these massive datasets. High dimensionality is one of the most common challenges for machine learning and data mining tasks. Feature selection aims to reduce dimensionality by selecting a small subset of the features that performs at least as well as the full feature set. Generally, learning performance, e.g. classification accuracy, and algorithm complexity are used to measure the quality of an algorithm. Recently, the stability of feature selection algorithms has gained increasing attention as a new indicator, owing to the necessity of selecting similar subsets of features each time the algorithm is run on the same dataset, even in the presence of a small amount of perturbation. To cure the selection instability issue, we should first understand its causes. In this dissertation, we investigate the causes of instability in high-dimensional datasets using well-known feature selection algorithms. We found that stability is mostly data-dependent. Based on these findings, we propose a framework to improve selection stability by addressing these main causes. In particular, we found that data noise greatly impacts both stability and learning performance, so we propose to reduce it in order to improve both. However, current noise reduction approaches are not able to distinguish between data noise and variation among samples from different classes. We overcome this limitation with Supervised noise reduction via Low Rank Matrix Approximation (SLRMA). The proposed framework has proved successful on different types of high-dimensional datasets, such as microarray and image datasets. However, this framework cannot handle unlabeled data; hence, we propose Local SVD to overcome this limitation.
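The stability notion the abstract relies on can be made concrete with a small sketch (not the dissertation's code): run a simple filter-style feature selector on bootstrapped copies of a dataset and average the Jaccard overlap of the selected top-k feature sets. The correlation filter and all parameters here are illustrative assumptions.

```python
# Illustrative sketch: selection stability as the mean Jaccard overlap of
# top-k feature sets chosen from perturbed (bootstrapped) datasets.
import numpy as np

def top_k_features(X, y, k):
    # Rank features by absolute correlation with the label (a simple filter)
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return set(np.argsort(scores)[-k:])

def stability(X, y, k=10, runs=20, seed=0):
    rng = np.random.default_rng(seed)
    sets = []
    for _ in range(runs):
        idx = rng.integers(0, len(y), len(y))     # bootstrap resample
        sets.append(top_k_features(X[idx], y[idx], k))
    pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
    return np.mean([len(a & b) / len(a | b) for a, b in pairs])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = (X[:, :5].sum(axis=1) > 0).astype(float)      # only 5 informative features
print("stability:", round(stability(X, y), 3))
```

A stability near 1 means the selector picks nearly the same features under perturbation; noisy data pushes the score down, which is the phenomenon the proposed framework targets.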

Date Created
  • 2013

Target discrimination against clutter based on unsupervised clustering and sequential Monte Carlo tracking

Description

Radar performance in detecting a target and estimating its parameters can deteriorate rapidly in the presence of heavy clutter, because radar measurements due to clutter returns can be falsely detected as if originating from the actual target. Various data association methods and multiple hypothesis filtering approaches have been considered to solve this problem. Such methods, however, can be computationally intensive for real-time radar processing. This work proposes a new approach based on the unsupervised clustering of target and clutter detections before target tracking using particle filtering. In particular, Gaussian mixture modeling is first used to separate the detections into two distinct Gaussian components. Using eigenvector analysis, the eccentricities of the covariance matrices of the Gaussian components are computed and compared to threshold values obtained a priori. The thresholding allows only target detections to be used for target tracking. Simulations demonstrate the performance of the new algorithm and compare it with using k-means for clustering instead of Gaussian mixture modeling.
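The clutter-rejection step lends itself to a short sketch. Everything below is synthetic and assumed (the detection geometry, the eccentricity formula, and the threshold value), but it shows the mechanics: fit a two-component Gaussian mixture, compute each component's eccentricity from the eigenvalues of its covariance, and keep only the low-eccentricity, target-like component.

```python
# Hedged sketch: separate target-like from clutter-like detections by the
# eccentricity of fitted Gaussian mixture covariances.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
target = rng.multivariate_normal([0, 0], [[0.2, 0], [0, 0.2]], 100)   # compact
clutter = rng.multivariate_normal([8, 0], [[4.0, 0], [0, 0.1]], 100)  # elongated
X = np.vstack([target, clutter])

gmm = GaussianMixture(n_components=2, random_state=2).fit(X)

def eccentricity(cov):
    lam = np.sort(np.linalg.eigvalsh(cov))           # eigenvalues, ascending
    return float(np.sqrt(1.0 - lam[0] / lam[-1]))    # 0 = circular, near 1 = elongated

ecc = [eccentricity(c) for c in gmm.covariances_]
THRESHOLD = 0.8                                      # stand-in for the a priori value
keep = [i for i, e in enumerate(ecc) if e < THRESHOLD]
print("component eccentricities:", np.round(ecc, 2), "kept:", keep)
```

Only detections assigned to the kept component would then be passed to the particle filter for tracking.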

Date Created
  • 2016

Revealing microbial responses to environmental dynamics: developing methods for analysis and visualization of complex sequence datasets

Description

The greatest barrier to understanding how life interacts with its environment is the complexity in which biology operates. In this work, I present experimental designs, analysis methods, and visualization techniques to overcome the challenges of deciphering complex biological datasets. First, I examine an iron limitation transcriptome of Synechocystis sp. PCC 6803 using a new methodology. Until now, iron limitation in gene-expression experiments with Synechocystis sp. PCC 6803 has been achieved through media chelation. Notably, chelation also reduces the bioavailability of other metals, whereas naturally occurring low-iron settings likely result from a lack of iron influx, not from chelation. The overall metabolic trends of previous studies are well-characterized, but within those trends is significant variability in single-gene expression responses. I compare previous transcriptomics analyses with our protocol, which limits the addition of bioavailable iron to growth media, to identify consistent gene expression signals resulting from iron limitation.

Second, I describe a novel method of improving the reliability of centroid-linkage clustering results. The size and complexity of modern sequencing datasets often prohibit constructing distance matrices, which prevents the use of many common clustering algorithms. Centroid-linkage circumvents the need for a distance matrix, but has the adverse effect of producing input-order-dependent results. In this chapter, I describe a method of counting cluster edges across iterated centroid-linkage results and reconstructing aggregate clusters from a ranked edge list, without a distance matrix or input-order dependence.

Finally, I introduce dendritic heat maps, a new figure type that visualizes heat map responses through expanding and contracting sequence clustering specificities. Heat maps are useful for comparing data across a range of possible states. However, data binning is sensitive to clustering cutoffs, which are often introduced arbitrarily by researchers and can substantially change the heat map response of any single data point. With an understanding of how the architectural elements of dendrograms and heat maps affect data visualization, I have integrated their salient features to create a figure type aimed at viewing multiple levels of clustering cutoffs, allowing researchers to better understand the effects of environment on metabolism or phylogenetic lineages.
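The edge-counting idea in the second chapter can be sketched roughly as follows. The details here are assumptions, not the dissertation's algorithm: a simple order-dependent leader/centroid clustering is run many times on shuffled input, co-cluster counts are accumulated per pair, and aggregate clusters are rebuilt as connected components of the consistently co-occurring pairs.

```python
# Rough sketch: iterate an input-order-dependent centroid clustering,
# count co-cluster edges, and rebuild order-independent aggregate clusters.
import numpy as np
from collections import Counter

def leader_cluster(X, order, radius):
    # Sequential centroid clustering: join the nearest centroid within `radius`
    centroids, members = [], []
    for i in order:
        if centroids:
            d = np.linalg.norm(np.array(centroids) - X[i], axis=1)
            j = int(d.argmin())
            if d[j] < radius:
                members[j].append(i)
                centroids[j] = X[members[j]].mean(axis=0)  # update centroid
                continue
        centroids.append(X[i].copy())
        members.append([i])
    return members

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(m, 0.2, size=(15, 2)) for m in (0.0, 3.0)])

edges, runs = Counter(), 30
for _ in range(runs):
    order = rng.permutation(len(X))
    for group in leader_cluster(X, order, radius=1.5):
        for a_i, a in enumerate(group):
            for b in group[a_i + 1:]:
                edges[tuple(sorted((a, b)))] += 1

# Keep edges seen in most runs, then take connected components (union-find)
strong = [e for e, c in edges.items() if c >= 0.8 * runs]
parent = list(range(len(X)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
for a, b in strong:
    parent[find(a)] = find(b)
clusters = {find(i) for i in range(len(X))}
print("aggregate clusters:", len(clusters))
```

Because the final clusters come from the ranked edge counts rather than any single run, they no longer depend on the input order.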

Date Created
  • 2017

Automatic segmentation of single neurons recorded by wide-field imaging using frequency domain features and clustering tree

Description

Recent experiments have shown that wide-field imaging at the millimeter scale is capable of recording hundreds of neurons in the brains of behaving mice. Monitoring hundreds of individual neurons at a high frame rate provides a promising tool for discovering spatiotemporal features of large neural networks. However, processing the massive data sets is impossible without automated procedures. Thus, this thesis aims at developing a new tool to automatically segment and track individual neuron cells. The method developed in this study employs two major ideas: feature extraction based on the power spectral density (PSD) of single-neuron temporal activity, and a clustering tree to separate overlapping cells. To address issues associated with high-resolution imaging of a large recording area, in-focus and out-of-focus areas were analyzed separately. A static segmentation with a fixed PSD threshold is applied to the in-focus visual field, while a dynamic segmentation that compares each pixel's maximum PSD with that of surrounding pixels is applied to out-of-focus areas. Both approaches help remove irrelevant background pixels. After detection of potential single cells, some of which appear in groups due to overlapping cells in the image, a hierarchical clustering algorithm is applied to separate them. The hierarchical clustering uses the correlation coefficient as a distance measure to group similar pixels into single cells, so that overlapping cells can be separated. We tested the entire algorithm on two real recordings, with ground truth carefully determined by manual inspection. The results show high accuracy on the tested datasets, while false-positive error is controlled within an acceptable range. Furthermore, the results indicate robustness of the algorithm when applied to different image sequences.
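The two ideas — PSD-based detection and correlation-distance hierarchical clustering — can be sketched on synthetic traces. The signal model, thresholds, and parameters below are assumptions for illustration, not the thesis's pipeline.

```python
# Simplified sketch: score pixels by the peak of their power spectral
# density, keep pixels above a background-relative threshold, then group
# the kept pixels with correlation-distance hierarchical clustering.
import numpy as np
from scipy.signal import welch
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
t = np.arange(512)
cell_a = np.sin(2 * np.pi * 0.05 * t)            # two cells with distinct rhythms
cell_b = np.sin(2 * np.pi * 0.11 * t)

traces = np.vstack(
    [cell_a + 0.2 * rng.normal(size=512) for _ in range(10)]    # pixels of cell A
    + [cell_b + 0.2 * rng.normal(size=512) for _ in range(10)]  # pixels of cell B
    + [0.2 * rng.normal(size=512) for _ in range(10)]           # background pixels
)

# PSD-based detection: background pixels have no dominant spectral peak
peak_power = np.array([welch(tr, nperseg=256)[1].max() for tr in traces])
keep = peak_power > 10 * np.median(peak_power[-10:])   # threshold vs background

# Correlation-distance hierarchical clustering separates the two cells
Z = linkage(traces[keep], method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")
print("pixels kept:", keep.sum(), "clusters:", len(set(labels)))
```

Pixels belonging to the same cell share a temporal waveform, so their pairwise correlation is high and they merge early in the dendrogram, while pixels from different cells stay apart.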

Date Created
  • 2016

A Data Mining Approach to Modeling Customer Preference: A Case Study of Intel Corporation

Description

Understanding customer preference is crucial for new product planning and marketing decisions. This thesis explores how historical data can be leveraged to understand and predict customer preference. It presents a decision support framework that provides a holistic view of customer preference through a two-phase procedure. Phase 1 uses cluster analysis to create product profiles, from which customer profiles are derived. Phase 2 then delves into each customer profile and investigates the causality behind its preferences using Bayesian networks. The thesis illustrates the working of the framework using the case of Intel Corporation, the world's largest semiconductor manufacturer.
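Phase 1 of the framework can be sketched in a few lines; the attribute names, data, and cluster count here are hypothetical, and the Phase-2 Bayesian network step is omitted.

```python
# Toy sketch of Phase 1: cluster products into profiles, then describe each
# customer by how their purchases spread across those product profiles.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
products = rng.uniform(size=(30, 3))        # e.g. price, power, performance
profile = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(products)

purchases = rng.integers(0, 30, size=(8, 12))   # 8 customers x 12 purchases each
customer_profiles = np.array([
    np.bincount(profile[p], minlength=3) / len(p) for p in purchases
])
print(customer_profiles.round(2))               # each row sums to 1
```

Each customer row is a distribution over product profiles, which is the kind of derived customer profile a causal model could then be fit to.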

Date Created
  • 2017

Controversy analysis: clustering and ranking polarized networks with visualizations

Description

The US Senate is the venue of political debate where federal bills are drafted and voted on. Senators signal their support of or opposition to bills with their votes, and this information makes it possible to extract each senator's polarity. Similarly, the blogosphere plays an increasingly important role as a forum for public debate, where authors express sentiment toward issues, organizations, or people in natural language.

In this research, given a mixed set of senators/blogs debating a set of political issues from opposing camps, I use signed bipartite graphs to model debates, and I propose an algorithm for partitioning both the opinion holders (senators or blogs) and the issues (bills or topics) comprising the debate into binary opposing camps. Simultaneously, my algorithm places the entities on a univariate scale. Using this scale, a researcher can identify moderate and extreme senators/blogs within each camp, and polarizing versus unifying issues. Through performance evaluations I show that my proposed algorithm provides an effective solution to the problem, and performs much better than existing baseline algorithms adapted to solve this new problem. In my experiments, I used both real data from the political blogosphere and US Congress records, as well as synthetic data obtained by varying the polarization and degree distribution of the graph's vertices, to show the robustness of my algorithm.
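The modeling idea (though not the dissertation's algorithm) can be illustrated with a spectral sketch: a signed senator-by-bill matrix whose leading singular vectors both split the two camps (by sign) and place each senator and bill on a univariate moderate-to-extreme scale (by magnitude). The vote matrix and noise level below are synthetic assumptions.

```python
# Hedged sketch: partition and scale a signed bipartite vote graph via SVD.
import numpy as np

rng = np.random.default_rng(6)
camp = np.repeat([1.0, -1.0], 10)                # 20 senators in two camps
align = np.repeat([1.0, -1.0], 8)                # 16 bills aligned with a camp
flips = rng.choice([1.0, -1.0], size=(20, 16), p=[0.9, 0.1])
A = np.outer(camp, align) * flips                # +1 yea, -1 nay, with noise

U, s, Vt = np.linalg.svd(A)
senator_scale = U[:, 0] * np.sign(U[0, 0])       # fix the sign convention
bill_scale = Vt[0] * np.sign(U[0, 0])

camps = np.sign(senator_scale)                   # binary opposing camps
print("camp sizes:", int((camps > 0).sum()), int((camps < 0).sum()))
```

The sign of each entry of `senator_scale` gives the camp; its magnitude separates moderates (near zero) from extremes, which mirrors the univariate scaling described above.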

I also applied my algorithm to all terms of the US Senate to date for a longitudinal analysis, and developed a web-based interactive user interface, www.PartisanScale.com, to visualize the results.

US politics is most often polarized along the left/right alignment of the entities. Certain issues, however, do not reflect party polarization: they instead show a split correlating with the demographics of the senators, or simply receive consensus. I propose a hierarchical clustering algorithm that identifies groups of bills sharing the same polarization characteristics, and I developed a web-based interactive user interface, www.ControversyAnalysis.com, to visualize the clusters while providing a synopsis through distribution charts, word clouds, and heat maps.

Date Created
  • 2015

Visual analytics for spatiotemporal cluster analysis

Description

Traditionally, visualization is one of the most important and commonly used methods of generating insight into large-scale data. Particularly for spatiotemporal data, translating such data into visual form allows users to quickly see patterns, explore summaries, and relate domain knowledge about underlying geographical phenomena that would not be apparent in tabular form. However, several critical challenges arise when visualizing and exploring these large spatiotemporal datasets. While the underlying geographical component of the data lends itself well to univariate visualization in the form of traditional cartographic representations (e.g., choropleth, isopleth, dasymetric maps), as the data becomes multivariate, cartographic representations become more complex. To simplify the visual representations, analytical methods such as clustering and feature extraction are often applied as part of the classification phase. The automatic classification can then be rendered onto a map; however, one common issue in data classification is that items near a classification boundary are often mislabeled.

This thesis explores methods to augment the automated spatial classification by utilizing interactive machine learning as part of the cluster creation step. First, this thesis explores the design space for spatiotemporal analysis through the development of a comprehensive data wrangling and exploratory data analysis platform. Second, this system is augmented with a novel method for evaluating the visual impact of edge cases for multivariate geographic projections. Finally, system features and functionality are demonstrated through a series of case studies, with key features including similarity analysis, multivariate clustering, and novel visual support for cluster comparison.

Date Created
  • 2016

Analysis of habitual patterns in vernacular movement

Description

This thesis aims to explore the language of different bodies in the field of dance by analyzing the habitual patterns of dancers from different backgrounds and vernaculars. Contextually, the term habitual patterns is defined as the postures or poses that tend to re-appear, often unintentionally, as the dancer performs improvisational dance. The focus lies in exposing the movement vocabulary of a dancer to reveal his/her unique fingerprint.

The proposed approach for uncovering these movement patterns is to use a clustering technique, mainly k-means. In addition to a static method of analysis, this paper uses an online method of clustering using a streaming variant of k-means that integrates into the flow of components that can be used in a real-time interactive dance performance. The computational system is trained by the dancer to discover identifying patterns, and therefore it enables a feedback loop resulting in a rich exchange between dancer and machine. This can help break a dancer's tendency to create similar postures, explore larger kinespheric space, and invent movement beyond their current capabilities.

This paper describes a project that distinguishes itself in that it uses a custom database curated for the purpose of highlighting the similarities and differences between various movement forms. It puts particular emphasis on the process of choosing source movement qualitatively, before the technological capture process begins.
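The online clustering component can be sketched with scikit-learn's MiniBatchKMeans as a stand-in for the streaming k-means variant described above; the 2-D pose features, posture locations, and batch sizes are toy assumptions.

```python
# Sketch of streaming posture clustering: pose features arrive in small
# batches and partial_fit updates the habitual-posture centroids online.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(7)
postures = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])  # three recurring poses

km = MiniBatchKMeans(n_clusters=3, random_state=7)
for _ in range(50):                               # simulated frame stream
    idx = rng.integers(0, 3, size=10)             # which pose each frame is near
    batch = postures[idx] + 0.1 * rng.normal(size=(10, 2))
    km.partial_fit(batch)                         # update centroids incrementally

print(km.cluster_centers_.round(1))
```

Because the model updates on each batch rather than refitting from scratch, it can run inside a real-time performance loop and flag when the dancer drifts back toward an established centroid.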

Date Created
  • 2016