Matching Items (35)
Description

Model-based clustering is a subfield of statistical modeling and machine learning. Mixture models use probabilities to describe the degree to which a data point belongs to each cluster, and these probabilities are updated iteratively during clustering. While mixture models have demonstrated superior performance in handling noisy data in many fields, challenges remain for high-dimensional datasets. Among a large number of features, some may not actually contribute to delineating the cluster profiles. Including these “noisy” features can prevent the model from identifying the real structure of the clusters and adds computational cost. Recognizing this issue, in this dissertation I first propose a new feature selection algorithm for continuous datasets, then extend it to mixed datatypes, and finally, as the third topic, conduct uncertainty quantification for the feature selection results.

The first topic is an embedded feature selection algorithm, termed the Expectation-Selection-Maximization (ESM) model, that can automatically select features while optimizing the parameters of a Gaussian Mixture Model (GMM). I introduce a relevancy index (RI) that reveals each feature's contribution to the clustering process to assist feature selection. I demonstrate the efficacy of ESM on two synthetic datasets, four benchmark datasets, and an Alzheimer's Disease dataset.
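As a rough illustration of the kind of per-feature screening an embedded GMM approach performs, the sketch below scores each feature by how much a two-component Gaussian mixture (fit by EM) improves the log-likelihood over a single Gaussian. This is only a crude stand-in for the dissertation's relevancy index, whose actual definition is not given here; all names are illustrative.

```python
import numpy as np

def feature_relevance(X, n_iter=50):
    """Score each feature by the log-likelihood gain of a two-component
    1-D Gaussian mixture (fit by EM) over a single Gaussian. A crude
    proxy for a relevancy index: multimodal features score high, pure
    noise scores near zero."""
    scores = []
    for j in range(X.shape[1]):
        x = X[:, j]
        mu, sd = x.mean(), x.std() + 1e-9
        # log-likelihood under a single Gaussian
        ll1 = np.sum(-0.5 * np.log(2 * np.pi * sd**2)
                     - (x - mu)**2 / (2 * sd**2))
        # EM for a two-component mixture, initialized at the extremes
        m = np.array([x.min(), x.max()], dtype=float)
        s = np.array([sd, sd])
        w = np.array([0.5, 0.5])
        for _ in range(n_iter):
            # E-step: responsibilities of each component for each point
            dens = (np.exp(-0.5 * ((x[:, None] - m) / s)**2)
                    / (np.sqrt(2 * np.pi) * s) * w)
            r = dens / dens.sum(1, keepdims=True)
            # M-step: re-estimate weights, means, and spreads
            nk = r.sum(0)
            w = nk / len(x)
            m = (r * x[:, None]).sum(0) / nk
            s = np.maximum(
                np.sqrt((r * (x[:, None] - m)**2).sum(0) / nk), 1e-3)
        dens = (np.exp(-0.5 * ((x[:, None] - m) / s)**2)
                / (np.sqrt(2 * np.pi) * s) * w)
        ll2 = np.sum(np.log(dens.sum(1)))
        scores.append(ll2 - ll1)
    return np.array(scores)
```

A clearly bimodal feature then scores far higher than a unimodal noise feature, which is the intuition behind pruning features that do not help delineate clusters.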

The second topic extends the ESM algorithm to handle mixed datatypes. The Gaussian mixture model is generalized to the Generalized Model of Mixture (GMoM), which can handle not only continuous features but also binary and nominal features.

The last topic concerns Uncertainty Quantification (UQ) of the feature selection. A new algorithm, termed ESOM, is proposed, which takes variance information into consideration while conducting feature selection. In addition, a set of outliers is generated in the feature selection process to infer the uncertainty in the input data. Finally, the selected features and detected outlier instances are evaluated by visual comparison.
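One generic way to bring variance information into feature selection, in the spirit of (but not identical to) the variance-aware selection described above, is to re-score features on bootstrap resamples and summarize each feature's score distribution. The sketch below is a hypothetical illustration; `score_fn` stands for any user-supplied relevance scorer.

```python
import numpy as np

def bootstrap_feature_stability(X, score_fn, n_boot=50, seed=0):
    """Re-run a feature-scoring function on bootstrap resamples of the
    rows of X, and summarize each feature's score by its mean and
    standard deviation across resamples (a simple UQ for selection)."""
    rng = np.random.default_rng(seed)
    scores = np.array([score_fn(X[rng.integers(0, len(X), size=len(X))])
                       for _ in range(n_boot)])
    return scores.mean(axis=0), scores.std(axis=0)
```

A feature whose score varies wildly across resamples is a less trustworthy selection than one with a stable score, even if their mean scores are similar.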
Contributors: Fu, Yinlin (Author) / Wu, Teresa (Thesis advisor) / Mirchandani, Pitu (Committee member) / Li, Jing (Committee member) / Pedrielli, Giulia (Committee member) / Arizona State University (Publisher)
Created: 2020
Description

High-dimensional data is omnipresent in modern industrial systems. An imaging sensor in a manufacturing plant can take images of millions of pixels, or a sensor may collect months of data at very granular time steps. Dimensionality reduction techniques are commonly used for dealing with such data. In addition, outliers typically exist in such data, which may be of direct or indirect interest given the nature of the problem that is being solved. Current research does not address the interdependent nature of dimensionality reduction and outliers. Some works ignore the existence of outliers altogether, which discredits the robustness of these methods in real life, while others provide suboptimal, often band-aid solutions. In this dissertation, I propose novel methods to achieve outlier-awareness in various dimensionality reduction methods. The problem is considered from many different angles depending on the dimensionality reduction technique used (e.g., deep autoencoder, tensors), the nature of the application (e.g., manufacturing, transportation), and the outlier structure (e.g., sparse point anomalies, novelties).
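As a toy illustration of coupling dimensionality reduction with outlier handling (not the dissertation's methods, which involve deep autoencoders and tensors), the sketch below alternates between fitting a PCA subspace and distrusting the points that subspace reconstructs worst:

```python
import numpy as np

def trimmed_pca(X, n_components=1, trim_frac=0.05, n_iter=5):
    """Toy outlier-aware PCA: alternately fit principal components on
    currently trusted points, then flag as outliers the points the
    fitted subspace reconstructs worst. Returns the subspace basis
    and an inlier mask."""
    keep = np.ones(len(X), dtype=bool)
    for _ in range(n_iter):
        mu = X[keep].mean(axis=0)
        # principal directions of the trusted points only
        _, _, Vt = np.linalg.svd(X[keep] - mu, full_matrices=False)
        V = Vt[:n_components].T
        # reconstruction error of *every* point under this subspace
        Z = (X - mu) @ V
        err = np.linalg.norm((X - mu) - Z @ V.T, axis=1)
        keep = err <= np.quantile(err, 1 - trim_frac)
    return V, keep
```

The interdependence mentioned in the abstract is visible even here: the subspace defines who the outliers are, and the outliers in turn distort the subspace unless they are excluded from the fit.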
Contributors: Sergin, Nurettin Dorukhan (Author) / Yan, Hao (Thesis advisor) / Li, Jing (Committee member) / Wu, Teresa (Committee member) / Tsung, Fugee (Committee member) / Arizona State University (Publisher)
Created: 2021
Description

Technological applications are continually being developed in the healthcare industry as technology becomes increasingly more available. In recent years, companies have started creating mobile applications to address various conditions and diseases. This falls under mHealth, the “use of mobile phones and other wireless technology in medical care” (Rouse, 2018). The goal of this study was to determine whether data gathered through mHealth methods can be used to build predictive models. The first part of this thesis contains a literature review presenting relevant definitions and several potential studies that involved the use of technology in healthcare applications. The second part of this thesis focuses on data from one study, where regression analysis is used to develop predictive models.

Rouse, M. (2018). mHealth (mobile health). Retrieved from https://searchhealthit.techtarget.com/definition/mHealth
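A minimal sketch of the kind of regression analysis described in the second part, using invented mHealth-style numbers (step counts and sleep hours predicting a wellness score); the study's actual variables and data are not shown here:

```python
import numpy as np

# Hypothetical mHealth-style records: daily step counts and hours of
# sleep as predictors of a self-reported wellness score (all invented).
steps = np.array([4000., 6500., 8000., 10000., 12000., 3000., 9000., 11000.])
sleep = np.array([6.0, 7.0, 7.5, 8.0, 8.0, 5.5, 7.0, 8.5])
score = np.array([55., 64., 71., 80., 86., 50., 74., 85.])

# Ordinary least squares fit: score ~ b0 + b1*steps + b2*sleep
Xd = np.column_stack([np.ones_like(steps), steps, sleep])
beta, *_ = np.linalg.lstsq(Xd, score, rcond=None)

# Goodness of fit: coefficient of determination (R^2)
pred = Xd @ beta
r2 = 1 - ((score - pred)**2).sum() / ((score - score.mean())**2).sum()
```

With real mHealth data the same workflow applies: assemble a design matrix from the sensor-derived predictors, fit by least squares, and judge predictive value by fit statistics on held-out data.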
Contributors: Akers, Lindsay (Co-author) / Kiraly, Alyssa (Co-author) / Li, Jing (Thesis director) / Yoon, Hyunsoo (Committee member) / Industrial, Systems & Operations Engineering Prgm (Contributor) / Barrett, The Honors College (Contributor)
Created: 2020-05
Description


The lack of lipidome analytical tools has limited our ability to gain new knowledge about lipid metabolism in microalgae, especially for membrane glycerolipids. An electrospray ionization mass spectrometry-based lipidomics method was developed for Nannochloropsis oceanica IMET1, which resolved 41 membrane glycerolipid molecular species belonging to eight classes. Changes in membrane glycerolipids under nitrogen deprivation and high-light (HL) conditions were uncovered. The results showed that the amount of plastidial membrane lipids, including monogalactosyldiacylglycerol and phosphatidylglycerol, and of the extraplastidic lipids diacylglyceryl-O-4′-(N, N, N,-trimethyl) homoserine and phosphatidylcholine decreased drastically under HL and nitrogen deprivation stresses. Algal cells accumulated considerably more digalactosyldiacylglycerol and sulfoquinovosyldiacylglycerols under stress. The genes encoding enzymes responsible for biosynthesis, modification, and degradation of glycerolipids were identified by mining a time-course global RNA-seq dataset. The results suggested that the reduction in lipid content under nitrogen deprivation is not attributable to retarded biosynthesis processes, at least at the gene-expression level, as most genes involved in biosynthesis were unaffected by nitrogen supply and several were even significantly up-regulated. Additionally, a conceptual eicosapentaenoic acid (EPA) biosynthesis network is proposed based on the lipidomic and transcriptomic data, which highlights the import of EPA from cytosolic glycerolipids to the plastid for synthesizing EPA-containing chloroplast membrane lipids.

Contributors: Han, Danxiang (Author) / Jia, Jing (Author) / Li, Jing (Author) / Sommerfeld, Milton (Author) / Xu, Jian (Author) / Hu, Qiang (Author) / College of Liberal Arts and Sciences (Contributor)
Created: 2017-08-04
Description


Background: Genetic profiling represents the future of neuro-oncology but suffers from inadequate biopsies in heterogeneous tumors like Glioblastoma (GBM). Contrast-enhanced MRI (CE-MRI) targets the enhancing core (ENH) but yields adequate tumor in only ~60% of cases. Further, CE-MRI poorly localizes infiltrative tumor within the surrounding non-enhancing parenchyma, or brain-around-tumor (BAT), despite the importance of characterizing this tumor segment, which universally recurs. In this study, we use multiple texture analysis and machine learning (ML) algorithms to analyze multi-parametric MRI and produce new images indicating tumor-rich targets in GBM.

Methods: We recruited primary GBM patients undergoing image-guided biopsies and acquired pre-operative MRI: CE-MRI, Dynamic-Susceptibility-weighted-Contrast-enhanced-MRI, and Diffusion Tensor Imaging. Following image coregistration and region of interest placement at biopsy locations, we compared MRI metrics and regional texture with histologic diagnoses of high- vs low-tumor content (≥80% vs <80% tumor nuclei) for corresponding samples. In a training set, we used three texture analysis algorithms and three ML methods to identify MRI-texture features that optimized model accuracy to distinguish tumor content. We confirmed model accuracy in a separate validation set.

Results: We collected 82 biopsies from 18 GBMs throughout ENH and BAT. The MRI-based model achieved 85% cross-validated accuracy to diagnose high- vs low-tumor in the training set (60 biopsies, 11 patients). The model achieved 81.8% accuracy in the validation set (22 biopsies, 7 patients).

Conclusion: Multi-parametric MRI and texture analysis can help characterize and visualize GBM’s spatial histologic heterogeneity to identify regional tumor-rich biopsy targets.
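As a rough, hypothetical sketch of a texture-plus-ML pipeline (far simpler than the study's texture-analysis and ML algorithms), one can extract basic patch descriptors and estimate leave-one-out accuracy with a nearest-centroid classifier; every function and threshold here is illustrative, not the paper's method:

```python
import numpy as np

def texture_features(patch):
    """Very simple texture descriptors for an image patch: mean
    intensity, variance, and mean gradient magnitude. Stand-ins for
    richer texture features such as GLCM statistics."""
    gy, gx = np.gradient(patch.astype(float))
    return np.array([patch.mean(), patch.var(),
                     np.hypot(gx, gy).mean()])

def nearest_centroid_cv(feats, labels):
    """Leave-one-out accuracy of a nearest-centroid classifier on
    per-patch feature vectors, echoing the cross-validated high- vs
    low-tumor discrimination described in the abstract."""
    feats = np.asarray(feats, dtype=float)
    labels = np.asarray(labels)
    # normalize features so no single scale dominates the distance
    feats = (feats - feats.mean(0)) / (feats.std(0) + 1e-9)
    hits = 0
    for i in range(len(feats)):
        mask = np.arange(len(feats)) != i
        cents = {c: feats[mask & (labels == c)].mean(0)
                 for c in np.unique(labels)}
        pred = min(cents, key=lambda c: np.linalg.norm(feats[i] - cents[c]))
        hits += pred == labels[i]
    return hits / len(feats)
```

In the study's setting, the feature vectors would come from texture analysis of coregistered multi-parametric MRI at biopsy locations, and the labels from the histologic high- vs low-tumor-content diagnoses.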

Contributors: Hu, Leland S. (Author) / Ning, Shuluo (Author) / Eschbacher, Jennifer M. (Author) / Gaw, Nathan (Author) / Dueck, Amylou C. (Author) / Smith, Kris A. (Author) / Nakaji, Peter (Author) / Plasencia, Jonathan (Author) / Ranjbar, Sara (Author) / Price, Stephen J. (Author) / Tran, Nhan (Author) / Loftus, Joseph (Author) / Jenkins, Robert (Author) / O'Neill, Brian P. (Author) / Elmquist, William (Author) / Baxter, Leslie C. (Author) / Gao, Fei (Author) / Frakes, David (Author) / Karis, John P. (Author) / Zwart, Christine (Author) / Swanson, Kristin R. (Author) / Sarkaria, Jann (Author) / Wu, Teresa (Author) / Mitchell, J. Ross (Author) / Li, Jing (Author) / Ira A. Fulton Schools of Engineering (Contributor)
Created: 2015-11-24