Matching Items (217)
- Creators: Arizona State University
- Member of: ASU Electronic Theses and Dissertations
- Status: Published
Multi-label learning, which deals with data associated with multiple labels simultaneously, is ubiquitous in real-world applications. To overcome the curse of dimensionality in multi-label learning, in this thesis I study multi-label dimensionality reduction, which extracts a small number of features by removing irrelevant, redundant, and noisy information while considering the correlation among different labels. Specifically, I propose Hypergraph Spectral Learning (HSL) to perform dimensionality reduction for multi-label data by exploiting correlations among different labels using a hypergraph. The regularization effect on the classical dimensionality reduction algorithm known as Canonical Correlation Analysis (CCA) is elucidated in this thesis. The relationship between CCA and Orthonormalized Partial Least Squares (OPLS) is also investigated. To perform dimensionality reduction efficiently for large-scale problems, two efficient implementations are proposed for a class of dimensionality reduction algorithms, including canonical correlation analysis, orthonormalized partial least squares, linear discriminant analysis, and hypergraph spectral learning. The first approach is a direct least squares approach, which allows the use of different regularization penalties but is applicable only under a certain assumption; the second is a two-stage approach, which can be applied in the regularization setting without any assumption. Furthermore, an online implementation for the same class of dimensionality reduction algorithms is proposed for the case in which the data arrive sequentially. A Matlab toolbox for multi-label dimensionality reduction has been developed and released. The proposed algorithms have been applied successfully to Drosophila gene expression pattern image annotation. The experimental results on benchmark data sets in multi-label learning also demonstrate the effectiveness and efficiency of the proposed algorithms.
Real-world environments are characterized by non-stationary and continuously evolving data. Learning a classification model on such data requires a framework that is able to adapt itself to newer circumstances. Under such circumstances, transfer learning has come to be a dependable methodology for improving classification performance with reduced training costs and without the need for explicit relearning from scratch. In this thesis, a novel instance transfer technique that adapts a "Cost-sensitive" variation of AdaBoost is presented. The method capitalizes on the theoretical and functional properties of AdaBoost to selectively reuse outdated training instances obtained from a "source" domain to effectively classify unseen instances occurring in a different, but related, "target" domain. The algorithm is evaluated on real-world classification problems, namely accelerometer-based 3D gesture recognition, smart home activity recognition, and text categorization. The performance on these datasets is analyzed and evaluated against popular boosting-based instance transfer techniques. In addition, supporting empirical studies that investigate some of the less explored bottlenecks of boosting-based instance transfer methods are presented, to understand the suitability and effectiveness of this form of knowledge transfer.
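To make the idea of boosting-based instance transfer concrete, the following is a minimal sketch of one weight-update round in the style of TrAdaBoost, a widely used baseline of this kind. It is not the cost-sensitive method proposed in the thesis, whose details the abstract does not give; the function name `tradaboost_reweight` and its arguments are hypothetical.

```python
import math


def tradaboost_reweight(w_src, w_tgt, err_src, err_tgt, n_rounds):
    """One TrAdaBoost-style weight update (illustrative sketch only).

    w_src, w_tgt: current instance weights for source and target data.
    err_src, err_tgt: 0/1 flags, 1 where the weak learner misclassified.
    n_rounds: total number of boosting rounds planned.
    """
    # Weighted error of the weak learner, measured on target data only.
    eps = sum(w * e for w, e in zip(w_tgt, err_tgt)) / sum(w_tgt)
    eps = min(max(eps, 1e-10), 0.499)  # keep beta_t well defined

    beta_t = eps / (1.0 - eps)
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(len(w_src)) / n_rounds))

    # Misclassified source instances are down-weighted (beta < 1):
    # they look unlike the target domain.
    new_src = [w * beta ** e for w, e in zip(w_src, err_src)]
    # Misclassified target instances are up-weighted, as in AdaBoost.
    new_tgt = [w * beta_t ** (-e) for w, e in zip(w_tgt, err_tgt)]
    return new_src, new_tgt


new_src, new_tgt = tradaboost_reweight(
    w_src=[1.0, 1.0], w_tgt=[1.0, 1.0, 1.0, 1.0],
    err_src=[1, 0], err_tgt=[1, 0, 0, 0], n_rounds=10,
)
```

In this toy call, the misclassified source instance loses weight while the misclassified target instance gains weight, which is the mechanism that lets outdated source instances be reused selectively.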
Colorectal cancer is the second-highest cause of cancer-related deaths in the United States, with approximately 50,000 estimated deaths in 2015. The advanced stages of colorectal cancer have a poor five-year survival rate of 10%, whereas diagnosis in the early stages of development has shown a more favorable five-year survival rate of 90%. Early diagnosis of colorectal cancer is achievable if colorectal polyps, a possible precursor to cancer, are detected and removed before developing into malignancy.
The preferred method for polyp detection and removal is optical colonoscopy. A colonoscopic procedure consists of two phases: (1) insertion phase during which a flexible endoscope (a flexible tube with a tiny video camera at the tip) is advanced via the anus and then gradually to the end of the colon--called the cecum, and (2) withdrawal phase during which the endoscope is gradually withdrawn while colonoscopists examine the colon wall to find and remove polyps. Colonoscopy is an effective procedure and has led to a significant decline in the incidence and mortality of colon cancer. However, despite many screening and therapeutic advantages, 1 out of every 4 polyps and 1 out of 13 colon cancers are missed during colonoscopy.
There are many factors that contribute to missed polyps and cancers including poor colon preparation, inadequate navigational skills, and fatigue. Poor colon preparation results in a substantial portion of colon covered with fecal content, hindering a careful examination of the colon. Inadequate navigational skills can prevent a colonoscopist from examining hard-to-reach regions of the colon that may contain a polyp. Fatigue can manifest itself in the performance of a colonoscopist by decreasing diligence and vigilance during procedures. Lack of vigilance may prevent a colonoscopist from detecting the polyps that briefly appear in the colonoscopy videos. Lack of diligence may result in hasty examination of the colon that is likely to miss polyps and lesions.
To reduce polyp and cancer miss rates, this research presents a quality assurance system with 3 components. The first component is an automatic polyp detection system that highlights the regions with suspected polyps in colonoscopy videos. The goal is to encourage more vigilance during procedures. The suggested polyp detection system consists of several novel modules: (1) a new patch descriptor that characterizes image appearance around boundaries more accurately and more efficiently than widely used patch descriptors such as HoG, LBP, and Daisy; (2) a 2-stage classification framework that is able to enhance low-level image features prior to classification. Unlike the traditional way of image classification, where a single patch undergoes the processing pipeline, our system fuses the information extracted from a pair of patches for more accurate edge classification; (3) a new vote accumulation scheme that robustly localizes objects with curvy boundaries in fragmented edge maps. Our voting scheme produces a probabilistic output for each polyp candidate but, unlike the existing methods (e.g., Hough transform), does not require any predefined parametric model of the object of interest; and (4) a unique three-way image representation coupled with convolutional neural networks (CNNs) for classifying the polyp candidates. Our image representation efficiently captures a variety of features such as color, texture, shape, and temporal information and significantly improves the performance of the subsequent CNNs for candidate classification. This contrasts with the existing methods, which mainly rely on a subset of the above image features for polyp detection. Furthermore, this research is the first to investigate the use of CNNs for polyp detection in colonoscopy videos.
The second component of our quality assurance system is an automatic image quality assessment for colonoscopy. The goal is to encourage more diligence during procedures by warning against hasty and low quality colon examination. We detect a low quality colon examination by identifying a number of consecutive non-informative frames in videos. We base our methodology for detecting non-informative frames on two key observations: (1) non-informative frames most often show an unrecognizable scene with few details and blurry edges and thus their information can be locally compressed in a few Discrete Cosine Transform (DCT) coefficients; however, informative images include much more details and their information content cannot be summarized by a small subset of DCT coefficients; (2) information content is spread all over the image in the case of informative frames, whereas in non-informative frames, depending on image artifacts and degradation factors, details may appear in only a few regions. We use the former observation in designing our global features and the latter in designing our local image features. We demonstrated that the suggested new features are superior to the existing features based on wavelet and Fourier transforms.
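The first observation, that a blurry, low-detail patch concentrates its energy in a few DCT coefficients, can be illustrated with a minimal sketch. The functions below (`dct_1d`, `dct_2d`, `low_freq_energy_ratio`) are hypothetical helpers written for this illustration and are not the features used in the thesis; they compute the fraction of a patch's energy that falls in its lowest-frequency DCT coefficients.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(
            v * math.cos(math.pi * (i + 0.5) * k / n)
            for i, v in enumerate(values)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2-D DCT-II: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def low_freq_energy_ratio(block, keep=2):
    """Fraction of total energy in the keep x keep lowest frequencies."""
    coeffs = dct_2d(block)
    total = sum(c * c for row in coeffs for c in row)
    if total == 0.0:
        return 1.0  # an all-zero patch has no detail at all
    low = sum(coeffs[u][v] ** 2 for u in range(keep) for v in range(keep))
    return low / total


# Synthetic 8x8 patches: a smooth gradient versus a fine checkerboard.
smooth = [[float(i + j) for j in range(8)] for i in range(8)]
detailed = [[float((i + j) % 2) for j in range(8)] for i in range(8)]

smooth_ratio = low_freq_energy_ratio(smooth)
detailed_ratio = low_freq_energy_ratio(detailed)
```

Under these assumptions, the smooth patch yields a ratio close to 1, while the checkerboard leaves a large share of its energy in high-frequency coefficients. A ratio of this kind could serve as one ingredient of a global feature; computing it per sub-block would relate to the second, local observation.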
The third component of our quality assurance system is a 3D visualization system. The goal is to provide colonoscopists with feedback about the regions of the colon that have remained unexamined during colonoscopy, thereby helping them improve their navigational skills. The suggested system is based on a new 3D reconstruction algorithm that combines depth and position information for 3D reconstruction. We propose to use a depth camera and a tracking sensor to obtain depth and position information. Our system contrasts with the existing works where the depth and position information are unreliably estimated from the colonoscopy frames. We conducted a use case experiment, demonstrating that the suggested 3D visualization system can determine the unseen regions of the navigated environment. However, due to technology limitations, we were not able to evaluate our 3D visualization system using a phantom model of the colon.
Alzheimer's Disease (AD) is the most common form of dementia observed in elderly patients and has significant socioeconomic impact. There are many initiatives that aim to capture the leading causes of AD. Several genetic, imaging, and biochemical markers are being explored to monitor the progression of AD and to explore treatment and detection options. The primary focus of this thesis is to identify key biomarkers to understand the pathogenesis and prognosis of Alzheimer's Disease. Feature selection is the process of finding a subset of relevant features to develop efficient and robust learning models. It is an active research topic in diverse areas such as computer vision, bioinformatics, information retrieval, chemical informatics, and computational finance. In this work, state-of-the-art feature selection algorithms, such as Student's t-test, Relief-F, Information Gain, Gini Index, Chi-Square, Fisher Kernel Score, Kruskal-Wallis, Minimum Redundancy Maximum Relevance, and Sparse Logistic Regression with Stability Selection, have been extensively exploited to identify informative features for AD using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI). An integrative approach that uses blood plasma protein, Magnetic Resonance Imaging, and psychometric assessment score biomarkers has been explored. This work also analyzes techniques for handling unbalanced data and evaluates the efficacy of sampling techniques. The performance of each feature selection algorithm is evaluated using the relevance of the derived features and the predictive power of the algorithm with Random Forest and Support Vector Machine classifiers. Performance metrics such as Accuracy, Sensitivity and Specificity, and area under the Receiver Operating Characteristic curve (AUC) have been used for evaluation. The feature selection algorithms best suited to analyzing AD proteomics data have been proposed.
The key biomarkers distinguishing healthy and AD patients, Mild Cognitive Impairment (MCI) converters and non-converters, and healthy and MCI patients have been identified.
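The evaluation metrics named above have standard definitions, sketched here for a binary problem. The helper names and the toy labels and scores are made up for illustration and are not ADNI data; AUC is computed with the rank-based (Mann-Whitney) formulation, counting ties as one half.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    accuracy = (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = patient, 0 = healthy control (synthetic values).
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.3, 0.2, 0.8, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec, acc = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```

Reporting sensitivity and specificity alongside accuracy matters for unbalanced data, where accuracy alone can look good even when the minority class is mostly misclassified.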
As we migrate into an era of personalized medicine, understanding how bio-molecules interact with one another to form cellular systems is one of the key focus areas of systems biology. Several challenges such as the dynamic nature of cellular systems, uncertainty due to environmental influences, and the heterogeneity between individual patients render this a difficult task. In the last decade, several algorithms have been proposed to elucidate cellular systems from data, resulting in numerous data-driven hypotheses. However, due to the large number of variables involved in the process, many of which are unknown or not measurable, such computational approaches often lead to a high proportion of false positives. This renders interpretation of the data-driven hypotheses extremely difficult. Consequently, a dismal proportion of these hypotheses are subject to further experimental validation, eventually limiting their potential to augment existing biological knowledge. This dissertation develops a framework of computational methods for the analysis of such data-driven hypotheses leveraging existing biological knowledge. Specifically, I show how biological knowledge can be mapped onto these hypotheses and subsequently augmented through novel hypotheses. Biological hypotheses are learnt in three levels of abstraction -- individual interactions, functional modules and relationships between pathways, corresponding to three complementary aspects of biological systems. The computational methods developed in this dissertation are applied to high throughput cancer data, resulting in novel hypotheses with potentially significant biological impact.
The gas turbine engine for aircraft propulsion is one of the most physically complex and safety-critical systems in the world. Diagnosing its failures is challenging due to the complexity of the modeled system, the difficulty of practical testing, and the infeasibility of creating homogeneous diagnostic performance evaluation criteria for the diverse engine makes.
NASA has designed and publicized a standard benchmark problem for propulsion engine gas path diagnostics that enables comparisons among different engine diagnostic approaches. Some traditional model-based approaches and novel, purely data-driven approaches such as machine learning have been applied to this problem.
This study focuses on a different machine learning approach to the diagnostic problem. Some of the most common machine learning techniques, such as the support vector machine, multi-layer perceptron, and self-organizing map, are used to help gain insight into the different engine failure modes from the perspective of big data. They are organically integrated to achieve good performance based on a good understanding of the complex dataset.
The study presents a new hierarchical machine learning structure to enhance classification accuracy in NASA's engine diagnostic benchmark problem. The designed hierarchical structure produces an average diagnostic accuracy of 73.6%, which outperforms comparable studies that were most recently published.
A major challenge in automated text analysis is that different words are used for related concepts. Analyzing text at the surface level would treat related concepts (i.e., actors, actions, targets, and victims) as different objects, potentially missing common narrative patterns. Generalized concepts are used to overcome this problem. Generalization may cause word sense disambiguation to fail to find similarity. This is addressed by taking contextual synonyms into account. Concept discovery based on contextual synonyms reveals information about the semantic roles of the words leading to concepts. A merger engine generalizes the concepts so that they can be used as features in learning algorithms.
Multidimensional data have various representations. Thanks to their simplicity in modeling multidimensional data and the availability of various mathematical tools (such as tensor decompositions) that support multi-aspect analysis of such data, tensors are increasingly being used in many application domains, including scientific data management, sensor data management, and social network data analysis. The relational model, on the other hand, enables semantic manipulation of data using relational operators, such as projection, selection, Cartesian product, and set operators. For many multidimensional data applications, tensor operations as well as relational operations need to be supported throughout the data life cycle. In this thesis, we introduce a tensor-based relational data model (TRM), which enables both tensor-based data analysis and relational manipulations of multidimensional data, and define tensor-relational operations on this model. Then we introduce a tensor-relational data management system called TensorDB. TensorDB is based on TRM, which brings together relational algebraic operations (for data manipulation and integration) and tensor algebraic operations (for data analysis). We develop optimization strategies for tensor-relational operations in both in-memory and in-database TensorDB. The goal of TRM and TensorDB is to serve as a single environment that supports the entire life cycle of data; that is, data can be manipulated, integrated, processed, and analyzed.
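As a rough illustration of how relational and tensor views of the same data can coexist, the sketch below applies a relational selection to a small relation and then maps the result onto a sparse tensor, which is in turn reduced along one mode. The function names (`select`, `relation_to_tensor`, `mode_sum`) and the toy rows are hypothetical and do not reflect the actual TRM operators or the TensorDB interface.

```python
def select(rows, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def relation_to_tensor(rows, dims, value):
    """Map a relation onto a sparse tensor.

    Each attribute in `dims` becomes one tensor mode; attribute values
    are assigned integer indices in order of first appearance. Returns
    (index maps per mode, {index tuple: cell value}).
    """
    index = {d: {} for d in dims}
    cells = {}
    for row in rows:
        key = tuple(index[d].setdefault(row[d], len(index[d])) for d in dims)
        cells[key] = cells.get(key, 0) + row[value]
    return index, cells


def mode_sum(cells, mode):
    """Tensor-side operation: sum out one mode of a sparse tensor."""
    out = {}
    for key, v in cells.items():
        reduced = key[:mode] + key[mode + 1:]
        out[reduced] = out.get(reduced, 0) + v
    return out


ratings = [
    {"user": "a", "item": "x", "year": 2010, "rating": 3},
    {"user": "a", "item": "y", "year": 2011, "rating": 5},
    {"user": "b", "item": "x", "year": 2011, "rating": 4},
]

recent = select(ratings, lambda r: r["year"] >= 2011)
index, cells = relation_to_tensor(recent, ("user", "item"), "rating")
per_item = mode_sum(cells, 0)  # collapse the user mode
```

Here the selection runs on the relational side before the tensor is built, so the tensor stays small; choosing where such operations run is the kind of decision an optimizer for tensor-relational operations would have to make.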
The healthcare system in this country is currently unacceptable. New technologies may contribute to reducing cost and improving outcomes. Early diagnosis and treatment represent the least risky option for addressing this issue. Such a technology needs to be inexpensive, highly sensitive, highly specific, and amenable to adoption in a clinic. This thesis explores an immunodiagnostic technology based on highly scalable, non-natural sequence peptide microarrays designed to profile the humoral immune response and address the healthcare problem. The primary aim of this thesis is to explore the ability of these arrays to map continuous (linear) epitopes. I discovered that, using a technique termed subsequence analysis, epitopes could be decisively mapped to an eliciting protein with a high success rate. This led to the discovery of novel linear epitopes from Plasmodium falciparum (Malaria) and Treponema pallidum (Syphilis), as well as validation of previously discovered epitopes in Dengue and monoclonal antibodies. Next, I developed and tested a classification scheme based on Support Vector Machines for development of a Dengue Fever diagnostic, achieving higher sensitivity and specificity than current FDA-approved techniques. The software underlying this method is available for download under the BSD license. Following this, I developed a kinetic model for immunosignatures and tested it against existing data driven by previously unexplained phenomena. This model provides a framework and informs ways to optimize the platform for maximum stability and efficiency. I also explored the role of sequence composition in explaining an immunosignature binding profile, determining a strong role for charged residues that seems to have some predictive ability for disease. Finally, I developed a database, software, and indexing strategy based on Apache Lucene for searching motif patterns (regular expressions) in large biological databases.
These projects as a whole have advanced knowledge of how to approach high throughput immunodiagnostics and provide an example of how technology can be fused with biology in order to affect scientific and health outcomes.
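The motif search mentioned above treats a motif as a regular expression over protein sequences. The following minimal sketch shows that idea with Python's standard `re` module on an in-memory dictionary; it is a plain linear scan rather than the Lucene-based indexing strategy developed in the thesis, and the function `find_motif` and the sequences are invented for illustration. The pattern used is the N-glycosylation consensus motif (N, any residue except P, S or T, any residue except P).

```python
import re


def find_motif(sequences, pattern):
    """Return (name, start, matched text) for each non-overlapping hit."""
    regex = re.compile(pattern)
    hits = []
    for name, seq in sequences.items():
        for match in regex.finditer(seq):
            hits.append((name, match.start(), match.group()))
    return hits


# Synthetic sequences, not taken from any real database.
sequences = {
    "seq1": "MKNATGLLNPSA",
    "seq2": "AAAA",
}

hits = find_motif(sequences, r"N[^P][ST][^P]")
```

In `seq1`, the asparagine at position 2 matches, whereas the one at position 8 is followed by proline and is rejected. A linear scan like this grows costly on large databases, which is the motivation for an index-based strategy.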
Research in the learning sciences suggests that students learn better by collaborating with their peers than learning individually. Students working together as a group tend to generate new ideas more frequently and exhibit a higher level of reasoning. In this internet age with the advent of massive open online courses (MOOCs), students across the world are able to access and learn material remotely. This creates a need for tools that support distant or remote collaboration. In order to build such tools we need to understand the basic elements of remote collaboration and how it differs from traditional face-to-face collaboration.
The main goal of this thesis is to explore how spoken dialogue varies in face-to-face and remote collaborative learning settings. Speech data is collected from student participants solving mathematical problems collaboratively on a tablet. Spoken dialogue is analyzed based on conversational and acoustic features in both settings. Looking for collaborative differences of transactivity and dialogue initiative, both settings are compared in detail using machine learning classification techniques based on acoustic and prosodic features of speech. Transactivity is defined as a joint construction of knowledge by peers. The main contributions of this thesis are a speech corpus to analyze spoken dialogue in face-to-face and remote settings and an empirical analysis of conversation, collaboration, and speech prosody in both settings. The results from the experiments show that the amount of overlap is lower in remote dialogue than in the face-to-face setting. There is a significant difference in transactivity among strangers. My research benefits the computer-supported collaborative learning community by providing an analysis that can be used to build more efficient tools for supporting remote collaborative learning.