Matching Items (41)


Fine Mapping Functional Noncoding Genetic Elements Via Machine Learning

Description

All biological processes, such as cell growth, cell differentiation, development, and aging, require a series of steps characterized by gene regulation. Studies have shown that gene regulation is the key to various traits and diseases. Many factors affect gene regulation, including genetic signals, epigenetic tracks, and genetic variants. Deciphering and cataloging these functional genetic elements in the non-coding regions of the genome is one of the biggest challenges in precision medicine and genetic research. This thesis presents two approaches to identifying these elements: TreeMap and DeepCORE. The first identifies putative causal genetic variants in cis-eQTLs, accounting for multisite effects and genetic linkage at a locus; TreeMap performs an organized search for individual and multiple causal variants using a tree-guided nested machine learning method. DeepCORE, on the other hand, explores novel deep learning techniques that model the relationship between genetic, epigenetic, and transcriptional patterns across tissues and cell lines, and identifies cooperative regulatory elements that affect gene regulation. These two methods are believed to supply the link between genotype and phenotype and to be a necessary step toward explaining complex diseases and missing heritability.
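TreeMap's tree-guided nested search is more elaborate than any short example can show, but the flavor of multisite fine-mapping can be sketched with a generic greedy forward selection over candidate variants (all names, data, and the threshold below are illustrative, not TreeMap's actual procedure):

```python
# Illustrative sketch, NOT the TreeMap algorithm: greedily select variants
# whose genotypes explain residual expression variance, mimicking a
# multisite search for causal eQTL variants at a locus.
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between a genotype vector and expression."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return 0.0 if vx == 0 or vy == 0 else (cov * cov) / (vx * vy)

def greedy_causal_search(genotypes, expression, threshold=0.1):
    """Pick variants one at a time, keeping each only if it explains
    residual expression variance beyond the already-chosen set."""
    chosen, residual = [], list(expression)
    for name, g in sorted(genotypes.items(),
                          key=lambda kv: -r_squared(kv[1], expression)):
        if r_squared(g, residual) >= threshold:
            chosen.append(name)
            # crude residualization: subtract the fitted effect of g
            mg, mr = mean(g), mean(residual)
            vg = sum((a - mg) ** 2 for a in g)
            beta = sum((a - mg) * (b - mr) for a, b in zip(g, residual)) / vg
            residual = [b - beta * (a - mg) for a, b in zip(g, residual)]
    return chosen
```

A variant in perfect linkage with an already-selected one contributes no residual signal and is skipped, which is the basic intuition behind accounting for genetic linkage.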

Date Created
  • 2020


Operational Safety Verification of AI-Enabled Cyber-Physical Systems

Description

One of the main challenges in testing artificial intelligence (AI) enabled cyber-physical systems (CPS), such as autonomous driving systems and internet-of-things (IoT) medical devices, is the presence of machine learning components, for which formal properties are difficult to establish. In addition, interactions among operational components, the inclusion of a human-in-the-loop, and environmental changes result in a myriad of safety concerns, not all of which can be comprehensively tested before deployment, and some of which may not even have been detected during the design and testing phases. This dissertation identifies major challenges in the safety verification of AI-enabled safety-critical systems and addresses the safety problem by proposing an operational safety verification technique that relies on solving the following subproblems:
1. Given input/output operational traces collected from sensors and actuators, automatically learn a hybrid automaton (HA) representation of the AI-enabled CPS.
2. Given the learned HA, evaluate the operational safety of the AI-enabled CPS in the field.
This dissertation presents novel approaches for learning a hybrid automaton model from time-series traces collected from the operation of the AI-enabled CPS in the real world, for both linear and nonlinear CPS. The learned model allows operational safety to be stringently evaluated by comparing the learned HA against a reference specification model of the system. The proposed techniques are evaluated on the artificial pancreas control system.
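The two subproblems above can be illustrated in miniature. The sketch below fits a single linear mode of the form x[k+1] = a·x[k] + b from one operational trace and then checks the learned dynamics against a safety envelope; the dissertation's methods handle full hybrid automata with multiple modes and nonlinear dynamics, so this is only a hedged, single-mode analogue:

```python
# Minimal sketch (not the dissertation's algorithm): learn one linear mode
# of a hybrid automaton from a time-series trace, then evaluate safety by
# simulating the learned mode inside a reference envelope [low, high].

def fit_linear_mode(trace):
    """Least-squares fit of (a, b) in x[k+1] = a*x[k] + b from one trace."""
    xs, ys = trace[:-1], trace[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def safe_under_model(a, b, x0, steps, low, high):
    """Simulate the learned mode and verify the state stays in [low, high]."""
    x = x0
    for _ in range(steps):
        x = a * x + b
        if not (low <= x <= high):
            return False
    return True
```

With a noiseless trace the fit recovers the generating dynamics exactly, and the safety check then amounts to bounding the mode's trajectory against the reference model.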

Date Created
  • 2020


Design, analytics and quality assurance for emerging personalized clinical diagnostics based on next-gen sequencing

Description

Major advancements in biology and medicine have been realized during recent decades, including massively parallel sequencing, which allows researchers to collect millions or billions of short reads from a DNA or RNA sample. This capability opens the door to a renaissance in personalized medicine if effectively deployed. Three projects that address major and necessary advancements in massively parallel sequencing are included in this dissertation. The first study involves a pair of algorithms to verify patient identity based on single nucleotide polymorphisms (SNPs). In brief, we developed a method that allows de novo construction of sample relationships, e.g., which samples are from the same individual and which are from different individuals. We also developed a method to confirm the hypothesis that a tumor came from a known individual. The second study derives an algorithm to multiplex multiple Polymerase Chain Reaction (PCR) reactions while minimizing the interference between reactions that compromises results. PCR is a powerful technique that amplifies pre-determined regions of DNA and is often used to selectively amplify DNA and RNA targets destined for sequencing. It is highly desirable to multiplex reactions to save on reagent and assay setup costs, as well as to equalize the effect of minor handling issues across gene targets. Our solution involves a binary integer program that minimizes events likely to cause interference between PCR reactions. The third study involves design and analysis methods required to analyze gene expression and copy number results against a reference range in a clinical setting for guiding patient treatments. Our goal is to determine which events are present in a given tumor specimen: these events may be mutations, DNA copy number changes, or RNA expression changes. All three techniques are being used for their intended purposes in major research and diagnostic projects at the time of writing this manuscript. The SNP matching solution has been selected by The Cancer Genome Atlas to determine sample identity. Paradigm Diagnostics, Viomics, and International Genomics Consortium utilize the PCR multiplexing technique for various types of PCR reactions on multi-million dollar projects. The reference range-based normalization method is used by Paradigm Diagnostics to analyze results from every patient.
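The multiplexing objective can be made concrete. The dissertation formulates it as a binary integer program; the brute-force sketch below solves only toy instances of the same idea, assigning reactions to pools so that the total interference penalty of co-pooled reaction pairs is minimized (the interference weights here are made up for illustration):

```python
# Hedged sketch of the PCR pooling objective: minimize the total
# interference score over pairs of reactions assigned to the same pool.
# A real solver would use an integer-programming formulation; this brute
# force enumerates all assignments, so it only scales to tiny instances.
from itertools import product

def best_pooling(n_reactions, n_pools, interference):
    """interference[(i, j)] penalizes reactions i < j sharing a pool.
    Returns (assignment tuple, total interference cost)."""
    best, best_cost = None, float("inf")
    for assign in product(range(n_pools), repeat=n_reactions):
        cost = sum(w for (i, j), w in interference.items()
                   if assign[i] == assign[j])
        if cost < best_cost:
            best, best_cost = assign, cost
    return best, best_cost
```

For four reactions where pairs (0,1), (2,3), and (0,2) interfere, two pools suffice to separate every conflicting pair, so the optimal cost is zero.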

Date Created
  • 2014


Spatio-temporal data mining to detect changes and clusters in trajectories

Description

With the rapid development of mobile sensing technologies such as GPS, RFID, and smartphone sensors, capturing position data in the form of trajectories has become easy. Moving object trajectory analysis is a growing area of interest owing to its applications in domains such as marketing, security, and traffic monitoring and management. To better understand movement behaviors from raw mobility data, this doctoral work provides analytic models for analyzing trajectory data. As a first contribution, a model is developed to detect changes in trajectories over time. If the taxis moving in a city are viewed as sensors that provide real-time information about traffic in the city, a change in these trajectories over time can reveal that the road network has changed. To detect changes, trajectories are modeled with a Hidden Markov Model (HMM). A modified training algorithm for parameter estimation in HMMs, called m-BaumWelch, is used to develop likelihood estimates under assumed changes and to detect changes in trajectory data over time. Data from vehicles are used to test the method for change detection. Secondly, sequential pattern mining is used to develop a model to detect changes in frequent patterns occurring in trajectory data. The aim is to answer two questions: Are the frequent patterns still frequent in the new data? If they are, has the time-interval distribution in the pattern changed? Two approaches are considered for change detection: a frequency-based approach and a distribution-based approach. The methods are illustrated with vehicle trajectory data. Finally, a model is developed for clustering and outlier detection in semantic trajectories. A challenge with clustering semantic trajectories is that both numeric and categorical attributes are present; another is that trajectories can be of different lengths and can have missing values. A tree-based ensemble is used to address these problems, and the approach is extended to outlier detection in semantic trajectories.
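The likelihood machinery behind the HMM change-detection idea can be sketched directly. The forward algorithm below scores an observation sequence under an HMM; a sharp drop in log-likelihood on newly collected trajectories relative to the trained model signals a possible change. This is the standard scaled forward recursion, not the m-BaumWelch algorithm itself:

```python
# Illustrative sketch (generic forward algorithm, not m-BaumWelch):
# score a discrete observation sequence under an HMM. Comparing scores
# of new trajectories against the trained model flags possible changes.
import math

def forward_loglik(obs, pi, A, B):
    """log P(obs | HMM) via the scaled forward recursion.
    pi: initial state probs, A: transition matrix, B: emission matrix."""
    n = len(pi)
    alpha = [pi[s] * B[s][obs[0]] for s in range(n)]
    ll = 0.0
    for t in range(1, len(obs) + 1):
        norm = sum(alpha)
        ll += math.log(norm)          # accumulate scaling factors
        alpha = [a / norm for a in alpha]
        if t < len(obs):              # propagate to the next time step
            alpha = [sum(alpha[p] * A[p][s] for p in range(n)) * B[s][obs[t]]
                     for s in range(n)]
    return ll
```

A model whose emissions favor the observed symbols scores strictly higher than a mismatched one, which is exactly the comparison a likelihood-based change detector makes.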

Date Created
  • 2012


Optimal design of experiments for functional responses

Description

Functional or dynamic responses are prevalent in experiments in the fields of engineering, medicine, and the sciences, but proposals for optimal designs are still sparse for this type of response. Experiments with dynamic responses result in multiple responses taken over a spectrum variable, so the design matrix for a dynamic response has a more complicated structure. In the literature, the optimal design problem for some functional responses has been solved using genetic algorithms (GA) and approximate design methods. The goal of this dissertation is to develop fast computer algorithms for calculating exact D-optimal designs.

First, we demonstrated how traditional exchange methods can be improved to yield a computationally efficient algorithm for finding G-optimal designs. The proposed two-stage algorithm, called the cCEA, uses a clustering-based approach to restrict the set of candidates for the point exchange algorithm (PEA), and then improves G-efficiency using the coordinate exchange algorithm (CEA).

The second major contribution of this dissertation is the development of fast algorithms for constructing D-optimal designs that determine the optimal sequence of stimuli in fMRI studies. The update formula for the determinant of the information matrix was improved by exploiting the sparseness of the information matrix, leading to faster computation times. The proposed algorithm outperforms the genetic algorithm with respect to both computational efficiency and D-efficiency.

The third contribution is a study of optimal experimental designs for more general functional response models. First, a B-spline basis is proposed as the non-parametric smoother of the response function, and an algorithm is developed to determine D-optimal sampling points of the spectrum variable. Second, we proposed a two-step algorithm for finding the optimal design for both sampling points and experimental settings. In the first step, the matrix of experimental settings is held fixed while the algorithm optimizes the determinant of the information matrix for a mixed-effects model to find the optimal sampling times. In the second step, the optimal sampling times obtained from the first step are held fixed while the algorithm iterates on the information matrix to find the optimal experimental settings. The designs constructed by this approach yield superior performance over other designs found in the literature.
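The exchange idea underlying these algorithms is simple to demonstrate on a toy problem. The sketch below is the generic textbook coordinate exchange for an exact D-optimal design of a straight-line model, not the cCEA/PEA machinery of the dissertation: swap design points from a candidate list whenever the swap increases det(X'X).

```python
# Generic coordinate-exchange sketch for an exact D-optimal design of the
# one-factor model y = b0 + b1*x (NOT the thesis's cCEA algorithm).

def det_info(design):
    """det(X'X) for the model matrix X = [[1, x] for x in design]."""
    n = len(design)
    sx = sum(design)
    sxx = sum(x * x for x in design)
    return n * sxx - sx * sx

def coordinate_exchange(design, candidates, passes=10):
    """Repeatedly try replacing each design point with each candidate,
    keeping any swap that strictly increases det(X'X)."""
    design = list(design)
    for _ in range(passes):
        improved = False
        for i in range(len(design)):
            for c in candidates:
                trial = design[:i] + [c] + design[i + 1:]
                if det_info(trial) > det_info(design):
                    design, improved = trial, True
        if not improved:
            break
    return sorted(design)
```

For a straight-line model on [-1, 1], the known D-optimal exact design splits the runs between the two extremes, and the exchange procedure recovers that from a degenerate starting design.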

Date Created
  • 2015


Simulation-based Bayesian optimal accelerated life test design and model discrimination

Description

Accelerated life testing (ALT) is the process of subjecting a product to stress conditions (temperature, voltage, pressure, etc.) in excess of its normal operating levels to accelerate failures. Product failure typically results from multiple stresses acting simultaneously. Multi-stress-factor ALTs are challenging because the number of stress factor-level combinations, and hence the number of experiments, grows with the number of factors. Chapter 2 provides an approach for designing ALT plans with multiple stresses utilizing Latin hypercube designs, which reduces the simulation cost without loss of statistical efficiency. A comparison to full-grid and large-sample approximation methods illustrates the approach's computational-cost gains and its flexibility in determining optimal stress settings with fewer assumptions and more intuitive unit allocations.
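The Latin hypercube construction that Chapter 2 builds on guarantees that each stress dimension is stratified evenly, so far fewer candidate points cover the stress space than a full grid would need. A generic sampler (not the thesis's design procedure) looks like this:

```python
# Generic Latin hypercube sampler: n points in [0, 1)^k where every
# dimension uses each of its n equal-width strata exactly once. This is
# the standard construction, not the ALT-specific design of the thesis.
import random

def latin_hypercube(n, k, seed=0):
    """Return n points in k dimensions with one-sample-per-stratum
    stratification in every dimension."""
    rng = random.Random(seed)
    cols = []
    for _ in range(k):
        perm = list(range(n))          # one stratum index per sample
        rng.shuffle(perm)
        cols.append([(p + rng.random()) / n for p in perm])
    return [tuple(cols[d][i] for d in range(k)) for i in range(n)]
```

Mapping each unit-cube coordinate to a stress range (e.g., a temperature interval) turns the sample into candidate multi-stress test conditions; a full grid over the same strata would need n^k points instead of n.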

Implicit in the design criteria of current ALT designs is the assumption that the form of the acceleration model is correct, which is unrealistic in many real-world problems. Chapter 3 provides an approach to optimal ALT design for model discrimination, utilizing the Hellinger distance between predictive distributions. The optimal ALT plan at three stress levels was determined and its performance compared to a good compromise plan, the best traditional plan, and the well-known 4:2:1 compromise test plan. In the case of linear versus quadratic ALT models, the proposed method increased the test plan's ability to distinguish among competing models and provided better guidance as to which model is appropriate for the experiment.

Chapter 4 extends the approach of Chapter 3 to sequential model discrimination for ALT. An initial experiment is conducted to provide the maximum possible information with respect to model discrimination, and the follow-on experiment is planned by leveraging the most current information to allow Bayesian model comparison through posterior model probability ratios. Results showed that the plan's performance is adversely impacted by the amount of censoring in the data. In the case of linear versus quadratic model forms at three levels of constant stress, sequential testing can improve the model recovery rate by approximately 8% when the data are complete, but no apparent advantage to sequential testing was found for right-censored data once censoring exceeds a certain amount.
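The discrimination criterion of Chapter 3 rests on a distance between predictive distributions. For two normal predictive distributions the Hellinger distance has a closed form, which the sketch below evaluates (this shows only the ingredient; the thesis maximizes such distances over candidate ALT plans, which is not reproduced here):

```python
# Closed-form Hellinger distance between two normal distributions,
# H(N(m1, s1^2), N(m2, s2^2)). Used here only to illustrate the
# model-discrimination ingredient; stress-plan optimization is omitted.
import math

def hellinger_normal(m1, s1, m2, s2):
    """Hellinger distance in [0, 1]; 0 iff the distributions coincide."""
    h2 = 1.0 - math.sqrt(2.0 * s1 * s2 / (s1 ** 2 + s2 ** 2)) * \
        math.exp(-((m1 - m2) ** 2) / (4.0 * (s1 ** 2 + s2 ** 2)))
    return math.sqrt(h2)
```

A design is good for discrimination when the competing models' predictions at the chosen stress levels are far apart in this metric: identical predictions give distance 0, and the distance grows as the predicted means separate.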

Date Created
  • 2014


Cascading curtainmap: an interactive visualization for depicting large and flexible hierarchies

Description

In visualizing information hierarchies, icicle plots are efficient diagrams in that they provide the user a straightforward layout for the different levels of data in a hierarchy and enable the user to compare items based on item width. However, as the size of the hierarchy grows, the items in an icicle plot end up small and indistinguishable. In this thesis, by maintaining the positive characteristics of traditional icicle plots and incorporating new features such as a dynamic diagram and an active layer, we developed an interactive visualization that allows the user to selectively drill down or roll up to review different levels of data in a large hierarchy, to change the hierarchical structure to detect potential patterns, and to maintain an overall understanding of the current hierarchical structure.
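The geometry that icicle plots (and, by extension, the cascading curtainmap built on top of them) rely on is easy to state: each node gets a rectangle whose depth fixes its layer and whose width is proportional to its subtree's share of the parent. A minimal layout sketch, assuming a simple (label, children) tree and not reproducing any of the thesis's interactive features:

```python
# Hedged sketch of basic icicle-plot geometry: assign each node a
# rectangle (x, depth, width) with width proportional to subtree size.
# The interactive drill-down/roll-up of the thesis is not modeled here.

def leaf_count(tree):
    _, children = tree
    return 1 if not children else sum(leaf_count(c) for c in children)

def icicle_layout(tree, x0=0.0, width=1.0, depth=0, out=None):
    """tree: (label, [children]); returns {label: (x, depth, width)}."""
    out = {} if out is None else out
    label, children = tree
    out[label] = (x0, depth, width)
    total = sum(leaf_count(c) for c in children) or 1
    x = x0
    for child in children:
        w = width * leaf_count(child) / total
        icicle_layout(child, x, w, depth + 1, out)
        x += w
    return out
```

The width-proportionality is exactly what lets users compare items, and its breakdown for tiny widths in large hierarchies is the problem the thesis's interaction techniques address.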

Date Created
  • 2014

Dynamic management of inspection effort allocation in an international port of entry (POE)

Description

Every year, more than 11 million maritime containers and 11 million commercial trucks arrive in the United States, carrying all types of imported goods. As it would be costly to inspect every container, only a fraction of them are inspected before being allowed to proceed into the United States. This dissertation proposes a decision support system that aims to allocate scarce inspection resources at a land POE (L-POE) so as to minimize the costs associated with the inspection process, including those of delaying the entry of legitimate imports. Given the ubiquity of sensors in all aspects of the supply chain, it is necessary to have automated decision systems that incorporate the information provided by these sensors, and by other possible channels, into the inspection planning process. The inspection planning system proposed in this dissertation decomposes the inspection effort allocation process into two phases: primary and detailed inspection planning. The former helps decide what to inspect, the latter how to conduct the inspections. A multi-objective optimization (MOO) model is developed for primary inspection planning. This model balances the direct and expected costs of conducting inspections against the waiting time of the trucks. The resulting model is exploited in two ways: one is to construct a complete or partial efficient frontier for the MOO model with the diversity of Pareto-optimal solutions maximized; the other is to evaluate a given inspection plan and provide possible suggestions for improvement. The methodologies are described in detail and case studies are provided. The case studies show that this MOO-based primary planning model can effectively pick out the non-conforming trucks to inspect while balancing costs and waiting time.

Date Created
  • 2012


Novel statistical models for complex data structures

Description

Rapid advances in sensor and information technology have resulted in spatially and temporally data-rich environments, which creates a pressing need to develop novel statistical methods and the associated computational tools to extract intelligent knowledge and informative patterns from these massive datasets. The statistical challenges posed by these massive datasets lie in their complex structures, such as high dimensionality, hierarchy, multi-modality, heterogeneity, and data uncertainty. Beyond the statistical challenges, the associated computational approaches are also essential for achieving efficiency, effectiveness, and numerical stability in practice. On the other hand, some recent developments in statistics and machine learning, such as sparse learning and transfer learning, as well as some traditional methodologies that still hold potential, such as multi-level models, all shed light on addressing these complex datasets in a statistically powerful and computationally efficient way. In this dissertation, we identify four kinds of complex datasets, "high-dimensional datasets", "hierarchically-structured datasets", "multimodality datasets" and "data uncertainties", which are ubiquitous in many domains, such as biology, medicine, neuroscience, health care delivery, and manufacturing. We describe the development of novel statistical models to analyze complex datasets that fall under these four categories, and we show how these models can be applied to real-world applications such as Alzheimer's disease research, the nursing care process, and manufacturing.

Date Created
  • 2012


Modeling time series data for supervised learning

Description

Temporal data are increasingly prevalent and important in analytics. Time series (TS) data are chronological sequences of observations and an important class of temporal data. Fields such as medicine, finance, learning science, and multimedia naturally generate TS data. Each series provides a high-dimensional data vector that challenges the learning of the relevant patterns. This dissertation proposes TS representations and methods for supervised TS analysis. The approaches combine new representations that handle translations and dilations of patterns with bag-of-features strategies and tree-based ensemble learning. This provides flexibility in handling time-warped patterns in a computationally efficient way. The ensemble learners provide a classification framework that can handle high-dimensional feature spaces, multiple classes, and interactions between features. The proposed representations are useful for classification and interpretation of TS data of varying complexity. The first contribution handles the problem of time warping with a feature-based approach. An interval selection and local feature extraction strategy is proposed to learn a bag-of-features representation. This is distinctly different from common similarity-based time warping and allows additional features (such as pattern location) to be easily integrated into the models. The learners can account for temporal information through the recursive partitioning method. The second contribution focuses on the comprehensibility of the models. A new representation is integrated with local feature importance measures from tree-based ensembles to diagnose and interpret time intervals that are important to the model. Multivariate time series (MTS) are especially challenging because the input consists of a collection of TS, and both features within a TS and interactions between TS can be important to models. Another contribution uses a different representation to produce computationally efficient strategies that learn a symbolic representation for MTS. Relationships between the multiple TS, nominal attributes, and missing values are handled with tree-based learners. Applications such as speech recognition, medical diagnosis, and gesture recognition illustrate the methods. Experimental results show that the TS representations and methods provide better results than competitive methods on a comprehensive collection of benchmark datasets. Moreover, the proposed approaches naturally provide solutions to similarity analysis, predictive pattern discovery, and feature selection.
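The interval selection and local feature extraction step can be sketched in its simplest form: summarize each series by per-interval statistics (here just mean and slope), so pattern location is preserved while exact time alignment no longer matters. This is a stripped-down illustration of the bag-of-features idea, not the dissertation's full representation:

```python
# Simplified bag-of-features sketch: describe a time series by the mean
# and slope of each of n_intervals equal segments. The resulting feature
# vectors would then feed a tree-based ensemble classifier (not shown).
from statistics import mean

def interval_features(series, n_intervals):
    """Return [(interval_mean, interval_slope), ...], one per interval."""
    feats, n = [], len(series)
    bounds = [round(i * n / n_intervals) for i in range(n_intervals + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        seg = series[lo:hi]
        m = mean(seg)
        slope = (seg[-1] - seg[0]) / max(len(seg) - 1, 1)
        feats.append((m, slope))
    return feats
```

Two series containing the same local pattern at slightly shifted positions yield similar interval features, which is how this style of representation sidesteps expensive similarity-based time warping.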

Date Created
  • 2012