Matching Items (41)
Description
Identifying important variation patterns is a key step to identifying root causes of process variability. This gives rise to a number of challenges. First, the variation patterns might be non-linear in the measured variables, while the existing research literature has focused on linear relationships. Second, it is important to remove noise from the dataset in order to visualize the true nature of the underlying patterns. Third, in addition to visualizing the pattern (preimage), it is also essential to understand the relevant features that define the process variation pattern. This dissertation considers these variation challenges. A base kernel principal component analysis (KPCA) algorithm transforms the measurements to a high-dimensional feature space where non-linear patterns in the original measurement can be handled through linear methods. However, the principal component subspace in feature space might not be well estimated (especially from noisy training data). An ensemble procedure is constructed where the final preimage is estimated as the average from bagged samples drawn from the original dataset to attenuate noise in kernel subspace estimation. This improves the robustness of any base KPCA algorithm. In a second method, successive iterations of denoising a convex combination of the training data and the corresponding denoised preimage are used to produce a more accurate estimate of the actual denoised preimage for noisy training data. The number of primary eigenvectors chosen in each iteration is also decreased at a constant rate. An efficient stopping rule criterion is used to reduce the number of iterations. A feature selection procedure for KPCA is constructed to find the set of relevant features from noisy training data. Data points are projected onto sparse random vectors. Pairs of such projections are then matched, and the differences in variation patterns within pairs are used to identify the relevant features. This approach provides robustness to irrelevant features by calculating the final variation pattern from an ensemble of feature subsets. Experiments are conducted using several simulated as well as real-life data sets. The proposed methods show significant improvement over the competitive methods.
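As an illustration of the bagged preimage idea above, the sketch below averages KPCA preimages over bootstrap replicates of a noisy dataset. It is a minimal sketch, not the dissertation's algorithm: it assumes scikit-learn's KernelPCA with an RBF kernel, and the kernel width, number of components, and number of bags are placeholder values.

```python
# Sketch: ensemble (bagged) KPCA denoising -- average the preimages obtained
# from KPCA models fit on bootstrap samples of the noisy training data.
# gamma, n_components, and n_bags are illustrative choices, not the thesis settings.
import numpy as np
from sklearn.decomposition import KernelPCA

def bagged_kpca_preimage(X, n_components=4, gamma=1.0, n_bags=25, seed=0):
    rng = np.random.default_rng(seed)
    preimages = np.zeros((n_bags, *X.shape))
    for b in range(n_bags):
        idx = rng.integers(0, len(X), size=len(X))            # bootstrap sample
        kpca = KernelPCA(n_components=n_components, kernel="rbf",
                         gamma=gamma, fit_inverse_transform=True)
        kpca.fit(X[idx])
        # project every point onto the bagged subspace and map back (preimage)
        preimages[b] = kpca.inverse_transform(kpca.transform(X))
    return preimages.mean(axis=0)                             # averaged preimage

# Example: denoise a noisy half-moon variation pattern
if __name__ == "__main__":
    from sklearn.datasets import make_moons
    X, _ = make_moons(n_samples=200, noise=0.15, random_state=1)
    X_denoised = bagged_kpca_preimage(X, n_components=4, gamma=5.0)
    print(X_denoised[:3])
```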
Contributors: Sahu, Anshuman (Author) / Runger, George C. (Thesis advisor) / Wu, Teresa (Committee member) / Pan, Rong (Committee member) / Maciejewski, Ross (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
The poor energy efficiency of buildings is a major barrier to alleviating the energy dilemma. Historically, monthly utility billing data were widely available, and analytical methods for identifying building energy efficiency improvements, performing building Monitoring and Verification (M&V), and continuous commissioning (CCx) were based on them. Although robust, these methods were not sensitive enough to detect a number of common causes of increased energy use. In recent years, the prevalence of short-term building energy consumption data, also known as Energy Interval Data (EID), made available through smart meters, together with data mining techniques, has created the potential for knowledge discovery from this data. More sophisticated analytical tools can be developed whose higher prediction accuracies yield greater sensitivity, leading to deep energy savings and highly efficient building system operations. This research explores enhancements to inverse statistical modeling techniques enabled by the availability of EID. Inverse statistical modeling is the process of identifying a prediction model structure and estimating its parameters. The methodology is based on several common statistical and data mining techniques: cluster analysis for day typing, outlier detection and removal, and generation of building schedules. Inverse methods are simpler to develop and require fewer inputs for model identification. They can model changes in energy consumption based on changes in climatic variables and, to a certain extent, occupancy. This makes them easy to use and appealing to building managers for evaluating general retrofits, building condition monitoring, continuous commissioning, and short-term load forecasting (STLF). After evaluating several model structures, an elegant model form was derived for daily energy consumption, which can be extended to model energy consumption for any specific hour by adding corrective terms. Additionally, adding autoregressive (AR) terms to this model makes it usable for STLF. Two buildings, a synthetic one (the ASHRAE medium-office prototype) and an actual office building, were modeled using these techniques. The proposed methodologies have several novel features compared with how these models have been described earlier. Finally, this thesis investigates characteristic fault signature identification from detailed simulation models and subsequent inverse analysis.
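A minimal sketch of the day-typing plus inverse-modeling workflow described above: cluster daily load profiles from interval data into day types, then fit a simple temperature-driven regression of daily energy per day type. The column names ("kWh", "T_out"), the use of k-means and ordinary least squares, and the number of day types are illustrative assumptions, not the dissertation's exact model form.

```python
# Sketch: day-typing of energy interval data with k-means, then a per-day-type
# linear model of daily energy vs. mean outdoor temperature.
# Assumes a DataFrame with a DatetimeIndex and columns "kWh" and "T_out".
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def fit_daytype_models(df, n_daytypes=3):
    # hourly average load profile for each day (days x hours matrix)
    profiles = df["kWh"].groupby([df.index.date, df.index.hour]).mean().unstack()
    daytypes = KMeans(n_clusters=n_daytypes, n_init=10,
                      random_state=0).fit_predict(profiles)
    daily = pd.DataFrame({
        "kWh": df["kWh"].groupby(df.index.date).sum(),
        "T_out": df["T_out"].groupby(df.index.date).mean(),
    })
    daily["daytype"] = daytypes
    models = {}
    for d, grp in daily.groupby("daytype"):
        # simple temperature-driven daily model for this day type
        models[d] = LinearRegression().fit(grp[["T_out"]], grp["kWh"])
    return daytypes, models
```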
Contributors: Jalori, Saurabh (Author) / Reddy, T. Agami (Thesis advisor) / Bryan, Harvey (Committee member) / Runger, George C. (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Improving the quality of Origin-Destination (OD) demand estimates increases the effectiveness of the design, evaluation, and implementation of traffic planning and management systems. The associated bilevel Sensor Location Flow-Estimation problem considers two important research questions: (1) how to compute the best estimates of the flows of interest using anticipated data from given candidate sensor locations; and (2) how to decide on the optimal subset of links where sensors should be located. In this dissertation, a decision framework is developed to optimally locate sensors and obtain high-quality OD volume estimates in vehicular traffic networks. The framework includes a traffic assignment model to load the OD traffic volumes on routes in a known choice set, a sensor location model to decide on which subset of links to locate counting sensors to observe traffic volumes, and an estimation model to obtain the best estimates of OD or route flow volumes. The dissertation first addresses the deterministic route flow estimation problem given a priori knowledge of route flows and their uncertainties. Two procedures are developed to locate "perfect" and "noisy" sensors, respectively. Next, it addresses a stochastic route flow estimation problem. A hierarchical linear Bayesian model is developed, where the real route flows are assumed to be generated from a multivariate normal distribution with two parameters: the mean and the variance-covariance matrix. The prior knowledge of the mean parameter is described by a probability distribution. When the variance-covariance matrix is assumed known, a Bayesian A-optimal design is developed. When the variance-covariance matrix is unknown, a Markov chain Monte Carlo approach is used to estimate the posterior quantities. In all the sensor location models, the objective is to maximize the reduction in the variances of the distribution of the OD volume estimates. The developed models are compared with other models available in the literature and are shown to perform better.
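A minimal sketch of the known-covariance (Bayesian A-optimal) case described above: under a linear-Gaussian model relating link counts to route flows, greedily choose the links whose counters most reduce the trace of the posterior covariance of the flows. The incidence matrix, prior covariance, and noise variance below are toy values, not the dissertation's networks.

```python
# Sketch: greedy Bayesian A-optimal sensor (link) selection for route-flow
# estimation with a linear-Gaussian model and known prior covariance.
# Observation on link l: y_l = a_l^T x + noise, a_l = row l of incidence matrix A.
import numpy as np

def greedy_a_optimal(A, prior_cov, noise_var, n_sensors):
    Sigma = prior_cov.copy()
    chosen = []
    for _ in range(n_sensors):
        best_link, best_trace, best_Sigma = None, np.inf, None
        for l in range(A.shape[0]):
            if l in chosen:
                continue
            a = A[l]
            # rank-one posterior update for adding a counter on link l
            s = a @ Sigma @ a + noise_var
            Sigma_post = Sigma - np.outer(Sigma @ a, a @ Sigma) / s
            if np.trace(Sigma_post) < best_trace:
                best_link, best_trace, best_Sigma = l, np.trace(Sigma_post), Sigma_post
        chosen.append(best_link)
        Sigma = best_Sigma
    return chosen, Sigma      # selected links and posterior covariance of route flows

# Toy example: 6 candidate links, 4 routes, independent prior on route flows
A = np.random.default_rng(0).integers(0, 2, size=(6, 4)).astype(float)
links, post = greedy_a_optimal(A, prior_cov=100.0 * np.eye(4),
                               noise_var=1.0, n_sensors=3)
print(links, np.trace(post))
```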
Contributors: Wang, Ning (Author) / Mirchandani, Pitu (Thesis advisor) / Murray, Alan (Committee member) / Pendyala, Ram (Committee member) / Runger, George C. (Committee member) / Zhang, Muhong (Committee member) / Arizona State University (Publisher)
Created: 2013
Description
Public health surveillance is a special case of the general problem where counts (or rates) of events are monitored for changes. Modern data complement event counts with many additional measurements (such as geographic, demographic, and others) that comprise high-dimensional covariates. This leads to an important challenge: detecting a change that occurs only within an initially unspecified region defined by these covariates. Current methods are typically limited to spatial and/or temporal covariate information and often fail to use all the information available in modern data, which can be paramount in unveiling these subtle changes. Additional complexities associated with modern health data that are often not accounted for by traditional methods include covariates of mixed type, missing values, and high-order interactions among covariates. This work proposes transforming public health surveillance into supervised learning, so that an appropriate learner can inherently address all the complexities described previously. At the same time, quantitative measures from the learner can be used to define signal criteria to detect changes in rates of events. A Feature Selection (FS) method is used to identify covariates that contribute to a model and to generate a signal. A measure of statistical significance is included to control false alarms. An alternative Percentile method identifies the specific cases that lead to changes using class probability estimates from tree-based ensembles. This second method is intended to be less computationally intensive and significantly simpler to implement. Finally, a third method, labeled Rule-Based Feature Value Selection (RBFVS), is proposed for identifying the specific regions in high-dimensional space where the changes are occurring. Results on simulated examples are used to compare the FS method and the Percentile method. Note that this work emphasizes the application of the proposed methods to public health surveillance; nonetheless, they can easily be extended to a variety of applications where counts (or rates) of events are monitored for changes. Such problems commonly occur in domains such as manufacturing, economics, environmental systems, and engineering, as well as in public health.
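A minimal sketch of the transform-to-supervised-learning idea: label cases from a baseline window as class 0 and cases from the current window as class 1, fit a tree-based ensemble on the covariates, and signal when the classifier separates the two windows better than chance. The AUC threshold and the use of a random forest are illustrative stand-ins for the dissertation's FS and Percentile criteria.

```python
# Sketch: cast change detection over high-dimensional covariates as supervised
# learning -- baseline window (class 0) vs. current window (class 1).
# Cross-validated AUC near 0.5 means no detectable change in the case mix.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def change_signal(X_baseline, X_current, auc_threshold=0.60):
    X = np.vstack([X_baseline, X_current])
    y = np.r_[np.zeros(len(X_baseline)), np.ones(len(X_current))]
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    clf.fit(X, y)
    # covariates driving the separation hint at the region where the change occurs
    top_features = np.argsort(clf.feature_importances_)[::-1][:5]
    return auc > auc_threshold, auc, top_features
```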
Contributors: Davila, Saylisse (Author) / Runger, George C. (Thesis advisor) / Montgomery, Douglas C. (Committee member) / Young, Dennis (Committee member) / Gel, Esma (Committee member) / Arizona State University (Publisher)
Created: 2010
Description
The study of canine cancer’s molecular underpinnings holds great potential for informing veterinary and human oncology. Sporadic canine cancers are highly abundant (~4 million diagnoses/year in the United States), and the dog’s unique genomic architecture, shaped by selective inbreeding, together with the high similarity between dog and human genomes, confers power for improving the understanding of cancer genes. However, characterization of canine cancer genome landscapes has been limited, hindered by a lack of canine-specific tools and resources. To enable robust and reproducible comparative genomic analysis of canine cancers, I have developed a workflow for somatic and germline variant calling in canine cancer genomic data. I first adapted a human cancer genomics pipeline to create a semi-automated canine pipeline used to map the genomic landscapes of canine melanoma, lung adenocarcinoma, osteosarcoma, and lymphoma. This pipeline also forms the backbone of my novel comparative genomics workflow.

Practical impediments to comparative genomic analysis of dog and human include the challenge of identifying similarities in mutation type and function across species; for example, canine genes and their human orthologs may have evolved to perform different functions. Hence, I undertook a systematic statistical evaluation of dog and human cancer genes and assessed functional similarities and differences between orthologs to improve understanding of the roles of these genes in cancer across species. I tested this pipeline on canine and human Diffuse Large B-Cell Lymphoma (DLBCL), given that canine DLBCL is the most comprehensively genomically characterized canine cancer. Logistic regression over genes bearing somatic coding mutations in each cancer was used to determine whether conservation metrics (sequence identity, network placement, etc.) could explain co-mutation of genes in both species. Using this model, I identified 25 co-mutated and evolutionarily similar genes that may be compelling cross-species cancer genes. For example, PCLO was identified as a co-mutated conserved gene; PCLO had previously been identified as recurrently mutated in human DLBCL but has an unclear role in oncogenesis. Further investigation of these genes might shed new light on the biology of lymphoma in dogs and humans, and this approach may more broadly serve to prioritize new genes for comparative cancer biology studies.
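A minimal sketch of the modeling step described above: a logistic regression of a gene's co-mutation status on conservation metrics such as sequence identity and network placement. The input file and column names are hypothetical placeholders, and statsmodels is assumed for the fit.

```python
# Sketch: does evolutionary conservation explain co-mutation across species?
# Logistic regression of co-mutation status on conservation metrics.
# The file name and columns (gene, seq_identity, network_degree, co_mutated)
# are hypothetical placeholders for the kind of table this analysis uses.
import pandas as pd
import statsmodels.formula.api as smf

genes = pd.read_csv("ortholog_conservation_metrics.csv")   # assumed input table
model = smf.logit("co_mutated ~ seq_identity + network_degree", data=genes).fit()
print(model.summary())

# Rank genes by predicted probability to prioritize cross-species cancer genes
genes["score"] = model.predict(genes)
print(genes.sort_values("score", ascending=False).head(25)[["gene", "score"]])
```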
Contributors: Sivaprakasam, Karthigayini (Author) / Dinu, Valentin (Thesis advisor) / Trent, Jeffrey (Thesis advisor) / Hendricks, William (Committee member) / Runger, George C. (Committee member) / Arizona State University (Publisher)
Created: 2018
Description
Accounting for over a third of all emerging and re-emerging infections, viruses represent a major public health threat, which researchers and epidemiologists across the world have been attempting to contain for decades. Recently, genomics-based surveillance of viruses through methods such as virus phylogeography has grown into a popular tool for infectious disease monitoring. When conducting such surveillance studies, researchers need to manually retrieve geographic metadata denoting the location of infected host (LOIH) of viruses from public sequence databases such as GenBank and any publication related to their study. The large volume of semi-structured and unstructured information that must be reviewed for this task, along with the ambiguity of geographic locations, make it especially challenging. Prior work has demonstrated that the majority of GenBank records lack sufficient geographic granularity concerning the LOIH of viruses. As a result, reviewing full-text publications is often necessary for conducting in-depth analysis of virus migration, which can be a very time-consuming process. Moreover, integrating geographic metadata pertaining to the LOIH of viruses from different sources, including different fields in GenBank records as well as full-text publications, and normalizing the integrated metadata to unique identifiers for subsequent analysis, are also challenging tasks, often requiring expert domain knowledge. Therefore, automated information extraction (IE) methods could help significantly accelerate this process, positively impacting public health research. However, very few research studies have attempted the use of IE methods in this domain.

This work explores the use of novel knowledge-driven geographic IE heuristics for extracting, integrating, and normalizing the LOIH of viruses based on information available in GenBank and related publications; when evaluated on manually annotated test sets, the methods were found to have a high accuracy and shown to be adequate for addressing this challenging problem. It also presents GeoBoost, a pioneering software system for georeferencing GenBank records, as well as a large-scale database containing over two million virus GenBank records georeferenced using the algorithms introduced here. The methods, database and software developed here could help support diverse public health domains focusing on sequence-informed virus surveillance, thereby enhancing existing platforms for controlling and containing disease outbreaks.
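A minimal sketch of one knowledge-driven heuristic of the kind described above: pull the /country qualifier from a GenBank flat-file record with a regular expression and normalize it against a gazetteer. The gazetteer entries and identifiers below are toy placeholders; GeoBoost integrates many more metadata fields and full-text evidence.

```python
# Sketch: extract and normalize the location of infected host (LOIH) from the
# /country qualifier of a GenBank flat-file record. The gazetteer and its ids
# are toy placeholders; the real system also uses strain names, notes, and
# linked publications to reach finer geographic granularity.
import re

GAZETTEER = {                      # surface form -> (normalized name, toy gazetteer id)
    "viet nam": ("Vietnam", "geo:vn"),
    "viet nam: ho chi minh city": ("Ho Chi Minh City, Vietnam", "geo:vn-hcmc"),
    "usa: az": ("Arizona, United States", "geo:us-az"),
}

def extract_loih(genbank_text):
    m = re.search(r'/country="([^"]+)"', genbank_text)
    if not m:
        return None                                # fall back to other fields or full text
    raw = m.group(1).strip().lower()
    return GAZETTEER.get(raw, (m.group(1), None))  # unresolved forms keep the raw string

record = '''
     source          1..1497
                     /organism="Influenza A virus"
                     /country="Viet Nam"
'''
print(extract_loih(record))   # ('Vietnam', 'geo:vn')
```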
Contributors: Tahsin, Tasnia (Author) / Gonzalez, Graciela (Thesis advisor) / Scotch, Matthew (Thesis advisor) / Runger, George C. (Committee member) / Arizona State University (Publisher)
Created: 2019
Description
In conventional supervised learning tasks, information retrieval from extensive collections of data happens automatically at low cost, whereas in many real-world problems obtaining labeled data can be hard, time-consuming, and expensive. Consider healthcare systems, for example, where unlabeled medical images are abundant while labeling requires a considerable amount of knowledge from experienced physicians. Active learning addresses this challenge with an iterative process to select instances from the unlabeled data to annotate and improve the supervised learner. At each step, the query of examples to be labeled can be considered as a dilemma between exploitation of the supervised learner's current knowledge and exploration of the unlabeled input features.

Motivated by the need for efficient active learning strategies, this dissertation proposes new algorithms for batch-mode, pool-based active learning. The research considers the following questions: how can unsupervised knowledge of the input features (exploration) improve learning when incorporated with supervised learning (exploitation)? How can exploration be characterized in active learning when the data are high-dimensional? Finally, how can a balance between exploration and exploitation be struck adaptively?

The first contribution proposes a new active learning algorithm, Cluster-based Stochastic Query-by-Forest (CSQBF), which provides a batch-mode strategy that accelerates learning with added value from exploration and improved exploitation scores. CSQBF balances exploration and exploitation using a probabilistic scoring criterion based on classification probabilities from a tree-based ensemble model within each data cluster.
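A minimal sketch of a CSQBF-style batch query (illustrative, not the dissertation's exact scoring criterion): spread the batch across k-means clusters of the unlabeled pool for exploration, and within each cluster prefer the point with the smallest probability margin under a random-forest committee for exploitation. The cluster count, batch size, and margin score are assumptions.

```python
# Sketch of a cluster-based batch-mode query: exploration = cover k-means
# clusters of the unlabeled pool, exploitation = pick each cluster's most
# uncertain point (smallest class-probability margin from a random forest).
# Assumes at least two classes are present in the labeled seed set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def query_batch(X_labeled, y_labeled, X_pool, batch_size=10, n_clusters=10):
    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_labeled, y_labeled)
    proba = forest.predict_proba(X_pool)
    top2 = np.sort(proba, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]                    # small margin = uncertain
    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=0).fit_predict(X_pool)
    # visit clusters in order of their most uncertain member, take that member
    best_per_cluster = [margin[clusters == c].min() for c in range(n_clusters)]
    picks = []
    for c in np.argsort(best_per_cluster):
        members = np.where(clusters == c)[0]
        picks.append(members[np.argmin(margin[members])])
        if len(picks) == batch_size:
            break
    return np.array(picks)                              # pool indices to label next
```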

The second contribution introduces two more query strategies, Double Margin Active Learning (DMAL) and Cluster Agnostic Active Learning (CAAL), that combine consistent exploration and exploitation modules into a coherent and unified measure for label query. Instead of assuming a fixed clustering structure, CAAL and DMAL adopt a soft-clustering strategy which provides a new approach to formalize exploration in active learning.

The third contribution addresses the challenge of dynamically balancing exploration and exploitation criteria throughout the active learning process. Two adaptive algorithms are proposed based on feedback-driven bandit optimization frameworks that elegantly handle this issue by learning the relationship between the exploration-exploitation trade-off and the active learner's performance.
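A minimal sketch of the feedback-driven idea: treat the exploration and exploitation query strategies as two bandit arms and adapt their use from the accuracy gain observed after each labeled batch. The plain UCB1 rule below is a stand-in for the dissertation's adaptive algorithms.

```python
# Sketch: a UCB1 bandit over two query strategies ("explore", "exploit"),
# rewarded by the improvement in held-out accuracy after each labeled batch.
import math

class StrategyBandit:
    def __init__(self, arms=("explore", "exploit")):
        self.arms = list(arms)
        self.counts = {a: 0 for a in self.arms}
        self.mean_reward = {a: 0.0 for a in self.arms}

    def choose(self):
        for a in self.arms:                      # play each arm once first
            if self.counts[a] == 0:
                return a
        total = sum(self.counts.values())
        ucb = {a: self.mean_reward[a] + math.sqrt(2 * math.log(total) / self.counts[a])
               for a in self.arms}
        return max(ucb, key=ucb.get)

    def update(self, arm, reward):               # reward = accuracy gain of the last batch
        self.counts[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]

# Usage in the active-learning loop:
#   bandit = StrategyBandit()
#   arm = bandit.choose()                        # pick the batch strategy
#   ...label the batch chosen by that strategy, retrain, re-evaluate...
#   bandit.update(arm, new_accuracy - old_accuracy)
```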
Contributors: Shams, Ghazal (Author) / Runger, George C. (Thesis advisor) / Montgomery, Douglas C. (Committee member) / Escobedo, Adolfo (Committee member) / Pedrielli, Giulia (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
The Cognitive Decision Support (CDS) model is proposed. The model is widely applicable and scales to realistic, complex decision problems based on adaptive learning. The utility of a decision is discussed, and four types of decisions associated with the CDS model are identified. The CDS model is designed to learn decision utilities. Data enrichment is introduced to promote the effectiveness of learning. Grouping is introduced for large-scale decision learning. Introspection and adjustment are presented for adaptive learning. Triage recommendation is incorporated to indicate the trustworthiness of suggested decisions.

The CDS model and methodologies are integrated into an architecture using concepts from cognitive computing. The proposed architecture is implemented and applied to an example use case in inventory management.

Reinforcement learning (RL) is discussed as an alternative, generalized adaptive learning engine for the CDS system to handle the complexity of many problems with unknown environments. An adaptive state dimension with context, which can increase as new information becomes available, is discussed. Several enhanced RL components that are critical for complex use cases are integrated. Deep Q-networks are embedded with the adaptive learning methodologies and applied to an example supply chain management problem on capacity planning.

A new approach using Ito stochastic processes is proposed as a more generalized method to generate non-stationary demands in various patterns that can be used in decision problems. The proposed method generates demands with varying non-stationary patterns, including trend, cyclical, seasonal, and irregular patterns. Conventional approaches are identified as special cases of the proposed method. Demands are illustrated in realistic settings for various decision models. Various statistical criteria are applied to filter the generated demands. The method is applied to a real-world example.
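A minimal sketch of generating non-stationary demand from an Ito process: an Euler-Maruyama simulation of geometric Brownian motion (trend and irregular components) modulated by a sinusoidal seasonal factor. The drift, volatility, seasonal amplitude, and horizon are illustrative, and the dissertation's formulation covers more general patterns.

```python
# Sketch: Euler-Maruyama simulation of a demand process with trend (GBM drift),
# seasonality (sinusoidal modulation), and irregular noise (Brownian term).
#   dX_t = mu * X_t dt + sigma * X_t dW_t,  demand_t = X_t * (1 + A*sin(2*pi*t/period))
# Parameter values are illustrative, not the dissertation's settings.
import numpy as np

def simulate_demand(x0=100.0, mu=0.10, sigma=0.20, amp=0.3, period=52,
                    n_steps=208, dt=1.0 / 52, seed=0):
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        dW = rng.normal(0.0, np.sqrt(dt))
        x[t] = x[t - 1] + mu * x[t - 1] * dt + sigma * x[t - 1] * dW
    t = np.arange(n_steps)
    seasonal = 1.0 + amp * np.sin(2 * np.pi * t / period)
    return np.maximum(x * seasonal, 0.0)          # non-negative weekly demand

demand = simulate_demand()
print(demand[:5].round(1))
```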
Contributors: Kee, Seho (Author) / Runger, George C. (Thesis advisor) / Escobedo, Adolfo (Committee member) / Gel, Esma (Committee member) / Janakiram, Mani (Committee member) / Rogers, Dale (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
One of the main challenges in testing artificial intelligence (AI) enabled cyber-physical systems (CPS), such as autonomous driving systems and internet-of-things (IoT) medical devices, is the presence of machine learning components, for which formal properties are difficult to establish. In addition, the circumstances under which operational components interact, the inclusion of humans in the loop, and environmental changes result in a myriad of safety concerns, not all of which can be comprehensively tested before deployment, and some of which may not even be detected during the design and testing phase. This dissertation identifies major challenges of safety verification of AI-enabled safety-critical systems and addresses the safety problem by proposing an operational safety verification technique that relies on solving the following subproblems:
1. Given input/output operational traces collected from sensors/actuators, automatically learn a hybrid automaton (HA) representation of the AI-enabled CPS.
2. Given the learned HA, evaluate the operational safety of the AI-enabled CPS in the field.
This dissertation presents novel approaches for learning hybrid automata models from time-series traces collected from the operation of the AI-enabled CPS in the real world, for linear and non-linear CPS. The learned model allows operational safety to be stringently evaluated by comparing the learned HA model against a reference specification model of the system. The proposed techniques are evaluated on the artificial pancreas control system.
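A minimal sketch of the model-learning step, not the dissertation's procedure: estimate local linear dynamics over sliding windows of an input/output trace, cluster those estimates to discover modes, and refit one linear model per mode. The window size, number of modes, and the least-squares/k-means choices are assumptions; the safety evaluation would then compare the learned modes and transitions against the reference specification model.

```python
# Sketch: discover modes of a hybrid automaton from an I/O trace by clustering
# windowed least-squares estimates of local linear dynamics, then refit
# x[k+1] = A x[k] + B u[k] per mode.
import numpy as np
from sklearn.cluster import KMeans

def learn_modes(x, u, window=20, n_modes=2):
    # x: (T, n) observed states/outputs, u: (T, m) inputs, fixed sampling rate
    T = len(x)
    starts = list(range(0, T - window, window))
    local = []
    for k in starts:
        Phi = np.hstack([x[k:k + window], u[k:k + window]])     # regressors [x u]
        Y = x[k + 1:k + window + 1]                             # next states
        theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)         # local [A B]^T
        local.append(theta.ravel())
    labels = KMeans(n_clusters=n_modes, n_init=10,
                    random_state=0).fit_predict(np.array(local))
    modes = {}
    for m in range(n_modes):                                    # refit per mode
        segs = [k for k, lab in zip(starts, labels) if lab == m]
        Phi = np.vstack([np.hstack([x[k:k + window], u[k:k + window]]) for k in segs])
        Y = np.vstack([x[k + 1:k + window + 1] for k in segs])
        theta, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
        modes[m] = theta.T                                      # [A B] for mode m
    return labels, modes      # window-level mode labels and per-mode dynamics
```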
Contributors: Lamrani, Imane (Author) / Gupta, Sandeep Ks (Thesis advisor) / Banerjee, Ayan (Committee member) / Zhang, Yi (Committee member) / Runger, George C. (Committee member) / Rodriguez, Armando (Committee member) / Arizona State University (Publisher)
Created: 2020
Description
All biological processes, such as cell growth, cell differentiation, development, and aging, require a series of steps that are characterized by gene regulation. Studies have shown that gene regulation is key to various traits and diseases. Various factors affect gene regulation, including genetic signals, epigenetic tracks, and genetic variants. Deciphering and cataloging these functional genetic elements in the non-coding regions of the genome is one of the biggest challenges in precision medicine and genetic research. This thesis presents two different approaches to identifying these elements: TreeMap and DeepCORE. The first approach involves identifying putative causal genetic variants in cis-eQTL, accounting for multisite effects and genetic linkage at a locus. TreeMap performs an organized search for individual and multiple causal variants using a tree-guided nested machine learning method. DeepCORE, on the other hand, explores novel deep learning techniques that model the relationship between genetic, epigenetic, and transcriptional patterns across tissues and cell lines and identify co-operative regulatory elements that affect gene regulation. These two methods are believed to provide a link for genotype-phenotype association and a necessary step toward explaining various complex diseases and missing heritability.
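Not TreeMap or DeepCORE themselves, but a minimal sketch of the multisite cis-eQTL issue TreeMap addresses: regress a gene's expression on all cis variants jointly with the lasso so that redundant, linked variants are shrunk and a sparse set of putative causal variants remains. The genotype coding and inputs are assumptions.

```python
# Sketch (not TreeMap itself): joint multisite cis-eQTL regression with the
# lasso to keep a sparse set of putative causal variants despite linkage.
import numpy as np
from sklearn.linear_model import LassoCV

def prioritize_cis_variants(genotypes, expression, variant_ids):
    # genotypes: (n_samples, n_variants) minor-allele counts (0/1/2)
    # expression: (n_samples,) expression of one gene; variant_ids: length n_variants
    model = LassoCV(cv=5, random_state=0).fit(genotypes, expression)
    kept = np.flatnonzero(model.coef_)                   # variants surviving shrinkage
    ranked = kept[np.argsort(-np.abs(model.coef_[kept]))]
    return [(variant_ids[i], float(model.coef_[i])) for i in ranked]
```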
Contributors: Chandrashekar, Pramod Bharadwaj (Author) / Liu, Li (Thesis advisor) / Runger, George C. (Committee member) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)
Created: 2020