A Study on the Relationship between Data Leakage and Overoptimistic Estimates of the Performance of Machine Learning Models

Description

This thesis reports six computer-simulation experiments designed to replicate the negative association between sample size and accuracy that is repeatedly found in the ML literature, by accounting for data leakage and publication bias. Understanding the cause of this negative association is critical because published studies have repeatedly been reported to give overoptimistic accuracies for ML models, leading to results that cannot be reproduced despite multiple trials and experiments. After the negative association was replicated, parametric learning curves were fitted to the empirical learning curves in order to evaluate performance. Accuracy varied substantially when the sample size was small, but showed little to no variation when the sample size was large. In other words, at large sample sizes the empirical learning curves with data leakage and publication bias reached the same accuracy as the learning curve without data leakage.
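One common form of data leakage is selecting features on the full dataset before cross-validating. The following is a minimal sketch of how that inflates accuracy at a small sample size; it is not the thesis's experimental design. The pure-noise data, the nearest-centroid classifier, and all function names are assumptions made for illustration. Because the labels are independent of the features, any accuracy well above 50% is an artifact of leakage.

```python
import random
import statistics

def make_noise_data(n, p, rng):
    """Pure-noise features with balanced random labels (true accuracy is 50%)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    rng.shuffle(y)
    return X, y

def top_k_features(X, y, rows, k):
    """Rank features by the absolute difference of class means over `rows`."""
    p = len(X[0])
    scores = []
    for j in range(p):
        ones = [X[i][j] for i in rows if y[i] == 1]
        zeros = [X[i][j] for i in rows if y[i] == 0]
        scores.append(abs(statistics.fmean(ones) - statistics.fmean(zeros)))
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]

def nearest_centroid_accuracy(X, y, train, test, feats):
    """Fit class centroids on `train`, report accuracy on `test`."""
    centroids = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        centroids[c] = [statistics.fmean([X[i][j] for i in members]) for j in feats]
    correct = 0
    for i in test:
        dist = {
            c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, centroids[c]))
            for c in (0, 1)
        }
        pred = 0 if dist[0] <= dist[1] else 1
        correct += pred == y[i]
    return correct / len(test)

def cv_accuracy(X, y, k_feats, folds, leaky):
    """k-fold CV; with leaky=True the features are chosen using ALL rows."""
    rows = list(range(len(X)))
    leaked = top_k_features(X, y, rows, k_feats) if leaky else None
    accs = []
    for f in range(folds):
        test = rows[f::folds]
        train = [i for i in rows if i % folds != f]
        # Honest protocol: select features from the training fold only.
        feats = leaked if leaky else top_k_features(X, y, train, k_feats)
        accs.append(nearest_centroid_accuracy(X, y, train, test, feats))
    return statistics.fmean(accs)

rng = random.Random(0)
leaky_runs, honest_runs = [], []
for _ in range(20):
    X, y = make_noise_data(n=40, p=200, rng=rng)
    leaky_runs.append(cv_accuracy(X, y, k_feats=10, folds=5, leaky=True))
    honest_runs.append(cv_accuracy(X, y, k_feats=10, folds=5, leaky=False))

leaky_mean = statistics.fmean(leaky_runs)
honest_mean = statistics.fmean(honest_runs)
print(f"leaky CV accuracy:  {leaky_mean:.2f}")
print(f"honest CV accuracy: {honest_mean:.2f}")
```

Raising `n` while holding `p` fixed should shrink the gap between the two estimates, which is the qualitative pattern the abstract describes.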

Contributors: Kottooru, Rishab (Author) / Berisha, Visar (Thesis director) / Dasarathy, Gautam (Committee member) / Saidi, Pouria (Committee member) / Barrett, The Honors College (Contributor) / Chemical Engineering Program (Contributor)

Created: 2023-05

Evaluating the Efficiency of Quantum Simulators using Practical Application Benchmarks

Description

Quantum computing holds the potential to revolutionize various industries by solving problems that classical computers cannot solve efficiently. However, the construction of quantum computers is still in its infancy, and simulators are currently the best available option for exploring the potential of quantum computing. Therefore, developing comprehensive benchmarking suites for quantum computing simulators is essential to evaluate their performance and guide the development of future quantum algorithms and hardware. This study presents a systematic evaluation of quantum computing simulators’ performance using a benchmarking suite. The benchmarking suite is designed to meet the industry-standard performance benchmarks established by the Defense Advanced Research Projects Agency (DARPA) and includes standardized test data and comparison metrics that encompass a wide range of applications, deep neural network models, and optimization techniques. The thesis is divided into two parts, covering basic quantum algorithms and variational quantum algorithms for practical machine-learning tasks. In the first part, the run time and memory performance of quantum computing simulators are analyzed using basic quantum algorithms. The performance is evaluated using standardized test data and comparison metrics that cover fundamental quantum algorithms, including Quantum Fourier Transform (QFT), Inverse Quantum Fourier Transform (IQFT), Quantum Adder, and Variational Quantum Eigensolver (VQE). The analysis provides valuable insights into the simulators’ strengths and weaknesses and highlights the need for further development to enhance their performance. In the second part, benchmarks are developed using variational quantum algorithms for practical machine learning tasks such as image classification, natural language processing, and recommendation. The benchmarks address several unique challenges posed by benchmarking quantum machine learning (QML), including the effect of optimizations on time-to-solution, the stochastic nature of training, the inclusion of hybrid quantum-classical layers, and the diversity of software and hardware systems. The findings offer valuable insights into the simulators’ ability to solve practical machine-learning tasks and pinpoint areas for future research and enhancement. In conclusion, this study provides a rigorous evaluation of quantum computing simulators’ performance using a benchmarking suite that meets industry-standard performance benchmarks.
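To give a sense of what a run-time benchmark of a basic algorithm such as the QFT involves, here is a minimal sketch that times a naive state-vector QFT for increasing qubit counts. It is not the thesis's benchmarking suite or any particular simulator: `qft_statevector` and `time_qft` are hypothetical names, and the direct O(N^2) transform is an assumption chosen for brevity.

```python
import cmath
import math
import time

def qft_statevector(state):
    """Apply the QFT to a state vector by direct summation (O(N^2)).

    Amplitude k of the output is (1/sqrt(N)) * sum_j state[j] * exp(2*pi*i*j*k/N).
    """
    n = len(state)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(state[j] * cmath.exp(2j * math.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]

def time_qft(num_qubits, basis_index=0):
    """Time one QFT applied to the computational basis state |basis_index>."""
    dim = 2 ** num_qubits
    state = [0j] * dim
    state[basis_index] = 1 + 0j
    start = time.perf_counter()
    out = qft_statevector(state)
    elapsed = time.perf_counter() - start
    return out, elapsed

for q in range(2, 9):
    out, elapsed = time_qft(q)
    # A dense state vector stores 2**q complex amplitudes.
    print(f"{q} qubits: {2 ** q:4d} amplitudes, {elapsed * 1e3:8.3f} ms")
```

The number of amplitudes doubles with each added qubit, which is why both run time and memory are natural metrics when comparing simulators.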

Contributors: Sathyakumar, Rajesh (Author) / Spanias, Andreas (Thesis advisor) / Sen, Arunabha (Thesis advisor) / Dasarathy, Gautam (Committee member) / Arizona State University (Publisher)

Created: 2023

Machine Learning-based Analysis of the Relationship Between the Human Gut Microbiome and Bone Health

Description

The Human Gut Microbiome (GM) modulates a variety of structural, metabolic, and protective functions to benefit the host. A few recent studies also support the role of the gut microbiome in the regulation of bone health. The relationship between GM and bone health was analyzed using data collected from a group of twenty-three adolescent boys and girls who participated in a controlled feeding study, during which two different doses (0 g/d fiber and 12 g/d fiber) of Soluble Corn Fiber (SCF) were added to their diet. The analysis was performed by building a machine learning regression model that predicts Bone Mineral Density (BMD) and Bone Mineral Content (BMC), which are indicators of bone strength, from the GM sequence of proportions of 178 microbes collected from the 23 subjects. The model was evaluated under cross-validation using performance metrics such as Root Mean Squared Error, Pearson’s correlation coefficient, and Spearman’s rank correlation coefficient. A noticeable correlation was observed between the GM and bone health, and the overall prediction correlation was higher with SCF intervention (r ~ 0.51). The genera of microbes that played an important role in this relationship were identified. Eubacterium (g), Bacteroides (g), Megamonas (g), Acetivibrio (g), Faecalibacterium (g), and Paraprevotella (g) were some of the microbes that showed an increase in proportion with SCF intervention.
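The three evaluation metrics named above can be computed on cross-validated predictions as in the following minimal sketch. The data are synthetic, the model is a one-feature least-squares line, and leave-one-out is used as the cross-validation scheme; all of these are assumptions for illustration, not the thesis's data, model, or protocol.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root Mean Squared Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson(a, b):
    """Pearson's correlation coefficient."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one feature."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return slope, my - slope * mx

def leave_one_out_predictions(x, y):
    """Predict each sample from a model fitted on all the other samples."""
    preds = []
    for i in range(len(x)):
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        preds.append(slope * x[i] + intercept)
    return preds

# Synthetic stand-in: 23 samples, one feature, noisy linear target.
rng = random.Random(1)
x = [rng.random() for _ in range(23)]
y = [2.0 * xi + rng.gauss(0.0, 0.3) for xi in x]
preds = leave_one_out_predictions(x, y)
print(f"RMSE={rmse(y, preds):.3f}  Pearson={pearson(y, preds):.3f}  "
      f"Spearman={spearman(y, preds):.3f}")
```

Computing the metrics on held-out predictions, rather than on the training fit, matters most with small cohorts such as this one.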

Contributors: Ketha Hazarath, Pravallika Reddy (Author) / Bliss, Daniel (Thesis advisor) / Whisner, Corrie (Committee member) / Dasarathy, Gautam (Committee member) / Arizona State University (Publisher)

Created: 2020

Differentiable Programming for Physics-based Hyperspectral Unmixing

Description

Hyperspectral unmixing is an important remote sensing task with applications including material identification and analysis. Characteristic spectral features make many pure materials identifiable from their visible-to-infrared spectra, but quantifying their presence within a mixture is challenging due to nonlinearities and factors of variation. In this thesis, physics-based approaches are incorporated into an end-to-end spectral unmixing algorithm via differentiable programming. First, sparse regularization and constraints are implemented by adding differentiable penalty terms to a cost function to avoid unrealistic predictions. Second, a physics-based dispersion model is introduced to simulate realistic spectral variation, and an efficient method to fit its parameters is presented. This dispersion model is then used as a generative model within an analysis-by-synthesis spectral unmixing algorithm. Further, a technique for inverse rendering that uses a convolutional neural network to predict the parameters of the generative model is introduced to enhance performance and speed when training data are available. The results achieve state-of-the-art performance on both infrared and visible-to-near-infrared (VNIR) datasets compared to baselines, and show promise for future synergy between physics-based models and deep learning in hyperspectral unmixing.
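The idea of enforcing constraints through differentiable penalty terms can be sketched as follows. This is an illustration only, not the thesis's cost function: the linear mixing model, the smoothed absolute value used for sparsity, the quadratic sum-to-one and nonnegativity penalties, the weights, and the toy spectra are all assumptions. The sketch only evaluates the cost; in a differentiable-programming setting an automatic differentiation tool would supply its gradients.

```python
import math

def mix(endmembers, abundances):
    """Linear mixing: each band is the abundance-weighted sum of endmember values.

    `endmembers` is a list of bands; each band lists one value per endmember.
    """
    return [sum(a * e for a, e in zip(abundances, band)) for band in endmembers]

def unmixing_cost(y, endmembers, a, lam_sparse=0.01, lam_sum=1.0, lam_neg=1.0,
                  eps=1e-8):
    """Reconstruction error plus smooth penalties (all weights are assumptions)."""
    recon = mix(endmembers, a)
    fit = sum((yi - ri) ** 2 for yi, ri in zip(y, recon))
    # Smooth surrogate for |a_i|, differentiable at zero.
    sparsity = sum(math.sqrt(ai * ai + eps) for ai in a)
    # Soft constraint: abundances should sum to one.
    sum_to_one = (sum(a) - 1.0) ** 2
    # Soft constraint: penalize only the negative part of each abundance.
    nonnegativity = sum(min(ai, 0.0) ** 2 for ai in a)
    return fit + lam_sparse * sparsity + lam_sum * sum_to_one + lam_neg * nonnegativity

# Toy example: 3 spectral bands, 2 endmembers (made-up values).
E = [[0.9, 0.1], [0.5, 0.4], [0.2, 0.8]]
true_abundances = [0.7, 0.3]
y = mix(E, true_abundances)

cost_true = unmixing_cost(y, E, true_abundances)
cost_unrealistic = unmixing_cost(y, E, [1.2, -0.2])  # negative abundance
print(f"cost at true abundances:        {cost_true:.4f}")
print(f"cost at unrealistic abundances: {cost_unrealistic:.4f}")
```

Because every term is smooth, a gradient-based optimizer can trade off data fit against physical plausibility instead of requiring a separate projection step.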

Contributors: Janiczek, John (Author) / Jayasuriya, Suren (Thesis advisor) / Dasarathy, Gautam (Thesis advisor) / Christensen, Phil (Committee member) / Arizona State University (Publisher)

Created: 2020

Characterizing the Performance of Machine Learning Algorithms: A Study and Novel Techniques

Description

Classification in machine learning is crucial to solving many of the problems the world faces today. It is therefore key to understand one’s problem and develop an efficient model to reach a solution. One technique that improves model selection, and thus eases problem solving, is estimation of the Bayes Error Rate. This paper develops and analyzes two methods for estimating the Bayes Error Rate on a given set of data in order to evaluate performance. The first method takes a “global” approach, looking at the data as a whole; the second is more “local”, partitioning the data at the outset and then building up to a Bayes Error estimate for the whole. One of the methods is found to estimate the true Bayes Error Rate accurately when the dataset has high dimension, while the other is accurate at large sample sizes. This second conclusion, in particular, can have significant ramifications for “big data” problems, since an accurate estimate of the Bayes Error Rate obtained with this method would help clarify the distribution.
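For readers unfamiliar with the quantity being estimated, the following minimal sketch computes the Bayes Error Rate in a case where it is known exactly: two equally likely classes with one-dimensional Gaussian features of equal variance. It does not implement either of the paper's two estimators; the function names and the toy parameters are assumptions. A Monte Carlo run of the optimal midpoint-threshold classifier is included as a sanity check.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayes_error_two_gaussians(mu0, mu1, sigma):
    """Exact Bayes error for equal priors and a shared standard deviation."""
    return normal_cdf(-abs(mu1 - mu0) / (2.0 * sigma))

def monte_carlo_error(mu0, mu1, sigma, n, rng):
    """Empirical error of the optimal rule: threshold at the midpoint of the means."""
    threshold = (mu0 + mu1) / 2.0
    errors = 0
    for _ in range(n):
        label = rng.randrange(2)
        x = rng.gauss(mu1 if label else mu0, sigma)
        # Predict the class whose mean lies on the same side of the threshold.
        pred = int(x > threshold) if mu1 > mu0 else int(x < threshold)
        errors += pred != label
    return errors / n

exact = bayes_error_two_gaussians(mu0=0.0, mu1=2.0, sigma=1.0)
estimate = monte_carlo_error(0.0, 2.0, 1.0, n=20000, rng=random.Random(0))
print(f"exact Bayes error: {exact:.4f}")
print(f"Monte Carlo check: {estimate:.4f}")
```

No classifier can achieve a lower error than this rate on such data, which is what makes an estimate of it useful as a benchmark for model selection.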

Contributors: Lattus, Robert (Author) / Dasarathy, Gautam (Thesis director) / Berisha, Visar (Committee member) / Turaga, Pavan (Committee member) / Barrett, The Honors College (Contributor) / Electrical Engineering Program (Contributor)

Created: 2021-12
