Matching Items (2)
Description
Random forest (RF) is a popular and powerful machine learning technique that can be used for classification, regression, and unsupervised clustering. In its original form, introduced by Leo Breiman, RF is a predictive model that generates predictions for new observations. Recent research has proposed several RF-based methods for feature selection and for generating prediction intervals, but these methods are limited in applicability and accuracy. In this dissertation, RF is applied to build a predictive model for a complex dataset and is used as the basis for two novel methods: one for biomarker discovery and one for generating prediction intervals.

First, a biodosimetry model is developed using RF to determine absorbed radiation dose from gene expression measured in blood samples of potentially exposed individuals. To improve its prediction accuracy, day-specific models were built to handle the day interaction effect, and a nested modeling technique was proposed. The nested models can fit these complex data, which show large variability and non-linear relationships.

Second, a panel of biomarkers was selected using a data-driven feature selection method combined with manual selection, taking prior knowledge and other constraints into account. To incorporate domain knowledge, a method called Know-GRRF was developed based on guided regularized RF. It incorporates domain knowledge as a penalty term that regulates the selection of candidate features in RF. This adds flexibility to data-driven feature selection and can improve model interpretability. Know-GRRF significantly improved cross-species prediction when cross-species correlation was used to guide biomarker selection. On simulated datasets, the method is also competitive with existing methods when intrinsic data characteristics are used in place of domain knowledge.
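The abstract does not give the Know-GRRF formula, so the following is only a minimal sketch of the general idea of penalizing split gain for features that have not yet been selected, with the penalty driven by a per-feature knowledge score. The function names, the `gamma` parameter, the linear form of the coefficient, and the gene names are all illustrative assumptions, not the dissertation's definitions.

```python
def penalty_coefficient(knowledge_score, gamma=0.5):
    """Map a knowledge score in [0, 1] to a coefficient in [1 - gamma, 1].

    Assumed linear form: a score of 1 means no penalty, a score of 0
    means the strongest penalty (1 - gamma).
    """
    if not 0.0 <= knowledge_score <= 1.0:
        raise ValueError("knowledge_score must lie in [0, 1]")
    return (1.0 - gamma) + gamma * knowledge_score


def regularized_gain(raw_gain, feature, selected, knowledge, gamma=0.5):
    """Return the split gain after penalizing features not yet selected."""
    if feature in selected:
        # Features already used elsewhere in the forest are not penalized.
        return raw_gain
    return penalty_coefficient(knowledge[feature], gamma) * raw_gain


def choose_split_feature(raw_gains, selected, knowledge, gamma=0.5):
    """Pick the feature with the largest regularized gain."""
    return max(
        raw_gains,
        key=lambda f: regularized_gain(raw_gains[f], f, selected, knowledge, gamma),
    )


# Hypothetical candidates: geneB has the larger raw gain but weak prior support.
raw_gains = {"geneA": 0.30, "geneB": 0.35}
knowledge = {"geneA": 1.0, "geneB": 0.2}

first_choice = choose_split_feature(raw_gains, set(), knowledge)
later_choice = choose_split_feature(raw_gains, {"geneB"}, knowledge)
```

In this toy case the penalty steers the first split toward `geneA`; once `geneB` is already in the selected set, its raw gain counts in full and it wins the comparison.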

Lastly, a novel non-parametric method, RFerr, was developed to generate prediction intervals using RF regression. The method is applicable to any predictive model and showed better coverage and precision than existing methods on the real-world radiation dataset as well as on benchmark and simulated datasets.
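The abstract does not describe how RFerr constructs its intervals. As a rough illustration of what a non-parametric prediction interval can look like, the sketch below uses a generic approach: shift a point prediction by empirical quantiles of errors (observed minus predicted) measured on observations the model did not train on. This is not RFerr itself; the function names and the toy error values are assumptions.

```python
import math


def empirical_quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, with q in [0, 1]."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = math.floor(position)
    high = math.ceil(position)
    fraction = position - low
    return ordered[low] * (1 - fraction) + ordered[high] * fraction


def residual_interval(point_prediction, held_out_errors, coverage=0.9):
    """Build an interval from quantiles of held-out errors (observed - predicted)."""
    alpha = 1 - coverage
    lower = point_prediction + empirical_quantile(held_out_errors, alpha / 2)
    upper = point_prediction + empirical_quantile(held_out_errors, 1 - alpha / 2)
    return lower, upper


# Toy held-out errors; a real application would use many more.
errors = [-2.0, -1.0, 0.0, 1.0, 2.0]
interval = residual_interval(10.0, errors, coverage=0.5)
```

Because the interval depends only on a point prediction and a set of held-out errors, a construction of this kind can wrap any predictive model, which is consistent with the applicability claim above.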
Contributors: Guan, Xin (Author) / Liu, Li (Thesis advisor) / Runger, George C. (Thesis advisor) / Dinu, Valentin (Committee member) / Arizona State University (Publisher)
Created: 2017
Description
The purpose of this investigation is to apply a machine learning algorithm to de-identified, historical oncology clinical trial data in order to assess the theoretical understanding of predictive modeling and derive potential clinical practice recommendations. In this study, electronic medical records from the HonorHealth Virginia G. Piper Institute undergo data visualization to identify correlations and trends critical for model creation and to identify potential expansions or limitations of scope regarding the model's purpose. The hypothesis pursued after data visualization was the development of a predictive model for 6-month survival. The current standard is physician estimation, with an estimated accuracy of 56.5% at 6 months out. This study created supervised learning models using decision trees, KNN, SVM, and ensemble methods, with combinations of LASSO logistic regression and Know-GRRF random forest used for feature selection. An SVM trained on a combined set of LASSO and Know-GRRF features produced the highest-performing model, at 75.5% accuracy with an AUC of 0.82. This study demonstrates the potential for applying predictive modeling to readily available EMR records to drive clinical practice recommendations. With further development, the models could be used as an ancillary tool for jumpstarting patient-physician conversations on survival and life expectancy.
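For readers unfamiliar with the two metrics quoted above, the following minimal sketch shows how accuracy and AUC are commonly computed for a binary outcome such as 6-month survival. The labels and scores are toy values, not data from the study, and the function names and the 0.5 decision threshold are illustrative assumptions.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for l, p in zip(labels, predictions) if l == p)
    return matches / len(labels)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = survived 6 months, 0 = did not.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
predictions = [1 if s >= 0.5 else 0 for s in scores]

toy_accuracy = accuracy(labels, predictions)
toy_auc = roc_auc(labels, scores)
```

Accuracy depends on the chosen decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the two figures are reported together.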
Contributors: Li, Richard Longfei (Co-author) / Liu, Li (Co-author, Thesis director) / Gosselin, Kevin (Co-author, Committee member) / Harrington Bioengineering Program (Contributor) / Barrett, The Honors College (Contributor)
Created: 2019-05