Matching Items (6)

Minimizing Dataset Size Requirements for Machine Learning

Description

Machine learning methodologies are widely used in almost all aspects of software engineering. An effective machine learning model requires large amounts of data to achieve high accuracy. The data used for classification is mostly labeled, and labeled data is difficult to obtain: accurately labeling a dataset into different classes requires both high cost and effort. With an abundance of data, all of it must be labeled to be used properly, and this work focuses on reducing the labeling effort for large datasets. The thesis compares the performance of different classifiers to test whether a small set of labeled data can be used to build accurate models with a high prediction rate. The use of a small dataset for classification is then extended to an active machine learning methodology in which a one-class classifier first predicts the outliers in the data, and the outlier samples are then added to a training set for a support vector machine classifier that labels the unlabeled data. The labeling of datasets can thus be scaled up to avoid manual labeling and to build more robust machine learning methodologies.

Date Created
2017

Optimizing Performance Measures in Classification Using Ensemble Learning Methods

Description

Ensemble learning methods such as bagging, boosting, adaptive boosting, and stacking have traditionally shown promising results in improving predictive accuracy in classification. These techniques have recently been widely used in various domains and applications owing to improvements in computational efficiency and advances in distributed computing. However, with the advent of a wide variety of applications of machine learning techniques to class imbalance problems, further focus is needed to evaluate, improve, and optimize other performance measures such as sensitivity (true positive rate) and specificity (true negative rate) in classification. This thesis demonstrates a novel approach to evaluating and optimizing these performance measures (specifically sensitivity and specificity) using ensemble learning methods for classification, which can be especially useful on class-imbalanced datasets. In this thesis, ensemble learning methods (specifically bagging and boosting) are used to optimize sensitivity and specificity on the UC Irvine (UCI) 130 hospital diabetes dataset to predict whether a patient will be readmitted to the hospital based on various feature vectors. From the experiments conducted, it can be empirically concluded that, with ensemble learning methods, although accuracy improves by some margin, both sensitivity and specificity are optimized significantly and consistently over different cross-validation approaches. The implementation and evaluation were done on a subset of the large UCI 130 hospital diabetes dataset. The performance measures of the ensemble learners are compared to those of base machine learning classification algorithms such as Naive Bayes, Logistic Regression, k-Nearest Neighbor, Decision Trees, and Support Vector Machines.
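To illustrate why sensitivity and specificity matter on imbalanced data, the following is a minimal sketch of how the two measures are computed from binary predictions. The function name and the toy labels are assumptions made for illustration; they are not taken from the thesis or from the UCI dataset.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN)   (true positive rate)
    specificity = TN / (TN + FP)   (true negative rate)
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class that never occurs in y_true.
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Made-up imbalanced labels: 2 positives among 10 samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# A classifier that always predicts the majority class.
always_negative = [0] * 10
# A classifier that finds one positive and raises one false alarm.
mixed = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]

print(sensitivity_specificity(y_true, always_negative))
print(sensitivity_specificity(y_true, mixed))
```

In this toy case the always-negative classifier is 80% accurate yet has zero sensitivity, which is the kind of gap that accuracy alone hides.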

Date Created
2017

Graph Search as a Feature in Imperative/Procedural Programming Languages

Description

Graph theory is a critical component of computer science and software engineering, with algorithms for graph traversal and comprehension powering many of the largest problems in both industry and research. Engineers and researchers often have an accurate view of their target graph; however, they struggle to implement a correct and efficient search over that graph.

To facilitate rapid, correct, efficient, and intuitive development of graph-based solutions, we propose a new programming language construct: the search statement. Given a supra-root node, a procedure that determines the children of a given parent node, and optional definitions of the fail-fast acceptance or rejection of a solution, the search statement can conduct a search over any graph or network. Structurally, this statement is modelled after the common switch statement and is put into a largely imperative/procedural context to allow for immediate and intuitive development by most programmers. The Go programming language has been used as the foundation for a proof of concept of the search statement. A Go compiler that implements this construct is provided.
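The ingredients listed above (a root node, a children procedure, and optional accept and reject checks) can be sketched as an ordinary library function. This is only an illustration in Python, not the thesis's Go construct: the name `search`, its signature, and the choice of breadth-first order are all assumptions.

```python
from collections import deque


def search(root, children, accept, reject=None):
    """Breadth-first search starting at `root` (nodes must be hashable).

    children(node) -> iterable of child nodes
    accept(node)   -> True if node is a solution; the search stops at once
    reject(node)   -> True if node and everything below it should be skipped
    Returns the first accepted node, or None if none is found.
    """
    frontier = deque([root])
    seen = {root}
    while frontier:
        node = frontier.popleft()
        if reject is not None and reject(node):
            continue  # fail fast: do not expand this node
        if accept(node):
            return node
        for child in children(node):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return None


# Hypothetical usage: starting from 1, repeatedly double or add 3,
# and look for the value 11 without ever exceeding it.
found = search(
    root=1,
    children=lambda n: (n * 2, n + 3),
    accept=lambda n: n == 11,
    reject=lambda n: n > 11,
)
print(found)
```

The caller describes only the graph and the stopping conditions; the frontier and visited-set bookkeeping stay inside the function, which is roughly the division of labor the search statement aims for.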

Date Created
2018

3D - Patch Based Machine Learning Systems for Alzheimer’s Disease classification via 18F-FDG PET Analysis

Description

Alzheimer’s disease (AD) is a chronic neurodegenerative disease that usually starts slowly and gets worse over time. It is the cause of 60% to 70% of cases of dementia. There is growing interest in identifying brain image biomarkers that help evaluate AD risk pre-symptomatically. High-dimensional non-linear pattern classification methods have been applied to structural magnetic resonance images (MRIs) and used to discriminate between clinical groups in Alzheimer’s progression. Using fluorodeoxyglucose (FDG) positron emission tomography (PET) as the preferred imaging modality, this thesis develops two independent machine learning based patch analysis methods and uses them to perform six binary classification experiments across different AD diagnostic categories. Specifically, features were extracted and learned using dimensionality reduction and dictionary learning & sparse coding, by taking overlapping patches in and around the cerebral cortex and using them as features. With AdaBoost as the classifier, both methods try to utilize 18F-FDG PET as a biological marker in the early diagnosis of Alzheimer’s. Additionally, we investigate the involvement of rich demographic features (ApoeE3, ApoeE4 and Functional Activities Questionnaires (FAQ)) in classification. The experimental results on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset demonstrate the effectiveness of both proposed systems. The use of 18F-FDG PET may offer a new sensitive biomarker and enrich the brain imaging analysis toolset for studying the diagnosis and prognosis of AD.

Date Created
2017

Diversifying Relevant Search Results from Social Media Using Community Contributed Images

Description

The availability of affordable image and video capturing devices, together with the rapid development of social networking and content sharing websites, has led to the creation of a new type of content: social media. Any system serving an end user’s search query should not only return relevant images; the images also need to be divergent to give a well-rounded description of the query. As a result, the automated optimization of image retrieval results that are both relevant and divergent becomes exceedingly important.

The main focus of this thesis is to build a visual description of a landmark by choosing, from community-contributed datasets, the most diverse pictures that best describe all the details of the queried location. For this, an end-to-end framework has been built to retrieve relevant results that are also diverse. Different retrieval re-ranking and diversification strategies are evaluated to find a balance between relevance and diversification. Clustering techniques are employed to improve divergence. A unique fusion approach has been adopted to overcome the dilemma of selecting an appropriate clustering technique and its parameters for a given set of data. Extensive experiments have been conducted on the Flickr Div150Cred dataset, which has 30 different landmark locations. The results obtained are promising when evaluated on metrics for relevance and diversification.
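One common way to use clusters for diversification is to take the best-ranked image from each cluster in turn. The sketch below shows that round-robin idea only; it is an assumed strategy for illustration, not the fusion approach developed in the thesis, and the function name, image ids, and cluster labels are made up.

```python
from collections import defaultdict
from itertools import zip_longest

_MISSING = object()  # filler for clusters that run out of items


def diversify(ranked, cluster_of, k):
    """Re-rank `ranked` (ids in descending relevance) by cycling over clusters.

    cluster_of maps each id to a cluster label. Clusters are visited in the
    order of their most relevant member, and within a cluster the original
    relevance order is kept. Returns the first k ids of the new ranking.
    """
    buckets = defaultdict(list)  # keeps first-seen order of clusters
    for item in ranked:
        buckets[cluster_of[item]].append(item)

    result = []
    # Each round takes at most one item from every cluster.
    for round_items in zip_longest(*buckets.values(), fillvalue=_MISSING):
        for item in round_items:
            if item is not _MISSING:
                result.append(item)
    return result[:k]


# Made-up example: five images of one landmark, three visual clusters.
ranked = ["img_a", "img_b", "img_c", "img_d", "img_e"]
cluster_of = {"img_a": 0, "img_b": 0, "img_c": 1, "img_d": 0, "img_e": 2}

top3 = diversify(ranked, cluster_of, k=3)
print(top3)
```

Here the plain relevance ranking would return two images from cluster 0 in its top three, while the round-robin version returns one image from each cluster, trading a little relevance for coverage.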

Date Created
2020

Predicting Student Dropout in Self-Paced MOOC Course

Description

One persisting problem in Massive Open Online Courses (MOOCs) is student dropout. Predicting student dropout from MOOC courses can identify the factors responsible for it and can trigger intervention beforehand to increase student success in the MOOC. There are different approaches and various features available for predicting student dropout in MOOC courses. In this research, data from the self-paced math course ‘College Algebra and Problem Solving’, offered by Arizona State University (ASU) on the MOOC platform Open edX from 2016 to 2020, was considered. This research aims to predict the dropout of students from a MOOC course given a set of features engineered from the learning of students in a day. The machine learning (ML) model used is Random Forest (RF), and it is evaluated using validation metrics such as accuracy, precision, recall, F1-score, Area Under the Curve (AUC), and the Receiver Operating Characteristic (ROC) curve. The average rate of student learning progress was found to have more impact than other features. The model developed can predict the dropout or continuation of students on any given day in the MOOC course with an accuracy of 87.5%, AUC of 94.5%, precision of 88%, recall of 87.5%, and F1-score of 87.5%. The contributing features and interactions were explained using Shapley values for the prediction of the model. The features engineered in this research are predictive of student dropout and could be used to predict dropout in similar courses. This model can also help in making interventions at a critical time to help students succeed in this MOOC course.

Date Created
2021