Search Content

A model fusion based framework for imbalanced classification problem with noisy dataset

Description

Data imbalance and data noise often coexist in real world datasets. Data imbalance affects the learning classifier by degrading the recognition power of the classifier on the minority class, while data noise affects the learning classifier by providing inaccurate information and thus misleads the classifier. Because of these differences, data…

Data imbalance and data noise often coexist in real world datasets. Data imbalance affects the learning classifier by degrading the recognition power of the classifier on the minority class, while data noise affects the learning classifier by providing inaccurate information and thus misleads the classifier. Because of these differences, data imbalance and data noise have been treated separately in the data mining field. Yet, such approach ignores the mutual effects and as a result may lead to new problems. A desirable solution is to tackle these two issues jointly. Noting the complementary nature of generative and discriminative models, this research proposes a unified model fusion based framework to handle the imbalanced classification with noisy dataset.

The phase I study focuses on the imbalanced classification problem. A generative classifier, Gaussian Mixture Model (GMM) is studied which can learn the distribution of the imbalance data to improve the discrimination power on imbalanced classes. By fusing this knowledge into cost SVM (cSVM), a CSG method is proposed. Experimental results show the effectiveness of CSG in dealing with imbalanced classification problems.

The phase II study expands the research scope to include the noisy dataset into the imbalanced classification problem. A model fusion based framework, K Nearest Gaussian (KNG) is proposed. KNG employs a generative modeling method, GMM, to model the training data as Gaussian mixtures and form adjustable confidence regions which are less sensitive to data imbalance and noise. Motivated by the K-nearest neighbor algorithm, the neighboring Gaussians are used to classify the testing instances. Experimental results show KNG method greatly outperforms traditional classification methods in dealing with imbalanced classification problems with noisy dataset.

The phase III study addresses the issues of feature selection and parameter tuning of KNG algorithm. To further improve the performance of KNG algorithm, a Particle Swarm Optimization based method (PSO-KNG) is proposed. PSO-KNG formulates model parameters and data features into the same particle vector and thus can search the best feature and parameter combination jointly. The experimental results show that PSO can greatly improve the performance of KNG with better accuracy and much lower computational cost.

ContributorsHe, Miao (Author) / Wu, Teresa (Thesis advisor) / Li, Jing (Committee member) / Silva, Alvin (Committee member) / Borror, Connie (Committee member) / Arizona State University (Publisher)

Created2014

Applying Industrial Engineering to Optimize Swim Stroke Economy

Description

The U.S. Navy and other amphibious military organizations utilize a derivation of the traditional side stroke called the Combat Side Stroke, or CSS, and tout it as the most efficient technique available. Citing its low aerobic requirements and slow yet powerful movements as superior to the traditionally-best front crawl (freestyle),…

The U.S. Navy and other amphibious military organizations utilize a derivation of the traditional side stroke called the Combat Side Stroke, or CSS, and tout it as the most efficient technique available. Citing its low aerobic requirements and slow yet powerful movements as superior to the traditionally-best front crawl (freestyle), the CSS is the go-to stroke for any operation in the water. The purpose of this thesis is to apply principles of Industrial Engineering to a real-world situation not typically approached from a perspective of optimization. I will analyze pre-existing data about various swim strokes in order to compare them in terms of efficiency for different variables. These variables include calories burned, speed, and strokes per unit distance, as well as their interactions. Calories will be measured by heart rate monitors, converting BPM to calories burned. Speed will be measured by stopwatch and observer. Strokes per unit distance will be measured by observer. The strokes to be analyzed include the breast stroke, crawl stroke, butterfly, and combat side stroke. The goal is to informally test the U.S. Navy's claim that the combat side stroke is the optimum stroke to conserve energy while covering distance. Because of limitations in the scope of the project, analysis will be done using data collected from literary sources rather than through experimentation. This thesis will include a design of experiment to test the findings here in practical study. The main method of analysis will be linear programming, followed by hypothesis testing, culminating in a design of experiment for future progress on this topic.

ContributorsGoodsell, Kevin Lewis (Author) / McCarville, Daniel R. (Thesis director) / Kashiwagi, Jacob (Committee member) / Industrial, Systems (Contributor) / Barrett, The Honors College (Contributor)

Created2014-12

Intervention Strategies for the DoD Acquisition Process Using Simulation

Description

The current Enterprise Requirements and Acquisition Model (ERAM), a discrete event simulation of the major tasks and decisions within the DoD acquisition system, identifies several what-if intervention strategies to improve program completion time. However, processes that contribute to the program acquisition completion time were not explicitly identified in the simulation…

The current Enterprise Requirements and Acquisition Model (ERAM), a discrete event simulation of the major tasks and decisions within the DoD acquisition system, identifies several what-if intervention strategies to improve program completion time. However, processes that contribute to the program acquisition completion time were not explicitly identified in the simulation study. This research seeks to determine the acquisition processes that contribute significantly to total simulated program time in the acquisition system for all programs reaching Milestone C. Specifically, this research examines the effect of increased scope management, technology maturity, and decreased variation and mean process times in post-Design Readiness Review contractor activities by performing additional simulation analyses. Potential policies are formulated from the results to further improve program acquisition completion time.

ContributorsWorger, Danielle Marie (Author) / Wu, Teresa (Thesis director) / Shunk, Dan (Committee member) / Wirthlin, J. Robert (Committee member) / Industrial, Systems (Contributor) / Barrett, The Honors College (Contributor)

Created2013-05

Data and Predictive Analytics for Energy Use

Description

The overall energy consumption around the United States has not been reduced even with the advancement of technology over the past decades. Deficiencies exist between design and actual energy performances. Energy Infrastructure Systems (EIS) are impacted when the amount of energy production cannot be accurately and efficiently forecasted. Inaccurate engineering…

The overall energy consumption around the United States has not been reduced even with the advancement of technology over the past decades. Deficiencies exist between design and actual energy performances. Energy Infrastructure Systems (EIS) are impacted when the amount of energy production cannot be accurately and efficiently forecasted. Inaccurate engineering assumptions can result when there is a lack of understanding on how energy systems can operate in real-world applications. Energy systems are complex, which results in unknown system behaviors, due to an unknown structural system model. Currently, there exists a lack of data mining techniques in reverse engineering, which are needed to develop efficient structural system models. In this project, a new type of reverse engineering algorithm has been applied to a year's worth of energy data collected from an ASU research building called MacroTechnology Works, to identify the structural system model. Developing and understanding structural system models is the first step in creating accurate predictive analytics for energy production. The associative network of the building's data will be highlighted to accurately depict the structural model. This structural model will enhance energy infrastructure systems' energy efficiency, reduce energy waste, and narrow the gaps between energy infrastructure design, planning, operation and management (DPOM).

ContributorsCamarena, Raquel Jimenez (Author) / Chong, Oswald (Thesis director) / Ye, Nong (Committee member) / Industrial, Systems (Contributor) / Barrett, The Honors College (Contributor)

Created2016-12

Statistical Analysis of Power Differences between Experimental Design Software Packages

Description

Based on findings of previous studies, there was speculation that two well-known experimental design software packages, JMP and Design Expert, produced varying power outputs given the same design and user inputs. For context and scope, another popular experimental design software package, Minitab® Statistical Software version 17, was added to the…

Based on findings of previous studies, there was speculation that two well-known experimental design software packages, JMP and Design Expert, produced varying power outputs given the same design and user inputs. For context and scope, another popular experimental design software package, Minitab® Statistical Software version 17, was added to the comparison. The study compared multiple test cases run on the three software packages with a focus on 2k and 3K factorial design and adjusting the standard deviation effect size, number of categorical factors, levels, number of factors, and replicates. All six cases were run on all three programs and were attempted to be run at one, two, and three replicates each. There was an issue at the one replicate stage, however—Minitab does not allow for only one replicate full factorial designs and Design Expert will not provide power outputs for only one replicate unless there are three or more factors. From the analysis of these results, it was concluded that the differences between JMP 13 and Design Expert 10 were well within the margin of error and likely caused by rounding. The differences between JMP 13, Design Expert 10, and Minitab 17 on the other hand indicated a fundamental difference in the way Minitab addressed power calculation compared to the latest versions of JMP and Design Expert. This was found to be likely a cause of Minitab’s dummy variable coding as its default instead of the orthogonal coding default of the other two. Although dummy variable and orthogonal coding for factorial designs do not show a difference in results, the methods affect the overall power calculations. All three programs can be adjusted to use either method of coding, but the exact instructions for how are difficult to find and thus a follow-up guide on changing the coding for factorial variables would improve this issue.

ContributorsArmstrong, Julia Robin (Author) / McCarville, Daniel R. (Thesis director) / Montgomery, Douglas (Committee member) / Industrial, Systems (Contributor, Contributor) / Barrett, The Honors College (Contributor)

Created2017-05

Modelling Megacities: An Approach to Modelling Dense Urban Area

Description

In 2010, for the first time in human history, more than half of the world's total population lived in cities; this number is expected to increase to 60% or more by 2050. The goal of this research effort is to create a comprehensive model and modelling framework for megacities, middleweight…

In 2010, for the first time in human history, more than half of the world's total population lived in cities; this number is expected to increase to 60% or more by 2050. The goal of this research effort is to create a comprehensive model and modelling framework for megacities, middleweight cities, and urban agglomerations, collectively referred to as dense urban areas. The motivation for this project comes from the United States Army's desire for readiness in all operating environments including dense urban areas. Though there is valuable insight in research to support Army operational behaviors, megacities are of unique interest to nearly every societal sector imaginable. A novel application for determining both main effects and interactive effects between factors within a dense urban area is a Design of Experiments- providing insight on factor causations. Regression Modelling can also be employed for analysis of dense urban areas, providing wide ranging insights into correlations between factors and their interactions. Past studies involving megacities concern themselves with general trend of cities and their operation. This study is unique in its efforts to model a singular megacity to enable decision support for military operational planning, as well as potential decision support to city planners to increase the sustainability of these dense urban areas and megacities.

ContributorsMathesen, Logan Michael (Author) / Zenzen, Frances (Thesis director) / Jennings, Cheryl (Committee member) / Industrial, Systems (Contributor) / Barrett, The Honors College (Contributor)

Created2016-05

A Data Mining Approach to Modeling Customer Preference: A Case Study of Intel Corporation

Description

Understanding customer preference is crucial for new product planning and marketing decisions. This thesis explores how historical data can be leveraged to understand and predict customer preference. This thesis presents a decision support framework that provides a holistic view on customer preference by following a two-phase procedure. Phase-1 uses cluster…

Understanding customer preference is crucial for new product planning and marketing decisions. This thesis explores how historical data can be leveraged to understand and predict customer preference. This thesis presents a decision support framework that provides a holistic view on customer preference by following a two-phase procedure. Phase-1 uses cluster analysis to create product profiles based on which customer profiles are derived. Phase-2 then delves deep into each of the customer profiles and investigates causality behind their preference using Bayesian networks. This thesis illustrates the working of the framework using the case of Intel Corporation, world’s largest semiconductor manufacturing company.

ContributorsRam, Sudarshan Venkat (Author) / Kempf, Karl G. (Thesis advisor) / Wu, Teresa (Thesis advisor) / Ju, Feng (Committee member) / Arizona State University (Publisher)

Created2017

A Disease Progression Modeling Framework for Nonalcoholic Steatohepatitis Using Multiparametric Serial Magnetic Resonance Imaging and Elastography

Description

Nonalcoholic Steatohepatitis (NASH) is a severe form of Nonalcoholic fatty liverdisease, that is caused due to excessive calorie intake, sedentary lifestyle and in the absence of severe alcohol consumption. It is widely prevalent in the United States and in many other developed countries, affecting up to 25 percent of the population. Due to…

Nonalcoholic Steatohepatitis (NASH) is a severe form of Nonalcoholic fatty liverdisease, that is caused due to excessive calorie intake, sedentary lifestyle and in the absence of severe alcohol consumption. It is widely prevalent in the United States and in many other developed countries, affecting up to 25 percent of the population. Due to being asymptotic, it usually goes unnoticed and may lead to liver failure if not treated at the right time. Currently, liver biopsy is the gold standard to diagnose NASH, but being an invasive procedure, it comes with it's own complications along with the inconvenience of sampling repeated measurements over a period of time. Hence, noninvasive procedures to assess NASH are urgently required. Magnetic Resonance Elastography (MRE) based Shear Stiffness and Loss Modulus along with Magnetic Resonance Imaging based proton density fat fraction have been successfully combined to predict NASH stages However, their role in the prediction of disease progression still remains to be investigated. This thesis thus looks into combining features from serial MRE observations to develop statistical models to predict NASH progression. It utilizes data from an experiment conducted on male mice to develop progressive and regressive NASH and trains ordinal models, ordered probit regression and ordinal forest on labels generated from a logistic regression model. The models are assessed on histological data collected at the end point of the experiment. The models developed provide a framework to utilize a non-invasive tool to predict NASH disease progression.

ContributorsDeshpande, Eeshan (Author) / Ju, Feng (Thesis advisor) / Wu, Teresa (Committee member) / Yan, Hao (Committee member) / Arizona State University (Publisher)

Created2021

Filtering by