Discovering Partial-Value Associations and Applications

Fok, Ting Yan

Existing machine learning and data mining techniques have difficulty handling three characteristics of real-world data sets together in a computationally efficient way: (1) mixed data types, with both categorical and numeric data, (2) different variable relations in different value ranges of variables, and (3) unknown variable dependency. This dissertation develops a Partial-Value Association Discovery (PVAD) algorithm to overcome these drawbacks of existing techniques. The algorithm enables the discovery of partial-value and full-value variable associations, showing both the effects of individual variables and the interactive effects of multiple variables. For validation, the algorithm is compared with association rule mining and decision trees. The results show that the PVAD algorithm can overcome the shortcomings of these existing methods. The second part of this dissertation focuses on knee point detection on noisy data. This extended research topic grew out of the investigation into the categorization of numeric data, which corresponds to Step 1 of the PVAD algorithm. A new mathematical definition of a knee point on discrete data is introduced. Because no ground truth or benchmark data sets are available, the functions used to generate synthetic data are carefully selected and defined, and then used to create the data sets for the experiments. These synthetic data sets allow the performance of existing methods to be evaluated and compared systematically. Additionally, a deep-learning model is devised for this problem. Experiments show that the proposed model outperforms existing methods on all synthetic data sets, regardless of whether the samples have single or multiple knee points.
The third part presents the results of applying the PVAD algorithm to real-world data sets in various domains: energy consumption data from an Arizona State University (ASU) building, computer network data, and ASU engineering freshman retention data. The PVAD algorithm is used to create an associative network for energy consumption modeling, to analyze univariate and multivariate measures of network flow variables, and to identify common and uncommon characteristics related to the retention of engineering students after their first year at the university. The findings indicate that the PVAD algorithm is capable of uncovering variable relationships in these domains.

Copyright Statement