Learning from asymmetric models and matched pairs

Description

With the increase in computing power and availability of data, there has never been a greater need to understand data and make decisions from it. Traditional statistical techniques may not be adequate to handle the size of today's data or the complexities of the information hidden within the data. Thus knowledge discovery by machine learning techniques is necessary if we want to better understand information from data. In this dissertation, we explore the topics of asymmetric loss and asymmetric data in machine learning and propose new algorithms as solutions to some of the problems in these topics. We also study variable selection for matched data sets and propose a solution for when there is non-linearity in the matched data. The research is divided into three parts. The first part addresses the problem of asymmetric loss. A proposed asymmetric support vector machine (aSVM) is used to predict specific classes with high accuracy. aSVM was shown to produce higher precision than a regular SVM. The second part addresses asymmetric data sets where variables are only predictive for a subset of the predictor classes. Asymmetric Random Forest (ARF) was proposed to detect these kinds of variables. The third part explores variable selection for matched data sets. Matched Random Forest (MRF) was proposed to find variables that are able to distinguish case and control without the restrictions that exist in linear models. MRF detects variables that are able to distinguish case and control even in the presence of interaction and qualitative variables.
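The abstract does not give the aSVM formulation, so the following is only a generic illustration of what an asymmetric loss can look like: a hinge loss in which margin violations on negative examples (potential false positives) cost more than violations on positive examples. The function name and the cost values `cost_fp` and `cost_fn` are hypothetical and are not taken from the dissertation.

```python
def asymmetric_hinge_loss(scores, labels, cost_fp=5.0, cost_fn=1.0):
    """Mean hinge loss with class-dependent costs (illustrative only).

    labels are +1 or -1; scores are real-valued decision values.
    A margin violation on a negative example is weighted by cost_fp,
    and a violation on a positive example by cost_fn.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        hinge = max(0.0, 1.0 - label * score)
        weight = cost_fn if label == 1 else cost_fp
        total += weight * hinge
    return total / len(scores)


# Hypothetical decision values and labels.
scores = [2.0, 0.5, -0.5, 0.5]
labels = [1, 1, -1, -1]

weighted = asymmetric_hinge_loss(scores, labels)
symmetric = asymmetric_hinge_loss(scores, labels, cost_fp=1.0, cost_fn=1.0)
print(weighted, symmetric)
```

Raising `cost_fp` relative to `cost_fn` pushes a classifier trained on this loss toward fewer false positives, which is one common route to higher precision on the class of interest.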

Contributors: Koh, Derek (Author) / Runger, George C. (Thesis advisor) / Wu, Tong (Committee member) / Pan, Rong (Committee member) / Cesta, John (Committee member) / Arizona State University (Publisher)

Created: 2013

It's not all about the music: digital goods, social media, and the pressure of peers

Description

Social media offers a powerful platform for the independent digital content producer community to develop, disperse, and maintain their brands. In information systems research, the broad majority of the work has not examined hedonic consumption on Social Media Sites (SMS). The focus has mostly been on organizational perspectives and utilitarian gains from these services. Unlike in traditional commerce channels, including e-commerce retailers, consumption that enhances hedonic utility is experienced differently in the context of a social media site; consequently, the dynamic of the decision-making process shifts when it is made in a social context. Previous research assumed a limited influence of a small, immediate group of peers. But the rules change when the network of peers expands exponentially. The assertion is that, while there are individual differences in the level of susceptibility to influence coming from others, these are not the most important pieces of the analysis, unlike in research centered completely on influence. Rather, the context of the consumption can play an important role in the way social influence factors affect consumer behavior on Social Media Sites. Over the course of three studies, this dissertation will examine factors that influence consumer decision-making and the brand personalities created and interpreted in these SMS. Study one examines the role of different types of peer influence on consumer decision-making on Facebook. Study two observes the impact of different types of producer message posts, combined with the different types of influence, on decision-making on Twitter. Study three will conclude this work with an exploratory empirical investigation of actual Twitter postings of a set of musicians.
These studies contribute to the body of IS literature by evaluating the specific behavioral changes related to consumption in the context of digital social media: (a) the power of social influencers in contrast to personal preferences on SMS, (b) the effect on consumers of producer message types and content on SMS at both the profile level and the individual message level.

Contributors: Sopha, Matthew (Author) / Santanam, Raghu T (Thesis advisor) / Goul, Kenneth M (Committee member) / Gu, Bin (Committee member) / Arizona State University (Publisher)

Created: 2013

Modeling time series data for supervised learning

Description

Temporal data are increasingly prevalent and important in analytics. Time series (TS) data are chronological sequences of observations and an important class of temporal data. Fields such as medicine, finance, learning science and multimedia naturally generate TS data. Each series provides a high-dimensional data vector that challenges the learning of the relevant patterns. This dissertation proposes TS representations and methods for supervised TS analysis. The approaches combine new representations that handle translations and dilations of patterns with bag-of-features strategies and tree-based ensemble learning. This provides flexibility in handling time-warped patterns in a computationally efficient way. The ensemble learners provide a classification framework that can handle high-dimensional feature spaces, multiple classes and interaction between features. The proposed representations are useful for classification and interpretation of TS data of varying complexity. The first contribution handles the problem of time warping with a feature-based approach. An interval selection and local feature extraction strategy is proposed to learn a bag-of-features representation. This is distinctly different from common similarity-based time warping, and it allows additional features (such as pattern location) to be easily integrated into the models. The learners have the capability to account for the temporal information through the recursive partitioning method. The second contribution focuses on the comprehensibility of the models. A new representation is integrated with local feature importance measures from tree-based ensembles to diagnose and interpret time intervals that are important to the model. Multivariate time series (MTS) are especially challenging because the input consists of a collection of TS, and both features within TS and interactions between TS can be important to models.
Another contribution uses a different representation to produce computationally efficient strategies that learn a symbolic representation for MTS. Relationships between the multiple TS, nominal and missing values are handled with tree-based learners. Applications such as speech recognition, medical diagnosis and gesture recognition are used to illustrate the methods. Experimental results show that the TS representations and methods provide better results than competitive methods on a comprehensive collection of benchmark datasets. Moreover, the proposed approaches naturally provide solutions to similarity analysis, predictive pattern discovery and feature selection.
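As a rough illustration of interval-based local feature extraction, the sketch below splits a series into contiguous intervals and summarizes each one by its start position, mean, least-squares slope, and variance. The equal-width intervals and this particular feature set are assumptions made for the example; the abstract does not specify how the dissertation selects intervals or which features it extracts, and the function names are hypothetical.

```python
from statistics import mean, pvariance


def slope(values):
    """Least-squares slope of values against their index 0..n-1."""
    n = len(values)
    if n < 2:
        return 0.0
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


def interval_features(series, n_intervals):
    """Return one (start, mean, slope, variance) tuple per interval.

    Intervals are contiguous and roughly equal in width (an assumption
    of this sketch). The start index acts as a simple location feature.
    """
    n = len(series)
    bounds = [round(k * n / n_intervals) for k in range(n_intervals + 1)]
    feats = []
    for start, end in zip(bounds, bounds[1:]):
        segment = series[start:end]
        if segment:
            feats.append((start, mean(segment), slope(segment), pvariance(segment)))
    return feats


# Hypothetical series: a ramp followed by a plateau.
series = [0, 1, 2, 3, 3, 3, 3, 3]
features = interval_features(series, 2)
for row in features:
    print(row)
```

Each tuple can be treated as one "word" in a bag-of-features representation and passed to a tree-based learner, which can split on the location feature as well as on the shape features.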

Contributors: Baydogan, Mustafa Gokce (Author) / Runger, George C. (Thesis advisor) / Atkinson, Robert (Committee member) / Gel, Esma (Committee member) / Pan, Rong (Committee member) / Arizona State University (Publisher)

Created: 2012

Machine Learning Models for High-dimensional Biomedical Data

Description

Recent technological advances enable the collection of complex, heterogeneous and high-dimensional data in biomedical domains. The increasing availability of high-dimensional biomedical data creates a need for new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover patterns and improve decision making. All the proposed methods can generalize to other industrial fields.

The first topic of this dissertation is data clustering, which is often the first step in analyzing a dataset without label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.

The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.

The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both accuracy and interpretation. GCRNN contains a convolutional network component to extract high-level features and a recurrent network component to enhance the modeling of temporal characteristics. A feed-forward fully connected network with sparse group lasso regularization is used to generate the final classification and provide good interpretability.
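For readers unfamiliar with sparse group lasso regularization, the sketch below computes the penalty in its commonly used form: an L1 term on all weights plus a group-wise L2 term scaled by the square root of the group size. How GCRNN groups its weights and sets the coefficients is not stated in the abstract; the grouping, `lam`, and `alpha` here are assumptions for illustration only.

```python
import math


def sparse_group_lasso_penalty(groups, lam=1.0, alpha=0.5):
    """Sparse group lasso penalty for weights partitioned into groups.

    groups: list of weight lists, one list per feature group.
    alpha blends the element-wise L1 term (alpha) with the
    group-wise L2 term (1 - alpha); lam scales the whole penalty.
    """
    l1_term = sum(abs(w) for group in groups for w in group)
    group_term = sum(
        math.sqrt(len(group)) * math.sqrt(sum(w * w for w in group))
        for group in groups
    )
    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)


# Hypothetical weights: one active group and one group driven to zero.
groups = [[3.0, 4.0], [0.0, 0.0]]
penalty = sparse_group_lasso_penalty(groups)
print(penalty)
```

The group term can drive whole groups of weights to zero, while the L1 term encourages sparsity within the groups that remain. That combination is what makes the surviving inputs easier to interpret.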

The last topic centers on dimensionality reduction methods for time series data. A good dimensionality reduction method is important for the storage, decision making and pattern visualization of time series data. The CRNN autoencoder is proposed not only to achieve low reconstruction error but also to generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.

Contributors: Lin, Sangdi (Author) / Runger, George C. (Thesis advisor) / Kocher, Jean-Pierre A (Committee member) / Pan, Rong (Committee member) / Escobedo, Adolfo R. (Committee member) / Arizona State University (Publisher)

Created: 2018

Understanding, Analyzing and Predicting Online User Behavior

Description

Due to the growing popularity of the Internet and smart mobile devices, massive amounts of data are produced every day; in particular, more and more of users' online behavior and activities are being digitized. Making better use of these data and better understanding user behavior have become central concerns for industry as well as academia. However, the large size and unstructured format of user behavioral data, together with the heterogeneous nature of individuals, make it harder to identify the SPECIFIC behavior that researchers are looking at, HOW to distinguish it, and WHAT results from it. Differences in user behavior arise from different causes; in my dissertation, I study three kinds of behavior that can have turbulent or detrimental effects, ranging from precursory culture to preparatory strategy and delusory fraudulence. I draw on a versatile analytical toolkit: econometrics and quasi-experiments, together with machine learning techniques such as text mining, sentiment analysis, and predictive analytics. This study combines these methodologies and applies them beyond individual-level data and network data. This dissertation takes a first step toward discovering user behavior in these newly emerging contexts. My study conceptualizes theoretically and tests empirically the effect of cultural values on rating, and I find that an individualist cultural background is more likely to lead to deviation and more expression in review behavior. I also find evidence of strategic behavior: users tend to leverage the reporting to increase the likelihood of maximizing their benefits. Moreover, the dissertation proposes features that moderate the preparation behavior. Finally, it introduces a unified and scalable framework for delusory behavior detection that meets the current need to fully utilize multiple data sources.

Contributors: Li, Chunxiao (Author) / Gu, Bin (Thesis advisor) / Chen, Pei-Yu (Committee member) / Xiong, Hui (Committee member) / Arizona State University (Publisher)

Created: 2019

A Study of the Efficiency of China's Commodity Futures Market from the Perspective of Resource Allocation

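Description

After 30 years of development, China's commodity futures market has begun to serve the functions of coordinating resource allocation and hedging operating risk. However, constrained by the development of both the underlying industries and the futures market itself, market efficiency varies considerably across futures products. As China's economy moves from a stage of incremental growth to a stage of competition over existing stock, futures are becoming increasingly important as a tool for enterprises' price management and risk control, so studying the efficiency of China's commodity futures market has great practical significance.

This dissertation takes a novel approach by starting from the basic function of futures, resource allocation, and proposes that an efficient market is one whose futures prices can reasonably allocate the social resources of the industry concerned. It first summarizes the characteristics of mature international futures products and identifies four hypothesized indicators that reflect the capacity of futures to allocate resources: futures-spot convergence, profit volatility, inventory volatility, and cash-flow change. It then uses mathematical models to demonstrate the association between the indicator data and product maturity, and finally applies this set of indicators to test the efficiency of China's commodity markets. Methodologically, the dissertation first applies the Bai-Perron endogenous multiple structural break model to detect break points in the time series, then runs multiple regressions separately on the segmented series. After removing seasonality and cyclicality, it uses the results of stationarity tests and ARCH-effect tests to determine the appropriate GARCH model, which is then used to describe the volatility of the time series.

Based on the mathematical validation, we conclude that futures-spot convergence, profit volatility, inventory volatility, and cash-flow change can serve as test indicators of futures maturity. Applying this method to some actively traded domestic products, we find that Dalian soybean meal futures already exhibit the characteristics of a mature product, and the dissertation considers the soybean meal futures market efficient. For PTA and corn starch futures, the four indicators have improved over time in recent years but, because the period is short, are not yet stable; these can be regarded as products approaching maturity. Rebar and aluminum futures perform poorly on most indicators, which indicates a weak capacity to allocate social resources, so the dissertation considers the rebar and aluminum futures markets active but not efficient. Further analysis suggests that poor futures-spot convergence is the key factor constraining a product's capacity to allocate resources, and that an unclearly specified underlying asset, the difficulty of creating warehouse receipts, low industry participation, and other limiting factors in contract design are in turn important causes of poor futures-spot convergence.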
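To make the volatility-modeling step concrete, the sketch below computes the conditional variance recursion of a GARCH(1,1) model, sigma2[t] = omega + alpha * eps[t-1]^2 + beta * sigma2[t-1]. The abstract does not state which GARCH order was selected, so the (1,1) order, the parameter values, and the residuals are assumptions for illustration and are not results from the dissertation.

```python
def garch11_variance(residuals, omega, alpha, beta, initial_variance):
    """Conditional variances of a GARCH(1,1) model.

    sigma2[0] is the supplied initial variance; for t >= 1,
    sigma2[t] = omega + alpha * residuals[t-1]**2 + beta * sigma2[t-1].
    Returns one variance per residual.
    """
    sigma2 = [initial_variance]
    for eps in residuals[:-1]:
        sigma2.append(omega + alpha * eps ** 2 + beta * sigma2[-1])
    return sigma2


# Hypothetical regression residuals and parameters.
residuals = [1.0, -2.0, 0.5]
variances = garch11_variance(
    residuals, omega=0.1, alpha=0.2, beta=0.7, initial_variance=1.0
)
print(variances)
```

In practice the parameters would be estimated by maximum likelihood on the deseasonalized residuals rather than fixed by hand. A large shock, such as the second residual here, raises the next period's variance, which is the volatility clustering a GARCH model is meant to capture.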

Contributors: Wang, Ping (Author) / Gu, Bin (Thesis advisor) / Li, Feng (Thesis advisor) / Yan, Hong (Committee member) / Arizona State University (Publisher)

Created: 2019

Development of Complementary Fresh-Food Systems Through the Exploration and Identification of Profit-Maximizing, Supply Chains

Description

One of the greatest 21st-century challenges is meeting the needs of a growing world population, which is expected to increase 35% by 2050 given projected trends in diets, consumption and income. This in turn requires a 70-100% increase in current production capability, even as the world is undergoing systemic climate pattern changes. This growth not only translates to higher demand for staple products, such as rice, wheat, and beans, but also creates demand for high-value products such as fresh fruits and vegetables (FVs), fueled by better economic conditions and a more health-conscious consumer. These trends would seem to present opportunities for the economic development of regions that are environmentally well suited to producing high-value products. Interestingly, many regions with production potential still exhibit a considerable gap between their current and 'true' maximum capability, especially in places where poverty is more common. Paradoxically, high-value horticultural products could often be produced in these regions if relatively small capital investments were made and proper marketing and distribution channels were created. The hypothesis is that small farmers within local agricultural systems are well positioned to take advantage of existing sustainable and profitable opportunities, specifically in high-value agricultural production. Unearthing these opportunities can entice investment in small-farm development and help small farmers enter the horticultural industry, thus expanding the volume, variety and/or quality of products available for global consumption.
In this dissertation, the objective is three-fold: (1) to demonstrate the hidden production potential that exists within local agricultural communities, (2) to highlight the importance of supply chain modeling tools in the strategic design of local agricultural systems, and (3) to demonstrate the application of optimization and machine learning techniques to strategize the implementation of protective agricultural technologies.

As part of this dissertation, a yield approximation method is developed and integrated with a mixed-integer program to estimate a region’s potential to produce non-perennial, vegetable items. This integration offers practical approximations that help decision-makers identify technologies needed to protect agricultural production, alter harvesting patterns to better match market behavior, and provide an analytical framework through which external investment entities can assess different production options.

Contributors: Flores, Hector M. (Author) / Villalobos, Rene (Thesis advisor) / Pan, Rong (Committee member) / Wu, Teresa (Committee member) / Parker, Nathan (Committee member) / Arizona State University (Publisher)

Created: 2017