Matching Items (2)
Description
Multi-label learning, which deals with data associated with multiple labels simultaneously, is ubiquitous in real-world applications. To overcome the curse of dimensionality in multi-label learning, in this thesis I study multi-label dimensionality reduction, which extracts a small number of features by removing irrelevant, redundant, and noisy information while considering the correlation among different labels. Specifically, I propose Hypergraph Spectral Learning (HSL), which performs dimensionality reduction for multi-label data by exploiting correlations among different labels using a hypergraph. The regularization effect on the classical dimensionality reduction algorithm known as Canonical Correlation Analysis (CCA) is elucidated in this thesis, and the relationship between CCA and Orthonormalized Partial Least Squares (OPLS) is also investigated. To perform dimensionality reduction efficiently for large-scale problems, two efficient implementations are proposed for a class of dimensionality reduction algorithms, including canonical correlation analysis, orthonormalized partial least squares, linear discriminant analysis, and hypergraph spectral learning. The first is a direct least squares approach which allows the use of different regularization penalties but is applicable only under a certain assumption; the second is a two-stage approach which can be applied in the regularization setting without any assumption. Furthermore, an online implementation for the same class of dimensionality reduction algorithms is proposed for the setting where data arrive sequentially. A Matlab toolbox for multi-label dimensionality reduction has been developed and released. The proposed algorithms have been applied successfully to Drosophila gene expression pattern image annotation. Experimental results on benchmark data sets in multi-label learning also demonstrate the effectiveness and efficiency of the proposed algorithms.
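To make the regularized CCA mentioned in the abstract concrete, here is a minimal NumPy sketch of CCA with a ridge penalty on the covariance matrices. This is an illustrative implementation only, not the thesis's algorithm or its Matlab toolbox; the function name and the `reg` parameter are assumptions for illustration.

```python
import numpy as np

def cca(X, Y, k=1, reg=1e-6):
    """Regularized CCA via the standard eigenproblem (illustrative sketch).

    X: (n, d) data matrix, Y: (n, m) label matrix.
    Returns the top-k projection directions for X.
    The ridge term `reg` regularizes the sample covariances, mirroring
    the regularization effect discussed in the abstract.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Cxx = X.T @ X + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y
    # Solve Cxx^{-1} Cxy Cyy^{-1} Cyx w = rho^2 w for the leading directions
    M = np.linalg.solve(Cxx, Cxy) @ np.linalg.solve(Cyy, Cxy.T)
    vals, vecs = np.linalg.eig(M)
    order = np.argsort(-vals.real)
    return vecs[:, order[:k]].real
```

The thesis's least squares formulation recasts this same eigenproblem as a regression, which is what makes different regularization penalties and large-scale solvers applicable.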
Contributors: Sun, Liang (Author) / Ye, Jieping (Thesis advisor) / Li, Baoxin (Committee member) / Liu, Huan (Committee member) / Mittelmann, Hans D. (Committee member) / Arizona State University (Publisher)
Created: 2011
Description
This dissertation studies how forecasting performance can be improved in big data. The first chapter, with Seung C. Ahn, considers Partial Least Squares (PLS) estimation of a time-series forecasting model with data containing a large number of time series observations of many predictors. In the model, a subset or the whole set of the latent common factors in the predictors determines a target variable. First, the optimal number of PLS factors for forecasting could be smaller than the number of common factors relevant for the target variable. Second, as more than the optimal number of PLS factors is used, the out-of-sample explanatory power of the factors could decrease while their in-sample power may increase. Monte Carlo simulation results confirm these asymptotic results. In addition, simulation results indicate that the out-of-sample forecasting power of the PLS factors is often higher when fewer than the asymptotically optimal number of factors are used. Finally, the out-of-sample forecasting power of the PLS factors often decreases as the second, third, and subsequent factors are added, even if the asymptotically optimal number of factors is greater than one. The second chapter studies the predictive performance of various factor estimations comprehensively. A big data set consisting of major U.S. macroeconomic and finance variables is constructed, and 148 target variables are forecasted using 7 factor estimation methods with 11 information criteria. First, the number of factors used in forecasting is important, and incorporating more factors does not always provide better forecasting performance. Second, using a consistently estimated number of factors does not necessarily improve predictive performance; the first PLS factor, which is not theoretically consistent, very often shows strong forecasting performance. Third, there is a large difference in forecasting performance across different information criteria, even when the same factor estimation method is used. Therefore, the choice of factor estimation method, as well as the information criterion, is crucial in forecasting practice. Finally, the first PLS factor yields forecasting performance very close to the best result from the total combinations of the 7 factor estimation methods and 11 information criteria.
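To illustrate the PLS factors the abstract centers on, here is a minimal NIPALS-style sketch that extracts PLS factors of a predictor panel for a scalar target. This is an assumption-laden illustration, not the estimator used in the dissertation; the function name and deflation scheme are illustrative choices.

```python
import numpy as np

def pls_factors(X, y, k=1):
    """Extract k PLS factors of X for predicting a scalar target y (sketch).

    X: (T, N) matrix of predictors, y: (T,) target series.
    Each factor is the linear combination of predictors whose weights
    are proportional to the predictors' covariance with the target,
    followed by deflation of X and y.
    """
    X = X - X.mean(axis=0)
    y = y - y.mean()
    factors = []
    for _ in range(k):
        w = X.T @ y                      # weights: covariance with target
        w /= np.linalg.norm(w)
        f = X @ w                        # the PLS factor (score)
        factors.append(f)
        # deflate X and y by the extracted factor before the next pass
        p = X.T @ f / (f @ f)
        X = X - np.outer(f, p)
        y = y - f * (f @ y) / (f @ f)
    return np.column_stack(factors)
```

Because the first factor is built directly from covariances with the target, it often carries most of the out-of-sample forecasting power, consistent with the chapter's finding that adding second and later factors can hurt performance.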
Contributors: Bae, Juhui (Author) / Ahn, Seung (Thesis advisor) / Pruitt, Seth (Committee member) / Kuminoff, Nicolai (Committee member) / Ferraro, Domenico (Committee member) / Arizona State University (Publisher)
Created: 2021