An information based optimal subdata selection algorithm for big data linear regression and a suitable variable selection algorithm

Zheng, Yi

This article proposes a new information-based optimal subdata selection (IBOSS) algorithm, the Squared Scaled Distance Algorithm (SSDA). It is based on the invariance of the determinant of the information matrix under orthogonal transformations, especially rotations. Extensive simulation results show that the new IBOSS algorithm retains the desirable asymptotic properties of IBOSS and gives a larger determinant of the subdata information matrix. It has the same order of time complexity as the D-optimal IBOSS algorithm. However, because it exploits vectorized calculation and avoids for loops, it is approximately 6 times as fast as the D-optimal IBOSS algorithm in R. The robustness of SSDA is studied from three aspects: nonorthogonality, the inclusion of interaction terms, and variable misspecification. A new, accurate variable selection algorithm is also proposed to support the use of IBOSS algorithms when a large number of variables are present but only a few of them are important. By aggregating results from random subsamples, this variable selection algorithm is much more accurate than the LASSO method applied to the full data. Since its time complexity depends only on the number of variables, it is also computationally efficient when the number of variables is fixed as n increases and is not massively large. More importantly, because it works on subsamples, it solves the problem that the full data cannot be held in memory when a data set is too large.
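The invariance property that the abstract cites as the basis of SSDA can be checked numerically. The sketch below is not the SSDA algorithm itself, whose details are not given here. It only illustrates, for an assumed linear model with an intercept and two covariates, that rotating the covariates leaves the determinant of the information matrix unchanged. All function names (`information_matrix`, `det3`, `rotate`), the simulated data, and the rotation angle are hypothetical choices made for illustration.

```python
import math
import random


def information_matrix(points):
    """Return Z^T Z for rows z = (1, x1, x2): a linear model with intercept."""
    m = [[0.0] * 3 for _ in range(3)]
    for x1, x2 in points:
        z = (1.0, x1, x2)
        for i in range(3):
            for j in range(3):
                m[i][j] += z[i] * z[j]
    return m


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def rotate(points, theta):
    """Rotate every covariate pair (x1, x2) by the angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x1 - s * x2, s * x1 + c * x2) for x1, x2 in points]


# Assumed toy data: 200 covariate pairs drawn from a standard normal.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

det_original = det3(information_matrix(points))
det_rotated = det3(information_matrix(rotate(points, 0.7)))

# The two determinants should agree up to floating-point error.
print(math.isclose(det_original, det_rotated, rel_tol=1e-9))
```

The check works because rotating the covariates multiplies the model matrix on the right by a block-diagonal matrix made of 1 and the rotation, whose determinant is 1, so the determinant of Z^T Z is unchanged.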

Details

Contributors

Zheng, Yi (Author)
Stufken, John (Thesis advisor)
Reiser, Mark R. (Committee member)
McCulloch, Robert (Committee member)
Arizona State University (Publisher)

Date Created

2017

Resource Type

Text

Collections this item is in

ASU Electronic Theses and Dissertations

Notes

Thesis: Partial requirement for: M.S., Arizona State University, 2017
Bibliography: Includes bibliographical references (page 40)
Field of study: Statistics
