Time efficient and quality effective K nearest neighbor search in high dimension space

Yu, Renwei

K-nearest-neighbor (KNN) search is a fundamental problem in many application domains, including databases and data mining, information retrieval, machine learning, pattern recognition, and plagiarism detection. Locality-sensitive hashing (LSH) is so far the most practical approximate KNN search algorithm for high-dimensional data. Algorithms such as Multi-Probe LSH and LSH-Forest improve upon basic LSH by varying the hash bucket size dynamically at query time, so they can answer different KNN queries adaptively. However, both algorithms need a data access post-processing (DAPP) step after candidate collection to obtain the final answer to a KNN query. This thesis improves Multi-Probe LSH with DAPP and LSH-Forest with DAPP by replacing the costly DAPP step with a much faster histogram-based post-processing (HBPP) step. Two HBPP algorithms are presented: LSH-Forest with HBPP and Multi-Probe LSH with HBPP. Both achieve the three goals of KNN search on large-scale, high-dimensional data sets: high search quality, high time efficiency, and high space efficiency. None of the previous KNN algorithms achieves all three goals. More specifically, it is shown that the HBPP algorithms always achieve high search quality (as good as LSH-Forest with DAPP and Multi-Probe LSH with DAPP) at much lower time cost (a speedup of one to several orders of magnitude) and with the same memory usage. It is also shown that, at almost the same time cost and memory usage, the HBPP algorithms always achieve better search quality than LSH-Forest with random pick (LSH-Forest with RP) and Multi-Probe LSH with random pick (Multi-Probe LSH with RP).
Moreover, when very high search quality is required, Multi-Probe LSH with HBPP is always a better choice than LSH-Forest with HBPP, regardless of the distribution, size, and dimensionality of the data set.
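
The pipeline the abstract describes (collect candidates from hash buckets, then post-process them into a final top-K answer) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: it uses plain random-hyperplane LSH with fixed buckets rather than Multi-Probe LSH or LSH-Forest, and it stands in for HBPP by ranking candidates by how many hash tables they collide with the query in, since the abstract does not spell out the thesis's actual histogram construction. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def make_hash(dim, bits, rng):
    """Return a random-hyperplane hash: one sign bit per random direction."""
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

    def h(v):
        return tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)

    return h


def build_tables(data, num_tables, bits, rng):
    """Index every point id in one bucket per hash table."""
    dim = len(data[0])
    tables = []
    for _ in range(num_tables):
        h = make_hash(dim, bits, rng)
        buckets = defaultdict(list)
        for idx, v in enumerate(data):
            buckets[h(v)].append(idx)
        tables.append((h, buckets))
    return tables


def collect_candidates(tables, query):
    """Gather ids from the query's bucket in each table (duplicates kept)."""
    hits = []
    for h, buckets in tables:
        hits.extend(buckets.get(h(query), []))
    return hits


def dapp_top_k(data, hits, query, k):
    """Data access post-processing: read raw vectors and sort by true distance."""
    def dist2(i):
        return sum((a - b) ** 2 for a, b in zip(data[i], query))

    return sorted(set(hits), key=dist2)[:k]


def count_based_top_k(hits, k):
    """Assumed stand-in for HBPP: rank by collision count, no raw data access."""
    return [idx for idx, _ in Counter(hits).most_common(k)]


rng = random.Random(0)
data = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(200)]
tables = build_tables(data, num_tables=8, bits=6, rng=rng)
query = data[0]
hits = collect_candidates(tables, query)
print("distance-ranked:", dapp_top_k(data, hits, query, 3))
print("count-ranked:   ", count_based_top_k(hits, 3))
```

The distance-based step has to touch the original vectors of every candidate, while the count-based step works only on the ids already returned by the hash tables, which is the kind of cost difference the abstract attributes to replacing DAPP with HBPP.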

Details

Contributors

Yu, Renwei (Author)
Candan, Kasim S (Thesis advisor)
Sapino, Maria L (Committee member)
Chen, Yi (Committee member)
Sundaram, Hari (Committee member)
Arizona State University (Publisher)

Date Created

2011

Resource Type

Text

Collections this item is in

ASU Electronic Theses and Dissertations

Note

Partial requirement for: M.S., Arizona State University, 2011 (note type: thesis)
Includes bibliographical references (p. 68-71) (note type: bibliography)
Field of study: Computer science
