Unsupervised Bayesian data cleaning techniques for structured data

De, Sushovan

Recent efforts in data cleaning have focused mostly on problems like data deduplication, record matching, and data standardization; few of these focus on fixing incorrect attribute values in tuples. Correcting values in tuples is typically performed by a minimum-cost repair of tuples that violate static constraints such as conditional functional dependencies (CFDs), which have to be provided by domain experts or learned from a clean sample of the database. In this thesis, I provide a method for correcting individual attribute values in a structured database using a Bayesian generative model and a statistical error model, both learned directly from the noisy database. I thus avoid the need for a domain expert or master data. I also show how to use this model to efficiently perform consistent query answering over a dirty database when write permissions to the database are unavailable. I further present a Map-Reduce architecture that performs this computation in a distributed manner. I evaluate these methods on both synthetic and real data.
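As a rough illustration of the idea summarized in the abstract, the sketch below picks, for each observed tuple, the candidate clean tuple that maximizes prior times error likelihood. It is a toy under stated assumptions and not the thesis's actual model: the prior is simply the empirical tuple frequency in the noisy table (standing in for a learned generative model), the error model is a fixed per-attribute substitution probability `eps`, and the function name `correct_tuple` and the sample rows are hypothetical.

```python
from collections import Counter

def correct_tuple(observed, rows, eps=0.1):
    """Return the candidate tuple c maximizing P(c) * P(observed | c).

    Assumptions (illustrative only):
      - P(c) is the relative frequency of c in the noisy table `rows`.
      - Each attribute is corrupted independently with probability `eps`.
      - Candidates are limited to tuples that already occur in `rows`.
    """
    counts = Counter(rows)
    total = len(rows)

    def score(candidate):
        prior = counts[candidate] / total
        likelihood = 1.0
        for obs_val, clean_val in zip(observed, candidate):
            # Attribute kept intact with prob. 1 - eps, altered with prob. eps.
            likelihood *= (1.0 - eps) if obs_val == clean_val else eps
        return prior * likelihood

    return max(counts, key=score)

# Hypothetical noisy table of (city, state) tuples with one suspicious row.
rows = (
    [("Phoenix", "AZ")] * 20
    + [("Tempe", "AZ")] * 10
    + [("Phoenix", "CA")]
)

cleaned = [correct_tuple(r, rows) for r in rows]
```

Here the rare tuple `("Phoenix", "CA")` is outweighed by the much more frequent `("Phoenix", "AZ")`, which differs in only one attribute, while the frequent tuples are left unchanged. No constraints or clean reference data are supplied; everything is estimated from the noisy rows themselves.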

Details

Title

Unsupervised Bayesian data cleaning techniques for structured data

Contributors

De, Sushovan (Author)
Kambhampati, Subbarao (Thesis advisor)
Chen, Yi (Committee member)
Candan, K. Selcuk (Committee member)
Liu, Huan (Committee member)
Arizona State University (Publisher)

Date Created

2014

Resource Type

Text

Collections this item is in

ASU Electronic Theses and Dissertations

Note

Partial requirement for: Ph.D., Arizona State University, 2014 (note type: thesis)
Includes bibliographical references (p. 87-90) (note type: bibliography)
Field of study: Computer science
