Matching Items (2)
171809-Thumbnail Image.png
Description

Data integration involves the reconciliation of data from diverse data sources in order to obtain a unified data repository, upon which an end user such as a data analyst can run analytics sessions to explore the data and obtain useful insights. Supervised Machine Learning (ML) for data integration tasks such

Data integration involves the reconciliation of data from diverse data sources in order to obtain a unified data repository, upon which an end user such as a data analyst can run analytics sessions to explore the data and obtain useful insights. Supervised Machine Learning (ML) for data integration tasks such as ontology (schema) or entity (instance) matching requires several training examples in terms of manually curated, pre-labeled matching and non-matching schema concept or entity pairs which are hard to obtain. On similar lines, an analytics system without predictive capabilities about the impending workload can incur huge querying latencies, while leaving the onus of understanding the underlying database schema and writing a meaningful query at every step during a data exploration session on the user. In this dissertation, I will describe the human-in-the-loop Machine Learning (ML) systems that I have built towards data integration and predictive analytics. I alleviate the need for extensive prior labeling by utilizing active learning (AL) for dataintegration. In each AL iteration, I detect the unlabeled entity or schema concept pairs that would strengthen the ML classifier and selectively query the human oracle for such labels in a budgeted fashion. Thus, I make use of human assistance for ML-based data integration. On the other hand, when the human is an end user exploring data through Online Analytical Processing (OLAP) queries, my goal is to pro-actively assist the human by predicting the top-K next queries that s/he is likely to be interested in. I will describe my proposed SQL-predictor, a Business Intelligence (BI) query predictor and a geospatial query cardinality estimator with an emphasis on schema abstraction, query representation and how I adapt the ML models for these tasks. For each system, I will discuss the evaluation metrics and how the proposed systems compare to the state-of-the-art baselines on multiple datasets and query workloads.

ContributorsMeduri, Venkata Vamsikrishna (Author) / Sarwat, Mohamed (Thesis advisor) / Bryan, Chris (Committee member) / Liu, Huan (Committee member) / Ozcan, Fatma (Committee member) / Popa, Lucian (Committee member) / Arizona State University (Publisher)
Created2022
161549-Thumbnail Image.png
Description
Traditionally, databases have been categorized as either row-oriented or column-oriented databases. Row-oriented databases store each row of the table’s data contiguously onto the disk whereas column-oriented databases store each column’s data contiguously onto the disk. In recent years, columnar database management systems are becoming increasingly popular because deep and narrow

Traditionally, databases have been categorized as either row-oriented or column-oriented databases. Row-oriented databases store each row of the table’s data contiguously onto the disk whereas column-oriented databases store each column’s data contiguously onto the disk. In recent years, columnar database management systems are becoming increasingly popular because deep and narrow queries are faster on them. Hence, column-oriented databases are highly optimized to be used with analytical (OLAP) workloads (Mike Freedman 2019). That is why they are very frequently used in business intelligence (BI), data warehouses, etc., which involve working with large data sets, intensive queries and aggregated computing. As the size of data keeps growing, efficient compression of data becomes an important consideration for these databases to optimize storage as well as improve query performance. Since column-oriented databases store data of the same data type contiguously, most modern compression techniques provide better compression ratios as compared to row-oriented databases. This thesis introduces a new compression technique called SA128 for column-oriented databases that performs a column-wise compression of database tables. SA128 is a multi-stage compression technique which performs a column-wise compression followed by a table-wide compression of database tables. In the first stage, SA128 performs an analysis based on the characteristics of data (such as data type and distribution) and determines which combination of lossless compression algorithms would result in the best compression ratio. In the second phase, SA128 uses an entropy encoding technique such as rANS (Duda, J., 2013) to further improve the compression ratio.
ContributorsAnand, Sukhpreet Singh (Author) / Bansal, Ajay (Thesis advisor) / Heinrichs, Robert R (Committee member) / Gonzalez-Sanchez, Javier (Committee member) / Arizona State University (Publisher)
Created2021