Search Content

Matching Items (3)

Filtering by

All Subjects: Entity Matching
Creators: Sarwat, Mohamed
Member of: ASU Electronic Theses and Dissertations
Status: Published

Human-in-the-Loop Machine Learning Systems for Data Integration and Predictive Analytics

Description

Data integration involves the reconciliation of data from diverse data sources in order to obtain a unified data repository, upon which an end user such as a data analyst can run analytics sessions to explore the data and obtain useful insights. Supervised Machine Learning (ML) for data integration tasks such as ontology (schema) or entity (instance) matching requires several training examples in terms of manually curated, pre-labeled matching and non-matching schema concept or entity pairs which are hard to obtain. On similar lines, an analytics system without predictive capabilities about the impending workload can incur huge querying latencies, while leaving the onus of understanding the underlying database schema and writing a meaningful query at every step during a data exploration session on the user. In this dissertation, I will describe the human-in-the-loop Machine Learning (ML) systems that I have built towards data integration and predictive analytics. I alleviate the need for extensive prior labeling by utilizing active learning (AL) for dataintegration. In each AL iteration, I detect the unlabeled entity or schema concept pairs that would strengthen the ML classifier and selectively query the human oracle for such labels in a budgeted fashion. Thus, I make use of human assistance for ML-based data integration. On the other hand, when the human is an end user exploring data through Online Analytical Processing (OLAP) queries, my goal is to pro-actively assist the human by predicting the top-K next queries that s/he is likely to be interested in. I will describe my proposed SQL-predictor, a Business Intelligence (BI) query predictor and a geospatial query cardinality estimator with an emphasis on schema abstraction, query representation and how I adapt the ML models for these tasks. For each system, I will discuss the evaluation metrics and how the proposed systems compare to the state-of-the-art baselines on multiple datasets and query workloads.

ContributorsMeduri, Venkata Vamsikrishna (Author) / Sarwat, Mohamed (Thesis advisor) / Bryan, Chris (Committee member) / Liu, Huan (Committee member) / Ozcan, Fatma (Committee member) / Popa, Lucian (Committee member) / Arizona State University (Publisher)

Created2022

A Framework for Interactive Geospatial Map Cleaning using GPS Trajectories

Description

A volunteered geographic information system, e.g., OpenStreetMap (OSM), collects data from volunteers to generate geospatial maps. To keep the map consistent, volunteers are expected to perform the tedious task of updating the underlying geospatial data at regular intervals. Such a map curation step takes time and considerable human effort. In this thesis, we propose a framework that improves the process of updating geospatial maps by automatically identifying road changes from user-generated GPS traces. Since GPS traces can be sparse and noisy, the proposed framework validates the map changes with the users before propagating them to a publishable version of the map. The proposed framework achieves up to four times faster map matching performance than the state-of-the-art algorithms with only 0.1-0.3% accuracy loss.

ContributorsVementala, Nikhil (Author) / Papotti, Paolo (Thesis advisor) / Sarwat, Mohamed (Thesis advisor) / Kasim, Selçuk Candan (Committee member) / Arizona State University (Publisher)

Created2017

GEM: An Efficient Entity Matching Framework for Geospatial Data

Description

The use of spatial data has become very fundamental in today's world. Ranging from fitness trackers to food delivery services, almost all application records users' location information and require clean geospatial data to enhance various application features. As spatial data flows in from heterogeneous sources various problems arise. The study of entity matching has been a fervent step in the process of producing clean usable data. Entity matching is an amalgamation of various sub-processes including blocking and matching. At the end of an entity matching pipeline, we get deduplicated records of the same real-world entity. Identifying various mentions of the same real-world locations is known as spatial entity matching. While entity matching received significant interest in the field of relational entity matching, the same cannot be said about spatial entity matching. In this dissertation, I build an end-to-end Geospatial Entity Matching framework, GEM, exploring spatial entity matching from a novel perspective. In the current state-of-the-art systems spatial entity matching is only done on one type of geometrical data variant. Instead of confining to matching spatial entities of only point geometry type, I work on extending the boundaries of spatial entity matching to match the more generic polygon geometry entities as well. I propose a methodology to provide support for three entity matching scenarios across different geometrical data types: point X point, point X polygon, polygon X polygon. As mentioned above entity matching consists of various steps but blocking, feature vector creation, and classification are the core steps of the system. GEM comprises an efficient and lightweight blocking technique, GeoPrune, that uses the geohash encoding mechanism to prune away the obvious non-matching spatial entities. Geohashing is a technique to convert a point location coordinates to an alphanumeric code string. This technique proves to be very effective and swift for the blocking mechanism. I leverage the Apache Sedona engine to create the feature vectors. Apache Sedona is a spatial database management system that holds the capacity of processing spatial SQL queries with multiple geometry types without compromising on their original coordinate vector representation. In this step, I re-purpose the spatial proximity operators (SQL queries) in Apache Sedona to create spatial feature dimensions that capture the proximity between a geospatial entity pair. The last step of an entity matching process is matching or classification. The classification step in GEM is a pluggable component, which consumes the feature vector for a spatial entity pair and determines whether the geolocations match or not. The component provides 3 machine learning models that consume the same feature vector and provide a label for the test data based on the training. I conduct experiments with the three classifiers upon multiple large-scale geospatial datasets consisting of both spatial and relational attributes. Data considered for experiments arrives from heterogeneous sources and we pre-align its schema manually. GEM achieves an F-measure of 1.0 for a point X point dataset with 176k total pairs, which is 42% higher than a state-of-the-art spatial EM baseline. It achieves F-measures of 0.966 and 0.993 for the point X polygon dataset with 302M total pairs, and the polygon X polygon dataset with 16M total pairs respectively.

ContributorsShah, Setu Nilesh (Author) / Sarwat, Mohamed (Thesis advisor) / Pedrielli, Giulia (Committee member) / Boscovic, Dragan (Committee member) / Arizona State University (Publisher)

Created2021