Full metadata
Title
Shuffle Overhead Analysis for the Layered Data Abstractions
Description
Apache Spark is one of the most widely adopted open-source Big Data processing engines. High performance and ease of use for a wide class of users are some of the primary reasons for the wide adoption. Although data partitioning increases the performance of the analytics workload, its application to Apache Spark is very limited due to layered data abstractions. Once data is written to a stable storage system like Hadoop Distributed File System (HDFS), the data locality information is lost, and while reading the data back into Spark’s in-memory layer, the reading process is random which incurs shuffle overhead. This report investigates the use of metadata information that is stored along with the data itself for reducing shuffle overload in the join-based workloads. It explores the Hyperspace library to mitigate the shuffle overhead for Spark SQL applications. The article also introduces the Lachesis system to solve the shuffle overhead problem. The benchmark results show that the persistent partition and co-location techniques can be beneficial for matrix multiplication using SQL (Structured Query Language) operator along with the TPC-H analytical queries benchmark. The study concludes with a discussion about the trade-offs of using integrated stable storage to layered storage abstractions. It also discusses the feasibility of integration of the Machine Learning (ML) inference phase with the SQL operators along with cross-engine compatibility for employing data locality information.
Date Created
2021
Contributors
- Barhate, Pratik Narhar (Author)
- Zou, Jia (Thesis advisor)
- Zhao, Ming (Committee member)
- Elsayed, Mohamed Sarwat (Committee member)
- Arizona State University (Publisher)
Topical Subject
Resource Type
Extent
52 pages
Language
Copyright Statement
In Copyright
Primary Member of
Peer-reviewed
No
Open Access
No
Handle
https://hdl.handle.net/2286/R.2.N.161458
Level of coding
minimal
Cataloging Standards
Note
Partial requirement for: M.S., Arizona State University, 2021
Field of study: Computer Science
System Created
- 2021-11-16 01:16:15
System Modified
- 2021-11-30 12:51:28
- 2 years 5 months ago
Additional Formats