Multi-query Optimization in Spark

Supervisor: Iman Elghandour
Semester: Fall 2019

Distributed computing platforms such as Hadoop and Spark focus on addressing the following challenges in large systems: (1) latency, (2) scalability, and (3) fault tolerance. Dedicating computing resources to each application executed by Spark can waste resources. Unified distributed file systems such as Alluxio provide a platform for sharing computed results among simultaneously running applications. However, it is up to the developers to decide what to share. The objective of this master thesis is to optimize the execution plans of applications running on a Spark platform by autonomously finding sharing opportunities, namely identifying the RDDs that can be shared among these applications, and computing the shared parts of the plans once instead of separately for each query.
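
To make the notion of a sharing opportunity concrete, the following is a minimal sketch in plain Python that uses no Spark API. It assumes, purely for illustration, that each query's execution plan is a linear chain of named operations; real Spark plans are DAGs, and the query and operation names below are hypothetical. A plan prefix needed by two or more queries stands in for an RDD that could be computed once and shared.

```python
from collections import defaultdict


def shared_prefixes(plans):
    """Map every plan prefix used by at least two queries to those queries.

    `plans` maps a query name to its list of operation names (an assumed,
    simplified stand-in for an RDD lineage).
    """
    users = defaultdict(set)
    for name, ops in plans.items():
        for end in range(1, len(ops) + 1):
            users[tuple(ops[:end])].add(name)
    return {prefix: names for prefix, names in users.items() if len(names) > 1}


def longest_shared_prefix(plans, query):
    """Return the longest prefix of `query`'s plan that another query also needs."""
    shared = shared_prefixes(plans)
    ops = plans[query]
    for end in range(len(ops), 0, -1):
        prefix = tuple(ops[:end])
        if prefix in shared:
            return prefix
    return ()


# Hypothetical workload: q1 and q2 overlap on their first two steps.
plans = {
    "q1": ["read_logs", "filter_errors", "count_by_host"],
    "q2": ["read_logs", "filter_errors", "top_messages"],
    "q3": ["read_users", "join_orders"],
}

candidate = longest_shared_prefix(plans, "q1")
```

Here `candidate` is the prefix that q1 and q2 have in common, so in this toy model it would be evaluated once and reused by both queries. Whether sharing a given intermediate result actually pays off depends on its computation and storage costs, which is the kind of decision a performance model would have to inform.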

Deliverables of the master thesis project
  • An overview of the Apache Spark architecture.
  • A performance model for queries executed by Spark.
  • An implementation that identifies sharing opportunities among queries executed by Spark and uses them to optimize the queries.
  • An experimental validation of the developed system.