Type: Research Highlight
Title: Implicit Parallelism through Deep Language Embedding
Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, Volker Markl
Parallel collection processing based on second-order functions such as map and reduce has been widely adopted for scalable data analysis. Initially popularized by Google, over the past decade this programming paradigm has found its way into the core APIs of parallel dataflow engines such as Hadoop’s MapReduce, Spark’s RDDs, and Flink’s DataSets. We review programming patterns typical of these APIs and discuss how they relate to the underlying parallel execution model. We argue that fixing the abstraction leaks exposed by these patterns will reduce the cost of data analysis by improving programmer productivity. To achieve that, we first revisit the algebraic foundations of parallel collection processing. Based on that, we propose a simplified API that (i) provides proper support for nested collection processing and (ii) alleviates the need for certain second-order primitives through comprehensions – a declarative syntax akin to SQL. Finally, we present a metaprogramming pipeline that performs algebraic rewrites and physical optimizations, which allow us to target parallel dataflow engines like Spark and Flink with competitive performance.
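To make the contrast between second-order primitives and comprehensions concrete, the following is a minimal sketch in plain Python on local lists. It is not the API proposed in the paper and does not run on Spark or Flink; the data and the function names (`customers`, `orders`, `big_orders_second_order`, `big_orders_comprehension`) are hypothetical and chosen only for illustration. Both functions compute the same join-and-filter result, once with explicit `map`, `filter`, and `reduce`, and once with a comprehension that reads much like a SQL query.

```python
from functools import reduce

# Hypothetical toy data: (customer_id, name) and (customer_id, amount).
customers = [(1, "Ada"), (2, "Grace"), (3, "Edsger")]
orders = [(1, 30.0), (1, 12.5), (2, 99.0), (3, 5.0)]


def big_orders_second_order(min_amount):
    """Join customers with their large orders using map/filter/reduce."""

    def orders_of(customer):
        cid, name = customer
        # Keep only this customer's orders above the threshold ...
        matching = filter(
            lambda o: o[0] == cid and o[1] >= min_amount, orders
        )
        # ... and project each match to a (name, amount) pair.
        return list(map(lambda o: (name, o[1]), matching))

    # Concatenate the per-customer results (a hand-written flatMap).
    return reduce(lambda acc, c: acc + orders_of(c), customers, [])


def big_orders_comprehension(min_amount):
    """The same join written declaratively as a comprehension."""
    return [
        (name, amount)
        for (cid, name) in customers
        for (oid, amount) in orders
        if cid == oid and amount >= min_amount
    ]


if __name__ == "__main__":
    print(big_orders_second_order(20.0))
    print(big_orders_comprehension(20.0))
```

In the comprehension the join condition and the filter sit side by side in one expression. In the second-order version the same logic is spread over several lambdas and an explicit concatenation step.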
Alexander Alexandrov is a PhD student at the Database Systems and Information Management (DIMA) Group at Technische Universität Berlin, working on the Stratosphere project.
His research is focused on bridging the gap between the demands of modern Big Data analysis platforms and the need for high-level, declarative analytics languages. To achieve that, he relies on suitable algebraic foundations (like monads) that both (i) capture the essence of modern data-parallel runtimes, and (ii) allow for a high-level, declarative syntax at the language level. The goal is to design and build an embedded language for data-parallel analysis based on this algebraic foundation.
Alexander is also interested in methods and techniques for scalable data generation and benchmarking of data analysis platforms.
Asterios Katsifodimos is a senior researcher at the DIMA group at the Technische Universität Berlin. His research focuses on programming models and query optimisation for scalable data analytics. Asterios received his PhD in 2013 from INRIA Saclay and Université Paris-Sud under the supervision of Ioana Manolescu. His thesis focused on materialized view-based techniques for the management of Web Data. Asterios has been a member of the High Performance Computing Lab at the University of Cyprus, where he obtained his BSc and MSc degrees in 2009.
Georgi Krastev has a B.Sc. degree in Software Engineering from Sofia University. He is currently studying for a master’s degree in Computer Science at TU Berlin and works as a research assistant at the Database Systems and Information Management group there. His main areas of interest are programming languages, scalable data processing, and machine learning.
Volker Markl is a Full Professor and Chair of the Database Systems and Information Management Group at the Technische Universität Berlin (TUB) and an Adjunct Full Professor at the University of Toronto. He is Director of the Intelligent Analytics for Massive Data Research Group at DFKI and Director of the Berlin Big Data Center. In addition, he serves as the Secretary of the VLDB Endowment. His current research interests include new hardware architectures for information management, scalable processing and optimization of declarative data analysis programs, and scalable data science. To date, Volker has presented over 200 invited talks in numerous industrial settings, major conferences, and research institutions worldwide. Furthermore, he has authored and published over 100 research papers at world-class scientific venues. Between 2010 and 2016, he was Speaker and Principal Investigator of the Stratosphere Research Unit funded by the German Research Foundation (DFG), which resulted in numerous top-tier publications, as well as the Apache Flink big data analytics system. In 2014, he was named one of Germany’s leading digital minds (Digitale Köpfe) by the German Informatics Society. Prior to joining TUB, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California.