ScootR: Scaling R dataframes on dataflow systems

Andreas Kunft, Cosmin Basca, Tilmann Rabl, Lukas Stadler, Jens Meiners, Juan Fumero, Daniele Bonetta, Sebastian Breß, Volker Markl

Research output: Chapter in Book / Report / Conference proceeding; Conference contribution; Academic; peer-reviewed

Abstract

To cope with today’s large scale of data, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced some forms of multi-language integrations, e.g., for Python and R. However, this results in data exchange between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to execute R in dataflow systems. ScootR tightly integrates the dataflow and R language runtime by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization and inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
Original language: English
Title of host publication: SoCC 2018 - Proceedings of the 2018 ACM Symposium on Cloud Computing
Publisher: Association for Computing Machinery, Inc
Pages: 288-300
ISBN (Electronic): 9781450360111
Publication status: Published - 11 Oct 2018
Externally published: Yes
Event: 2018 ACM Symposium on Cloud Computing, SoCC 2018 - Carlsbad, United States
Duration: 11 Oct 2018 to 13 Oct 2018

Conference

Conference: 2018 ACM Symposium on Cloud Computing, SoCC 2018
Country/Territory: United States
City: Carlsbad
Period: 11/10/18 to 13/10/18

Funding

Acknowledgments. This work has been supported through grants by the German Science Foundation (DFG) as FOR 1306 Stratosphere, by the German Ministry for Education and Research as Berlin Big Data Center BBDC (funding mark 01IS14013A), and by Oracle Labs.

Funders (funder number, where given):
German Ministry for Education and Research as Berlin Big Data Center BBDC: 01IS14013A
German Science Foundation
Oracle
Deutsche Forschungsgemeinschaft
Technische Universität Berlin
Deutsche Stiftung Sklerodermie
