Abstract
To cope with today’s large scale of data, parallel dataflow engines such as Hadoop, and more recently Spark and Flink, have been proposed. They offer scalability and performance, but require data scientists to develop analysis pipelines in unfamiliar programming languages and abstractions. To overcome this hurdle, dataflow engines have introduced some forms of multi-language integrations, e.g., for Python and R. However, this results in data exchange between the dataflow engine and the integrated language runtime, which requires inter-process communication and causes high runtime overheads. In this paper, we present ScootR, a novel approach to execute R in dataflow systems. ScootR tightly integrates the dataflow and R language runtime by using the Truffle framework and the Graal compiler. As a result, ScootR executes R scripts directly in the Flink data processing engine, without serialization and inter-process communication. Our experimental study reveals that ScootR outperforms state-of-the-art systems by up to an order of magnitude.
Original language | English |
---|---|
Title of host publication | SoCC 2018 - Proceedings of the 2018 ACM Symposium on Cloud Computing |
Publisher | Association for Computing Machinery, Inc |
Pages | 288-300 |
ISBN (Electronic) | 9781450360111 |
Publication status | Published - 11 Oct 2018 |
Externally published | Yes |
Event | 2018 ACM Symposium on Cloud Computing, SoCC 2018 - Carlsbad, United States. Duration: 11 Oct 2018 → 13 Oct 2018 |
Conference
Conference | 2018 ACM Symposium on Cloud Computing, SoCC 2018 |
---|---|
Country/Territory | United States |
City | Carlsbad |
Period | 11 Oct 2018 → 13 Oct 2018 |
Funding
Acknowledgments. This work has been supported through grants by the German Science Foundation (DFG) as FOR 1306 Stratosphere, by the German Ministry for Education and Research as Berlin Big Data Center BBDC (funding mark 01IS14013A), and by Oracle Labs.
Funders | Funder number |
---|---|
German Ministry for Education and Research as Berlin Big Data Center BBDC | 01IS14013A |
German Science Foundation | |
Oracle | |
Deutsche Forschungsgemeinschaft | |
Technische Universität Berlin | |
Deutsche Stiftung Sklerodermie | |