Characterization of a big data storage workload in the cloud

Sacheendra Talluri, Cristina L. Abad, Alicja Łuszczak, Alexandru Iosup

Research output: Chapter in Book / Report / Conference proceeding; Conference contribution; Academic; peer-review

Abstract

The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of proliferation of big data specific formats.

Original language: English
Title of host publication: ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
Place of Publication: New York, NY
Publisher: Association for Computing Machinery, Inc
Pages: 33-44
Number of pages: 12
ISBN (Electronic): 9781450362399
DOI: https://doi.org/10.1145/3297663.3310302
Publication status: Published - 4 Apr 2019
Event: 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019 - Mumbai, India
Duration: 7 Apr 2019 to 11 Apr 2019

Conference

Conference: 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019
Country: India
City: Mumbai
Period: 7/04/19 to 11/04/19

Cite this

Talluri, S., Abad, C. L., Łuszczak, A., & Iosup, A. (2019). Characterization of a big data storage workload in the cloud. In ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (pp. 33-44). New York, NY: Association for Computing Machinery, Inc. https://doi.org/10.1145/3297663.3310302
@inproceedings{b0f2a7a7159444128719a39c5196d0f1,
title = "Characterization of a big data storage workload in the cloud",
author = "Sacheendra Talluri and Abad, {Cristina L.} and Alicja Łuszczak and Alexandru Iosup",
year = "2019",
month = "4",
day = "4",
doi = "10.1145/3297663.3310302",
language = "English",
pages = "33--44",
booktitle = "ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering",
publisher = "Association for Computing Machinery, Inc",
}
