TY - GEN
T1 - Albis
T2 - 2018 USENIX Annual Technical Conference, USENIX ATC 2018
AU - Trivedi, Animesh
AU - Stuedi, Patrick
AU - Pfefferle, Jonas
AU - Schuepbach, Adrian
AU - Metzler, Bernard
PY - 2020/1/1
Y1 - 2020/1/1
AB - Over the last decade, a variety of external file formats such as Parquet, ORC, and Arrow have been developed to store large volumes of relational data in the cloud. As high-performance networking and storage devices are used pervasively to process this data in frameworks like Spark and Hadoop, we observe that none of the popular file formats are capable of delivering data access rates close to the hardware. Our analysis suggests that multiple antiquated notions about the nature of I/O in a distributed setting, and a preference for “storage efficiency” over performance, are the key reasons for this gap. In this paper, we present Albis, a high-performance file format for storing relational data on modern hardware. Albis is built upon two key principles: (i) reduce the CPU cost by keeping the data/metadata storage format simple; (ii) use a binary API for efficient object management to avoid unnecessary object materialization. In our evaluation, we demonstrate that in micro-benchmarks Albis delivers 1.9–21.4× higher bandwidth than other formats. At the workload level, Albis in Spark/SQL reduces the runtimes of TPC-DS queries by up to 3×.
UR - http://www.scopus.com/inward/record.url?scp=85077460390&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85077460390&partnerID=8YFLogxK
M3 - Conference contribution
T3 - Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018
SP - 615
EP - 629
BT - Proceedings of the 2018 USENIX Annual Technical Conference, USENIX ATC 2018
PB - USENIX Association
Y2 - 11 July 2018 through 13 July 2018
ER -