TY - GEN
T1 - An Empirical Study on the Energy Usage and Performance of Pandas and Polars Data Analysis Python Libraries
AU - Nahrstedt, Felix
AU - Karmouche, Mehdi
AU - Bargieł, Karolina
AU - Banijamali, Pouyeh
AU - Nalini Pradeep Kumar, Apoorva
AU - Malavolta, Ivano
N1 - Publisher Copyright: © 2024 ACM.
PY - 2024/6/18
Y1 - 2024/6/18
N2 - Context. Python's growing popularity in data analysis and the contemporary emphasis on energy-efficient software tools necessitate an investigation into the energy implications of data operations, particularly in resource-intensive domains like data science. Goal. We aim to assess the energy usage of Pandas, a widely used Python data manipulation library, and Polars, a Rust-based library known for its performance. The study aims to provide insights for data scientists by identifying scenarios where one library outperforms the other in terms of energy usage, while exploring the possible correlations between energy and performance metrics. Method. We performed four separate experiment blocks including 8 Data Analysis Tasks (DATs) from an official TPCH benchmark by Polars and 6 synthetic DATs. Both DAT groups are run with small and large dataframes for both libraries. Results. Polars is more energy-efficient than Pandas when manipulating large dataframes. For small dataframes, the TPCH benchmark DATs do not show significant differences, while for the synthetic DATs, Polars performs significantly better. We identified strong positive correlations between energy usage and execution time, as well as memory usage, for Pandas, while Polars did not show significant memory usage correlations for the majority of runs. There is a significant negative correlation between energy usage and CPU usage for Pandas. Conclusions. We recommend using Polars for energy-efficient and fast data analysis, emphasizing the importance of CPU core utilization in library selection.
UR - http://www.scopus.com/inward/record.url?scp=85197419967&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85197419967&partnerID=8YFLogxK
U2 - 10.1145/3661167.3661203
DO - 10.1145/3661167.3661203
M3 - Conference contribution
AN - SCOPUS:85197419967
T3 - ACM International Conference Proceeding Series
SP - 58
EP - 68
BT - EASE 2024
PB - Association for Computing Machinery
T2 - 28th International Conference on Evaluation and Assessment in Software Engineering, EASE 2024
Y2 - 18 June 2024 through 21 June 2024
ER -