TY - GEN
T1 - Efficient Estimation of Read Density when Caching for Big Data Processing
AU - Talluri, Sacheendra
AU - Iosup, Alexandru
PY - 2019/9/23
AB - Big data processing systems are becoming increasingly present in cloud workloads. Consequently, they are starting to incorporate more sophisticated mechanisms from traditional database and distributed systems. In this work, we focus on the use of caching policies, which for big data raise important new challenges. Not only must they respond to new variants of the trade-off between hit rate, response time, and the space consumed by the cache, but they must do so at possibly higher volume and velocity than web and database workloads. Previous caching policies have not been tested experimentally with big data workloads. This paper addresses these challenges. We propose the Read Density family of policies, a principled approach that quantifies the utility of cached objects through a family of utility functions that depend on the frequency of reads of an object. We further design the Approximate Histogram, a policy-based technique built on an array of counters, which promises runtime- and space-efficient computation of the metric required by the cache policy. Through trace-based simulation, we evaluate the caching policies from the Read Density family and compare them with over ten state-of-the-art alternatives. We use two workload traces representative of big data processing, collected from commercial Spark and MapReduce deployments. While we achieve performance comparable to the state of the art with fewer parameters, meaningful performance improvements for big data workloads remain elusive.
UR - http://www.scopus.com/inward/record.url?scp=85073235844&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85073235844&partnerID=8YFLogxK
DO - 10.1109/INFCOMW.2019.8845043
M3 - Conference contribution
SN - 9781728118796
T3 - INFOCOM 2019 - IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2019
SP - 502
EP - 507
BT - Proceedings - IEEE INFOCOM 2019
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2019 INFOCOM IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2019
Y2 - 29 April 2019 through 2 May 2019
ER -