TY - JOUR
T1 - The workflow trace archive
T2 - Open-access data from public and private computing infrastructures
AU - Versluis, Laurens
AU - Matha, Roland
AU - Talluri, Sacheendra
AU - Hegeman, Tim
AU - Prodan, Radu
AU - Deelman, Ewa
AU - Iosup, Alexandru
PY - 2020/9
Y1 - 2020/9
N2 - Realistic, relevant, and reproducible experiments often need input traces collected from real-world environments. In this work, we focus on traces of workflows - common in datacenters, clouds, and HPC infrastructures. We show that the state-of-the-art in using workflow-traces raises important issues: (1) the use of realistic traces is infrequent and (2) the use of realistic, open-access traces even more so. Alleviating these issues, we introduce the Workflow Trace Archive (WTA), an open-access archive of workflow traces from diverse computing infrastructures and tooling to parse, validate, and analyze traces. The WTA includes >48 million workflows captured from >10 computing infrastructures, representing a broad diversity of trace domains and characteristics. To emphasize the importance of trace diversity, we characterize the WTA contents and analyze in simulation the impact of trace diversity on experiment results. Our results indicate significant differences in characteristics, properties, and workflow structures between workload sources, domains, and fields.
AB - Realistic, relevant, and reproducible experiments often need input traces collected from real-world environments. In this work, we focus on traces of workflows - common in datacenters, clouds, and HPC infrastructures. We show that the state-of-the-art in using workflow-traces raises important issues: (1) the use of realistic traces is infrequent and (2) the use of realistic, open-access traces even more so. Alleviating these issues, we introduce the Workflow Trace Archive (WTA), an open-access archive of workflow traces from diverse computing infrastructures and tooling to parse, validate, and analyze traces. The WTA includes >48 million workflows captured from >10 computing infrastructures, representing a broad diversity of trace domains and characteristics. To emphasize the importance of trace diversity, we characterize the WTA contents and analyze in simulation the impact of trace diversity on experiment results. Our results indicate significant differences in characteristics, properties, and workflow structures between workload sources, domains, and fields.
KW - Archive
KW - Characterization
KW - Open-access
KW - Open-source
KW - Simulation
KW - Survey
KW - Traces
KW - Workflow
UR - http://www.scopus.com/inward/record.url?scp=85085175954&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85085175954&partnerID=8YFLogxK
U2 - 10.1109/TPDS.2020.2984821
DO - 10.1109/TPDS.2020.2984821
M3 - Article
AN - SCOPUS:85085175954
SN - 1045-9219
VL - 31
SP - 2170
EP - 2184
JO - IEEE Transactions on Parallel and Distributed Systems
JF - IEEE Transactions on Parallel and Distributed Systems
IS - 9
M1 - 9066946
ER -