Beneath the SURFace: An MRI-like View into the Life of a 21st Century Datacenter

  • Kristian Valur Laursen Olason (Creator)
  • Alexandru Uta (Leiden University) (Creator)
  • Alexandru Iosup (Creator)
  • Paul Melis (Creator)
  • Damian Podareanu (Creator)
  • Valeriu Codreanu (Creator)
  • Bas van der Vlies (Contributor)
  • Martijn Kruiten (Contributor)
  • Jaap Dijkshoorn (Contributor)

Dataset

Description

This is a trace archive of metrics collected from the Lisa cluster at SURFsara. It accompanies the article that will be published in USENIX ;login: in July 2020.

The GitHub repository, which contains documentation as well as the scripts required to replicate the work from the ;login: article: https://github.com/sara-nl/SURFace

Real-world data can be instrumental in answering detailed questions: How do we know which assumptions regarding large-scale systems are realistic? How do we know that the systems we build are practical? How do we know which metrics are important to assess when analyzing performance? To answer such questions, we need to collect and share operational traces containing real-world, detailed data. Not only is the presence of low-level metrics significant, but their variety also helps avoid biases. To address variety, there exist several types of archives, such as the Parallel Workloads Archive, the Grid Workloads Archive, and the Google or Microsoft logs (the Appendix gives a multi-decade overview). However, such traces mostly focus on higher-level scheduling decisions and high-level, job-based resource utilization (e.g., consumed CPU and memory). Thus, they do not provide vital information to system administrators or researchers analyzing the full-stack or OS-level operation of datacenters.


The traces we are sharing have the finest granularity of all open-source traces published so far. In addition to scheduler-level logs, they contain over <em>100 low-level, server-based metrics, going down to the granularity of page faults or bytes transferred through a NIC</em>.

 

The SURF archive

Datacenters already exhibit unprecedented scale and are becoming increasingly complex. Moreover, such computer systems have begun to have a significant impact on the environment; for example, training some machine learning models has a sizable carbon footprint. As our recent work on modern datacenter networks shows, low-level data is key to understanding full-stack operation, including high-level application behavior. We argue that it is time to start using such data more systematically, unlocking its potential to help us understand how to make (datacenter) systems more efficient. We believe our data can contribute to a more holistic approach that looks at how the multitude of these systems work together in a large-scale datacenter.

 

This archive contains data from the Dutch National Infrastructure, Lisa.

 

Description of the Lisa system

 

Description of the Cartesius system

 

We gather metrics, at 15-second intervals, from several data sources:


Slurm: all job, task, and scheduler related data, such as running time, queueing time, failures, servers involved in the execution, organization in partitions, and scheduling policies.

NVIDIA Management Library (NVML): per GPU, data such as power metrics, temperature, fan speed, or used memory.

IPMI: per server, data such as power metrics and temperature.

OS-level: from <em>procfs</em>, <em>sockstat</em>, or <em>netstat</em>, low-level OS metrics on the state of each server, including CPU, disk, memory, and network utilization, context switches, and interrupts.
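To make the OS-level source concrete, the following is a minimal sketch of how aggregate CPU time, context switches, and interrupts could be read from procfs at a fixed interval. It is not the collection pipeline used for this archive. The function names `parse_proc_stat` and `sample`, the metric key names, and the choice to read only `/proc/stat` are assumptions made for illustration; the 15-second default mirrors the interval stated above.

```python
import time
from typing import Dict, Iterator, Tuple

# Leading fields of the aggregate "cpu" line in /proc/stat, in order.
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_proc_stat(text: str) -> Dict[str, int]:
    """Extract aggregate CPU counters, context switches, and interrupts.

    `text` is the content of /proc/stat. Metric names are hypothetical.
    """
    metrics: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "cpu":  # aggregate line only; per-core lines are "cpu0", "cpu1", ...
            for name, value in zip(CPU_FIELDS, parts[1:]):
                metrics[f"cpu_{name}"] = int(value)
        elif parts[0] == "ctxt":  # total context switches since boot
            metrics["context_switches"] = int(parts[1])
        elif parts[0] == "intr":  # first number is the total interrupt count
            metrics["interrupts"] = int(parts[1])
    return metrics


def sample(count: int, interval_s: float = 15.0,
           path: str = "/proc/stat") -> Iterator[Tuple[float, Dict[str, int]]]:
    """Yield (unix_timestamp, metrics) pairs, one every `interval_s` seconds."""
    for i in range(count):
        with open(path) as f:
            yield time.time(), parse_proc_stat(f.read())
        if i + 1 < count:
            time.sleep(interval_s)
```

On a Linux host, `for ts, m in sample(count=4): print(ts, m)` would print four samples spaced 15 seconds apart. The counters are cumulative since boot, so rates are obtained by differencing consecutive samples.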


 

We also release other kinds of novel information related to datacenter topology and organization.

 

The audience we envision using these metrics is composed of systems researchers, infrastructure developers and designers, system administrators, and software developers for large-scale infrastructure. The frequency of collecting data is uniquely high for open-source data, which could allow these experts unprecedented views into the operation of a real datacenter.


* Note: A number of GPU nodes were added to the system in late February and early March, so these nodes have no GPU metrics for January and February, which may cause irregularities. The GitHub repository will contain code snippets showing how to filter the data so that this is not a problem and how to graph the Parquet data (this update is pending in the next few days).
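Until those snippets are available, one way to handle the late-added GPU nodes can be sketched as follows. This is an illustrative sketch, not the repository's code: it assumes the samples have already been loaded from the Parquet files into `(node, timestamp, value)` tuples, and the node names, the cutoff date, and the function names `nodes_with_late_start` and `drop_nodes` are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Set, Tuple

Sample = Tuple[str, datetime, float]  # (node name, sample time, metric value)


def nodes_with_late_start(samples: Iterable[Sample], cutoff: datetime) -> Set[str]:
    """Return the nodes whose earliest sample is at or after `cutoff`."""
    first_seen = {}
    for node, ts, _value in samples:
        if node not in first_seen or ts < first_seen[node]:
            first_seen[node] = ts
    return {node for node, ts in first_seen.items() if ts >= cutoff}


def drop_nodes(samples: Iterable[Sample], excluded: Set[str]) -> List[Sample]:
    """Keep only the samples from nodes that are not in `excluded`."""
    return [s for s in samples if s[0] not in excluded]


# Hypothetical data: node-b only starts reporting in late February.
samples = [
    ("node-a", datetime(2020, 1, 5), 40.0),
    ("node-a", datetime(2020, 3, 5), 42.0),
    ("node-b", datetime(2020, 2, 27), 55.0),
    ("node-b", datetime(2020, 3, 5), 57.0),
]
cutoff = datetime(2020, 2, 15)  # assumed boundary, chosen for illustration only
late = nodes_with_late_start(samples, cutoff)
full_coverage = drop_nodes(samples, late)
```

Dropping the late nodes keeps every remaining series complete over the whole period. The alternative is to keep all nodes and restrict the analysis to samples after the cutoff, which preserves the new nodes at the cost of a shorter time window.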
Date made available: 2020
Publisher: Zenodo