Using model checking to analyze the system behavior of the LHC production grid

D. Remenska, T.A.C. Willemse, K. Verstoep, J.A. Templon, H.E. Bal

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

DIRAC (Distributed Infrastructure with Remote Agent Control) is the grid solution designed to support production activities as well as user data analysis for the Large Hadron Collider "beauty" experiment. It consists of cooperating distributed services and a plethora of lightweight agents delivering the workload to the grid resources. Services accept requests from agents and running jobs, while agents actively fulfill specific goals. Services maintain database back-ends to store dynamic state information of entities such as jobs, queues, or requests for data transfer. Agents continuously check for changes in the service states and react to these accordingly. The logic of each agent is rather simple; the main source of complexity lies in their cooperation. These agents run concurrently and communicate using the services' databases as a shared memory for synchronizing the state transitions. Despite the effort invested in making DIRAC reliable, entities occasionally get into inconsistent states. Tracing and fixing such behaviors is difficult, given the inherent parallelism among the distributed components and the size of the implementation. In this paper we present an analysis of DIRAC with mCRL2, a process algebra with data. We have reverse engineered two critical and related DIRAC subsystems, and subsequently modeled their behavior with the mCRL2 toolset. This enabled us to easily locate race conditions and livelocks that were confirmed to occur in the real system. We further formalized and verified several behavioral properties of the two modeled subsystems. © 2013 Elsevier B.V. All rights reserved.
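The abstract describes agents that synchronize through a shared database using a check-then-act pattern, which is exactly where the reported race conditions arise. The following is a minimal Python sketch, not DIRAC code: all names are hypothetical, and a plain dict stands in for the service database. It shows one interleaving in which two agents both observe a job in the "Waiting" state and both claim it, the kind of erroneous interleaving a model checker like mCRL2 enumerates exhaustively.

```python
# Minimal sketch (not DIRAC code): two agents use a shared "database"
# dict as shared memory. Each performs a non-atomic check-then-act:
# read the job state, and if it is "Waiting", claim the job. In the
# interleaving below, both agents read before either writes, so both
# see "Waiting" and both claim the same job -- a lost-update race.

db = {"job1": "Waiting"}  # hypothetical job-state table

def read_state(job):
    # Non-atomic read of the shared state.
    return db[job]

def claim(job, agent):
    # Non-atomic write: overwrites whatever is there.
    db[job] = f"Assigned:{agent}"
    return agent

# One specific interleaving: both reads happen before either write.
seen_a = read_state("job1")   # agent A observes "Waiting"
seen_b = read_state("job1")   # agent B also observes "Waiting"

claims = []
if seen_a == "Waiting":
    claims.append(claim("job1", "A"))
if seen_b == "Waiting":
    claims.append(claim("job1", "B"))

print(claims)        # -> ['A', 'B']: both agents claimed the job
print(db["job1"])    # -> Assigned:B: only the last writer is recorded
```

A model checker explores every such interleaving of the agents' reads and writes, rather than the handful a test run happens to exercise, which is how races like this can be located systematically.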
Original language: English
Pages (from-to): 2239-2251
Journal: Future Generation Computer Systems
Volume: 29
Issue number: 8
DOI: 10.1016/j.future.2013.06.004
Publication status: Published - 2013

Fingerprint

Model checking
Hazards and race conditions
Colliding beam accelerators
Data transfer
Algebra
Data storage equipment

Cite this

Remenska, D.; Willemse, T.A.C.; Verstoep, K.; Templon, J.A.; Bal, H.E. / Using model checking to analyze the system behavior of the LHC production grid. In: Future Generation Computer Systems. 2013; Vol. 29, No. 8, pp. 2239-2251.
@article{d45a8ead6add428789c04d19baca6139,
  title = "Using model checking to analyze the system behavior of the LHC production grid",
  abstract = "DIRAC (Distributed Infrastructure with Remote Agent Control) is the grid solution designed to support production activities as well as user data analysis for the Large Hadron Collider {"}beauty{"} experiment. It consists of cooperating distributed services and a plethora of lightweight agents delivering the workload to the grid resources. Services accept requests from agents and running jobs, while agents actively fulfill specific goals. Services maintain database back-ends to store dynamic state information of entities such as jobs, queues, or requests for data transfer. Agents continuously check for changes in the service states and react to these accordingly. The logic of each agent is rather simple; the main source of complexity lies in their cooperation. These agents run concurrently and communicate using the services' databases as a shared memory for synchronizing the state transitions. Despite the effort invested in making DIRAC reliable, entities occasionally get into inconsistent states. Tracing and fixing such behaviors is difficult, given the inherent parallelism among the distributed components and the size of the implementation. In this paper we present an analysis of DIRAC with mCRL2, a process algebra with data. We have reverse engineered two critical and related DIRAC subsystems, and subsequently modeled their behavior with the mCRL2 toolset. This enabled us to easily locate race conditions and livelocks that were confirmed to occur in the real system. We further formalized and verified several behavioral properties of the two modeled subsystems. {\circledC} 2013 Elsevier B.V. All rights reserved.",
  author = "D. Remenska and T.A.C. Willemse and K. Verstoep and J.A. Templon and H.E. Bal",
  year = "2013",
  doi = "10.1016/j.future.2013.06.004",
  language = "English",
  volume = "29",
  pages = "2239--2251",
  journal = "Future Generation Computer Systems",
  issn = "0167-739X",
  publisher = "Elsevier",
  number = "8",
}

Using model checking to analyze the system behavior of the LHC production grid. / Remenska, D.; Willemse, T.A.C.; Verstoep, K.; Templon, J.A.; Bal, H.E.

In: Future Generation Computer Systems, Vol. 29, No. 8, 2013, pp. 2239-2251.


TY  - JOUR
T1  - Using model checking to analyze the system behavior of the LHC production grid
AU  - Remenska, D.
AU  - Willemse, T.A.C.
AU  - Verstoep, K.
AU  - Templon, J.A.
AU  - Bal, H.E.
PY  - 2013
Y1  - 2013
N2  - DIRAC (Distributed Infrastructure with Remote Agent Control) is the grid solution designed to support production activities as well as user data analysis for the Large Hadron Collider "beauty" experiment. It consists of cooperating distributed services and a plethora of lightweight agents delivering the workload to the grid resources. Services accept requests from agents and running jobs, while agents actively fulfill specific goals. Services maintain database back-ends to store dynamic state information of entities such as jobs, queues, or requests for data transfer. Agents continuously check for changes in the service states and react to these accordingly. The logic of each agent is rather simple; the main source of complexity lies in their cooperation. These agents run concurrently and communicate using the services' databases as a shared memory for synchronizing the state transitions. Despite the effort invested in making DIRAC reliable, entities occasionally get into inconsistent states. Tracing and fixing such behaviors is difficult, given the inherent parallelism among the distributed components and the size of the implementation. In this paper we present an analysis of DIRAC with mCRL2, a process algebra with data. We have reverse engineered two critical and related DIRAC subsystems, and subsequently modeled their behavior with the mCRL2 toolset. This enabled us to easily locate race conditions and livelocks that were confirmed to occur in the real system. We further formalized and verified several behavioral properties of the two modeled subsystems. © 2013 Elsevier B.V. All rights reserved.
DO  - 10.1016/j.future.2013.06.004
M3  - Article
VL  - 29
SP  - 2239
EP  - 2251
JO  - Future Generation Computer Systems
JF  - Future Generation Computer Systems
SN  - 0167-739X
IS  - 8
ER  -