Abstract
Large-scale computing infrastructures, in particular cloud datacenters, have become a backbone of modern society. Their usage, presence, and scale are increasing rapidly, a trend that will continue for the foreseeable future. In line with Jevons’ paradox, as more computing power becomes available and (economically) accessible, new and existing domains have correspondingly become more demanding of computing power. The scientific, engineering, and industrial domains have all explored and embraced this increase in computational power. Examples include weather forecasting (scientific domain), computational fluid dynamics (engineering domain), and drug discovery (industrial/scientific domains).
At the same time, datacenters have come under increasing scrutiny from the public and elected officials. The physical space they require, their energy and water consumption, and their carbon footprint have all grown and have accordingly received more attention. It is therefore vital that we improve our understanding of these systems and gain new insights that can be leveraged to improve their performance and/or reduce their resource consumption, especially since it is desired – and even enforced by some countries and unions – that society as a whole reduces its electricity and fossil fuel consumption. However, there is currently a gap in research aimed at improving our understanding of these systems. There is a lack of open-source workload and machine data from these datacenters, and the data that are available often contain only a fraction of what today’s systems can offer us. This prevents us from understanding why these systems exhibit the performance they do and how to improve it. Other works consider only a single application domain, which raises questions about their applicability to other domains. In addition, machine and workload data are rarely analyzed and stored in tandem, and never using fine-grained, low-level metrics. Understanding what was executed where, and being able to observe the behavior of both the jobs and the machine(s) in detail during execution, can deepen our understanding of these systems. Finally, there is the reproducibility crisis in academia: in some cases, the results of prior work cannot be reproduced or independently verified. This not only hampers the adoption of proposed solutions, it also prevents academia from gaining trust that its work is meaningful and applicable elsewhere.
To bridge this gap, in this thesis we survey, create, and discuss new and existing approaches to scheduling jobs inside these computing infrastructures, and we investigate real-world workloads that have been processed by such infrastructures. Most of our work focuses on a specific type of job structure: workflows. Workflows are used in many domains and fields to represent dependencies between tasks. Their added complexity and popularity make this type of job appealing for research, as improvements can likely be made and their impact can be profound.
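To make the notion of a workflow concrete, the sketch below (not taken from the thesis; the task names and dependency structure are hypothetical) models a workflow as a directed acyclic graph (DAG) of tasks and derives an execution order that respects the dependencies, which is the basic problem that workflow schedulers address.

```python
# Minimal sketch, assuming a workflow is modeled as a DAG of tasks.
# The workflow below is a hypothetical example, not one studied in the thesis.
from collections import deque

# Each task lists the tasks it depends on.
workflow = {
    "ingest":    [],
    "clean":     ["ingest"],
    "simulate":  ["clean"],
    "analyze":   ["simulate"],
    "visualize": ["analyze"],
    "report":    ["analyze", "visualize"],
}

def topological_order(dag):
    """Return tasks in an order that respects all dependencies (Kahn's algorithm)."""
    indegree = {task: len(deps) for task, deps in dag.items()}
    dependents = {task: [] for task in dag}
    for task, deps in dag.items():
        for dep in deps:
            dependents[dep].append(task)
    ready = deque(task for task, d in indegree.items() if d == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for nxt in dependents[task]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(dag):
        raise ValueError("workflow contains a cycle")
    return order

print(topological_order(workflow))
# e.g. ['ingest', 'clean', 'simulate', 'analyze', 'visualize', 'report']
```

A real workflow scheduler additionally decides *where* each ready task runs and may execute independent tasks in parallel; the ordering step above is only the dependency-handling core.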
| Original language | English |
| --- | --- |
| Qualification | PhD |
| Awarding Institution | |
| Supervisors/Advisors | |
| Award date | 10 Dec 2024 |
| DOIs | |
| Publication status | Published - 10 Dec 2024 |