Abstract
In today's data centers, a diverse mix of throughput-sensitive long flows and delay-sensitive short flows is commonly present. However, commodity switches used in a typical data center network are usually shallow-buffered to reduce queueing delay and deployment cost. As a direct result, long flows occupying the queue can block the transmission of delay-sensitive short flows, degrading their performance. Congestion can also be caused by the synchronization of multiple TCP connections for short flows, as typically seen in the partition/aggregate traffic pattern. Such congestion is usually transient, and any end-device intervention through the timeout-based pathway would result in suboptimal performance. While multiple end-to-end transport-layer solutions have been proposed, none of them has tackled the real challenge: reliable transmission in the network. In this paper, we fill this gap by presenting PABO, a novel link-layer design that can mitigate congestion by temporarily bouncing packets to upstream switches. PABO's design fulfills the following goals: (i) providing per-flow flow control on the link layer, (ii) handling transient congestion without the intervention of end devices, and (iii) gradually back-propagating the congestion signal to the source when the network is not capable of handling the congestion. We present the detailed design of PABO and complete a proof-of-concept implementation. We discuss the impact of system parameters on out-of-order packet delivery and conduct extensive experiments to demonstrate the effectiveness of PABO. We examine the basic properties of PABO using a tree-based topology, and further evaluate its overall performance using a realistic fat-tree topology for data center networks. Experimental results show that PABO offers a clear advantage in mitigating transient congestion and achieves a significant gain in flow completion time.
Original language | English |
---|---|
Pages (from-to) | 1-14 |
Number of pages | 14 |
Journal | Computer Communications |
Volume | 140-141 |
Early online date | 16 Apr 2019 |
Publication status | Published - May 2019 |
Funding
A preliminary version of this paper appeared at IEEE ICC 2017 (Shi et al., 2017) [1]. This work was partially supported by the National Key Research and Development Program of China (grant number 2017YFB1010001) and the National Natural Science Foundation of China (grant numbers 61520106005 and 61761136014). L. Wang was funded by the German Research Foundation (DFG), Germany, project 392046569, and by the subproject C7 within the DFG Collaborative Research Center 1053 (MAKI). The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Funders | Funder number |
---|---|
Deutsche Forschungsgemeinschaft | 1053 (MAKI), 392046569 |
National Natural Science Foundation of China | 61761136014, 61520106005 |
National Basic Research Program of China (973 Program) | 2017YFB1010001 |