Master project: Accelerating software RDMA with io_uring

Chris Broekema, ASTRON and University of Cambridge
Rob van Nieuwpoort, eScience Center & UvA

1 Introduction

Modern radio astronomy relies on large volumes of instrument data being transported to a compute facility. For security reasons, receiving network data requires the kernel to parse the protocol headers, from the link layer through the application layer, before the payload is delivered to user space. The context switches between kernel and user space, and the need to copy all data from kernel to user space, make this expensive in terms of CPU cycles.

To mitigate the high CPU load and to ensure high and stable throughput, the use of Remote Direct Memory Access (RDMA) is currently being investigated both for the Dutch LOFAR array and for the future Square Kilometre Array. While RDMA performs well and significantly reduces the CPU cycles required for data transport, it needs hardware support and thus introduces a hardware dependency. Furthermore, since RDMA traffic is handled in hardware, it is invisible to the kernel. Therefore, our standard monitoring and tracing tools, such as tcpdump and Wireshark, will not (by default) be able to see this traffic. For development and debugging, a software RDMA implementation called rdma_rxe exists, but it is very slow.

A relatively new addition to the Linux kernel is io_uring. This interface allows a user space program to define submission and completion ring buffers that are shared with the kernel. In modern Linux kernel versions these shared ring buffers can be used with about 28 different system calls. While context switches are still required, data no longer needs to be copied from kernel space to user space. This should allow a better performing software RDMA implementation to be developed.
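To make the ring-buffer idea concrete, the following is a toy model in Python. It is not the kernel interface: `Ring`, `toy_kernel` and `buf_id` are hypothetical names chosen for illustration, and the "kernel" is an ordinary function. The `user_data` and `res` keys are loosely modelled on the fields of the real submission and completion entries. The sketch only shows the pattern: the application queues several requests, hands control over once, and then reads the results from a completion ring, with the payload written directly into a buffer the application owns.

```python
from dataclasses import dataclass, field


@dataclass
class Ring:
    """Fixed-size single-producer, single-consumer ring buffer (toy model)."""
    size: int                      # number of slots; must be a power of two
    head: int = 0                  # index of the next entry the consumer reads
    tail: int = 0                  # index of the next entry the producer writes
    slots: list = field(default_factory=list)

    def __post_init__(self):
        if self.size <= 0 or self.size & (self.size - 1):
            raise ValueError("size must be a power of two")
        self.slots = [None] * self.size

    def is_full(self) -> bool:
        return self.tail - self.head == self.size

    def push(self, entry) -> bool:
        if self.is_full():
            return False
        self.slots[self.tail & (self.size - 1)] = entry
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:   # ring is empty
            return None
        entry = self.slots[self.head & (self.size - 1)]
        self.head += 1
        return entry


def toy_kernel(sq: Ring, cq: Ring, buffers: dict) -> int:
    """Hypothetical stand-in for the kernel side: drain the submission ring,
    write a fake payload into the caller's buffer, post one completion each."""
    done = 0
    while not cq.is_full() and (req := sq.pop()) is not None:
        buf = buffers[req["buf_id"]]
        payload = b"packet-%d" % req["user_data"]
        buf[:len(payload)] = payload   # written in place into the caller's buffer
        cq.push({"user_data": req["user_data"], "res": len(payload)})
        done += 1
    return done


if __name__ == "__main__":
    sq, cq = Ring(4), Ring(8)
    buffers = {0: bytearray(64), 1: bytearray(64)}
    sq.push({"buf_id": 0, "user_data": 1})
    sq.push({"buf_id": 1, "user_data": 2})
    toy_kernel(sq, cq, buffers)        # one hand-over handles both requests
    while (cqe := cq.pop()) is not None:
        print(cqe["user_data"], cqe["res"])
```

In the real interface the rings live in memory mapped into both the kernel and the process, so the hand-over is a system call (or is avoided entirely with kernel-side polling) rather than a function call.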

2 Software RDMA using io_uring

This project aims to take the existing software RDMA implementation in the Linux kernel and investigate whether performance can be improved using the io_uring interface. The project will be divided into the following phases:

1. initial investigation, including:
(a) identify, build and commission a test-bed
(b) get familiar with io_uring, liburing and rdma_rxe
(c) baseline measurements with rdma_rxe, normal Ethernet and hardware RDMA
2. identify the performance hot spot of the rdma_rxe interface (using profiling)
3. implement an io_uring-accelerated version of the software RDMA interface
4. measure and document performance
5. write report, clean up and push code to public repository
6. (stretch goal) write a merge request for upstream Linux adoption

Considering the need to work with somewhat complex and low-level existing code, we suggest this should be a rather large project for a master's student. A small proof of concept at a higher level has been developed previously, using the convenient high-level library liburing. It is unclear whether this library can be used for software RDMA, or whether the bare io_uring interface must be used. We do expect that the latter will be required for an eventual merge request to be accepted.

3 Performance

Performance measurements will focus on observed throughput and on CPU use during that throughput. We suggest that, in addition to this, continuous high-frequency energy measurements during throughput experiments would yield valuable additional information.
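As a rough illustration of the two primary metrics, the sketch below times an arbitrary transfer and derives throughput and CPU cost from wall-clock and process CPU time. It is only a sketch under stated assumptions: `measure` and `socketpair_transfer` are hypothetical helper names, the workload is a local socket pair rather than RDMA or Ethernet traffic, and the CPU time covers only the measuring process (user plus system time). Energy measurements are not covered here.

```python
import socket
import time


def measure(transfer, nbytes: int) -> dict:
    """Run transfer() once; report throughput and CPU cost for nbytes moved."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    transfer()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0     # user + system CPU time of this process
    return {
        "gbit_per_s": nbytes * 8 / wall / 1e9,
        "cpu_seconds": cpu,
        "cpu_utilisation": cpu / wall,   # 1.0 means one core fully busy
        "cpu_ns_per_byte": cpu / nbytes * 1e9,
    }


def socketpair_transfer(nbytes: int, chunk: int = 4096):
    """Hypothetical stand-in workload: push nbytes through a local socket pair.

    Every chunk is copied into and out of the kernel, which is the kind of
    per-byte cost the measurements are meant to expose.
    """
    def run():
        tx, rx = socket.socketpair()
        payload = bytes(chunk)
        buf = bytearray(chunk)
        try:
            sent = 0
            while sent < nbytes:
                n = min(chunk, nbytes - sent)
                tx.sendall(payload[:n])
                got = 0
                while got < n:           # recv may return fewer bytes than asked
                    got += rx.recv_into(buf, n - got)
                sent += n
        finally:
            tx.close()
            rx.close()
    return run


if __name__ == "__main__":
    total = 64 * 1024 * 1024
    for name, value in measure(socketpair_transfer(total), total).items():
        print(f"{name}: {value:.3f}")
```

Reporting CPU time per byte alongside throughput makes runs comparable even when they reach different data rates, which is useful when comparing rdma_rxe, normal Ethernet and hardware RDMA baselines.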

Another aspect that can be measured is the impact of RDMA on energy efficiency, see also:

Przemyslaw Lenkiewicz, P. Chris Broekema, and Bernard Metzler:
Energy-Efficient Data Transfers in Radio Astronomy with Software UDP RDMA
The 3rd International Workshop on Innovating the Network for Data Intensive Science (INDIS) 2016