Issue
I am trying to understand how userspace, zero-copy networking libraries work in Linux.
My example below follows the usage of AF_XDP sockets, but it should (hopefully) generalise to other libraries like netmap and DPDK.
This is my understanding of how packets are read/written:
Setup
- Userspace allocates a contiguous buffer with n packet-sized chunks
- Userspace allocates both an RX and a TX ring that determine when a packet has arrived, and when one is ready to send. Elements of these rings point to chunks in the buffer.
- The buffer, RX and TX rings are registered with the kernel
(In the context of AF_XDP this corresponds to the UMEM and mmapped rx/tx rings)
Read path
- A packet arrives at the NIC
- The NIC DMA's the packet into a chunk in the buffer
- NIC kernel driver notifies userspace by putting an item on the RX ring.
- Userspace polls the ring, sees the new packet, and does something with it.
Write path
- Userspace writes a packet into the packet buffer
- Userspace notifies the kernel by putting an item on the TX ring
- NIC kernel module sends the chunk identified by the item on the TX ring
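The setup, read path and write path above can be modelled in a few lines. This is a toy sketch in plain Python, not the AF_XDP API: `ToyUmem`, `driver_receive`, `app_poll_rx`, `app_send` and `driver_transmit` are hypothetical names, the 2048-byte chunk size is an assumption, and the `free` queue only loosely stands in for the fill and completion rings that real AF_XDP sockets also have. The point it tries to show is that the rings carry only small descriptors, while the packet bytes stay in one shared buffer.

```python
from collections import deque

CHUNK_SIZE = 2048   # assumed chunk size, chosen for illustration
NUM_CHUNKS = 4


class ToyUmem:
    """Toy stand-in for a UMEM: one contiguous buffer split into fixed chunks."""

    def __init__(self, num_chunks=NUM_CHUNKS, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.buf = bytearray(num_chunks * chunk_size)
        self.free = deque(i * chunk_size for i in range(num_chunks))  # free chunk offsets
        self.rx = deque()   # (offset, length) descriptors produced by the "driver"
        self.tx = deque()   # (offset, length) descriptors produced by the application

    def view(self, offset, length):
        # A memoryview slice references the buffer; it does not copy it.
        return memoryview(self.buf)[offset:offset + length]


def driver_receive(umem, frame):
    """Simulate the NIC placing a frame into a free chunk (the DMA step)."""
    offset = umem.free.popleft()
    umem.buf[offset:offset + len(frame)] = frame
    umem.rx.append((offset, len(frame)))


def app_poll_rx(umem):
    """Application side: pop a descriptor and look at the frame in place."""
    if not umem.rx:
        return None
    offset, length = umem.rx.popleft()
    return offset, umem.view(offset, length)


def app_send(umem, frame):
    """Application side: write a frame into a chunk and queue its descriptor."""
    offset = umem.free.popleft()
    umem.view(offset, len(frame))[:] = frame
    umem.tx.append((offset, len(frame)))


def driver_transmit(umem):
    """Simulate the driver sending the chunk named by the next TX descriptor."""
    offset, length = umem.tx.popleft()
    wire = bytes(umem.view(offset, length))  # stands in for the NIC reading via DMA
    umem.free.append(offset)                 # the chunk can be reused afterwards
    return wire
```

In this model the application never receives a copy of a packet: `app_poll_rx` hands back a view into the same `bytearray` the "driver" wrote to.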
Question
Consider an application sending data larger than one MTU, so that it needs to span multiple packets, for example a big HTTP request spanning multiple TCP segments.
This data needs to be turned into a series of IP packets in the packet buffer. How can the zero-copy library enable this operation without copying subsets of the data request into each chunk in the packet buffer?
I know there are syscalls like splice and vmsplice that map memory without copying it, but they require using a pipe, so they're incompatible with the packet buffer (which lives in userspace).
I also thought about mmap, but it must be page aligned (and a page is multiple times larger than the typical MTU) and has significant setup/teardown costs which preclude it from being used on a per-packet basis. See this mailing list post from Linus.
Similarly, how does the reverse work for reading data?
Attempt at answering
Maybe the library wraps the packet buffer in some data structure that exposes read(): bytes and write(bytes) methods. These automatically construct well-formed packets on-the-fly in the packet buffer.
So the application cannot say "here, take this pre-existing buffer and send it with zero copies". Instead, it needs to write that data directly into the packet buffer (using those wrapper methods) from the very beginning.
But this means everything must be deeply integrated with this data structure’s interface at the application level.
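A rough way to picture "write the data directly into the packet buffer from the very beginning" is to hand the data producer a writable view of each chunk's payload area. The sketch below assumes a plain `bytearray` as the packet buffer, a 54-byte header area (Ethernet + IPv4 + TCP without options) and an MSS of 1460; `fill_chunks_from` is a hypothetical helper, and building the headers is left out entirely.

```python
import io

CHUNK_SIZE = 2048
HEADER_LEN = 54   # assumed: 14 (Ethernet) + 20 (IPv4) + 20 (TCP), no options
MSS = 1460        # assumed: 1500-byte MTU minus 40 bytes of IPv4 and TCP headers


def fill_chunks_from(source, umem, num_chunks):
    """Read from `source` straight into the payload area of successive chunks.

    Returns a list of (chunk_offset, payload_length) for the chunks that got data.
    """
    filled = []
    whole = memoryview(umem)
    for i in range(num_chunks):
        base = i * CHUNK_SIZE
        payload = whole[base + HEADER_LEN: base + HEADER_LEN + MSS]
        n = source.readinto(payload)   # producer writes in place, no staging buffer
        if not n:
            break                      # source exhausted
        filled.append((base, n))
    return filled


# Usage: 3072 bytes end up as two full segments and one short one.
data = bytes(range(256)) * 12
umem = bytearray(4 * CHUNK_SIZE)
chunks = fill_chunks_from(io.BytesIO(data), umem, 4)
```

Anything with a `readinto` method (a file, a socket file object) could serve as `source` here. The data is still written once, but it lands in its final place, so there is no second copy from an application buffer into the packet buffer.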
Solution
You seem to be singling out an exact use case here, namely zero-copy transmit of preexisting data using AF_XDP, e.g. high-speed static data serving via HTTP. That is indeed quite hard to do, for exactly the reasons you provided - the application needs to prepare full IP packets to be sent, and the data needs to be in the umem buffers. It may be possible to chunk and copy the data once into the buffers, send them, and then reuse the buffers to send the same data again (you'd need to adjust the IP/TCP headers, but the actual data would already be there).
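The "copy once, then only adjust headers" idea might look roughly like this. It is a sketch under loud assumptions: the 8-byte header (sequence number plus length) is a made-up stand-in for real IP/TCP headers, and `write_header`, `stage_payload` and `frames_for_client` are hypothetical names.

```python
import struct

HEADER_LEN = 8   # hypothetical toy header, not a real IP/TCP header


def write_header(chunk, seq, payload_len):
    """Overwrite only the toy header: 4-byte sequence number, 4-byte length."""
    struct.pack_into("!II", chunk, 0, seq, payload_len)


def stage_payload(umem, chunk_size, data, mss):
    """Copy `data` once into the payload areas; return per-chunk (offset, length)."""
    src = memoryview(data)
    layout = []
    for i, start in enumerate(range(0, len(src), mss)):
        piece = src[start:start + mss]
        base = i * chunk_size
        umem[base + HEADER_LEN: base + HEADER_LEN + len(piece)] = piece
        layout.append((base, len(piece)))
    return layout


def frames_for_client(umem, layout, first_seq):
    """Patch the headers in place and return views of the ready-to-send frames."""
    whole = memoryview(umem)
    frames, seq = [], first_seq
    for base, length in layout:
        write_header(whole[base: base + HEADER_LEN], seq, length)
        frames.append(whole[base: base + HEADER_LEN + length])
        seq += length
    return frames
```

Calling `stage_payload` once and `frames_for_client` per client touches only `HEADER_LEN` bytes per chunk on each resend. In a real setup a chunk could only be patched again after the NIC has finished transmitting it.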
Note that the use case I've chosen - an HTTP static server - would probably actually need to be an HTTPS server; given that, there wouldn't be any possibility of zero-copy, since each client would need to get its own uniquely encrypted data. The application would still need to chunk the outgoing encrypted data into proper MSS-sized segments, store them in separate packets, and fill out the IP/TCP headers properly.
Proper zero-copy in AF_XDP is mostly limited to data ingress scenarios (where you'd still need to be able to process data in chunks, but that is something that can be done reasonably easily), packet routing (a received packet goes out of another port with some minimal in-place modifications) or packet generation (sent buffers are reused to be sent out again and again).
Now, if we venture out of the realm of what AF_XDP can do, there are better(?) possibilities. Modern high-speed NICs like the Mellanox CX5 have built-in scatter-gather engines and can assemble outgoing packets from several disjoint memory chunks; if you can tap into that, you could just generate IP headers and "stitch" data onto them to form complete packets, chunked any way you need them to be, all without copying data. I think DPDK has this ability, for one. Note that you'd still need to put your data into some locked-in-physmem memory block, which probably needs to be preregistered with the kernel/NIC (so it can DMA from it).
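As a software analogy for that "stitching", the standard `sendmsg` call accepts a gather list, so one datagram can be described as a small freshly built header plus a slice of an existing buffer. This is only an illustration of the gather-list idea: here the kernel still copies the bytes into its own buffers, whereas a NIC scatter-gather engine would DMA them directly. `send_gathered` and the 8-byte toy header are hypothetical.

```python
import socket
import struct


def send_gathered(sock, payload, mss):
    """Send `payload` as datagrams of header + slice, one gather list per datagram."""
    view = memoryview(payload)
    sent = 0
    for seq, start in enumerate(range(0, len(view), mss)):
        piece = view[start:start + mss]               # slice of the original buffer, no copy
        header = struct.pack("!II", seq, len(piece))  # hypothetical 8-byte toy header
        sock.sendmsg([header, piece])                 # two disjoint buffers, one datagram
        sent += 1
    return sent


# Usage over a local datagram socket pair: 200 bytes become 3 datagrams.
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
payload = bytes(range(200))
count = send_gathered(a, payload, 80)
first = b.recv(4096)
second = b.recv(4096)
third = b.recv(4096)
a.close()
b.close()
```

The application never builds a contiguous "header + data" buffer itself; it only describes where the pieces live, which is the property the hardware scatter-gather path relies on.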
PS: Linux 6.6 got multi-buffer support for AF_XDP. I kinda thought it was for jumbo frames, but now I think it doesn't have to be just for that. It could probably be used to scatter-gather a normal packet as well (still, the data needs to be prepared ahead of time in some chunks in the umem, so it's not much better than before 6.6).
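To make the multi-buffer idea concrete, here is a toy model of one packet described by several descriptors. It assumes the convention from the kernel's AF_XDP documentation that a "continued" flag in a descriptor's options field marks every fragment except the last. The constant below borrows the kernel's name `XDP_PKT_CONTD`, but its value and the tuple layout are assumptions for illustration only, and `describe_packet` and `reassemble` are hypothetical helpers.

```python
XDP_PKT_CONTD = 1 << 0   # assumed to mirror the flag of the same name in linux/if_xdp.h


def describe_packet(fragments):
    """Turn [(addr, length), ...] into (addr, length, options) descriptors for one packet."""
    descs = []
    for i, (addr, length) in enumerate(fragments):
        more = i < len(fragments) - 1
        descs.append((addr, length, XDP_PKT_CONTD if more else 0))
    return descs


def reassemble(umem, descs):
    """Walk descriptors as a consumer would; return one bytes object per packet."""
    packets, current = [], bytearray()
    for addr, length, options in descs:
        current += umem[addr:addr + length]
        if not options & XDP_PKT_CONTD:   # last fragment of this packet
            packets.append(bytes(current))
            current = bytearray()
    return packets


# Usage: a header in one chunk and the data in another form a single packet.
umem = bytearray(8192)
umem[0:4] = b"HDR:"
umem[4096:4101] = b"hello"
descs = describe_packet([(0, 4), (4096, 5)])
```

Both fragments still have to live in the umem, which matches the caveat above: the header and the data can sit in different chunks, but the data must already be there.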
PPS: depending on your particular application, there might be ways to do zero-copy transmit with the kernel network stack - and it is great, e.g. it can offload TCP segmentation and even TLS onto the NIC if the NIC supports it. It can't beat a properly made userspace stack though (simply because with that you can still leverage all the offloads like TSO or checksums and still cut away all the checks, hooks, memory management etc. that the kernel network stack has to do) - but it imposes certain restrictions, rather harsh restrictions sometimes, and your application needs to adapt to them.
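One kernel-stack path of this kind, which the answer does not name explicitly, is `sendfile`: the application passes a file and a socket and never touches the bytes itself. A minimal sketch using Python's `socket.sendfile`, which uses `os.sendfile` where available and otherwise falls back to plain `send`; `serve_file` is a hypothetical helper, and whether the transfer is truly copy-free depends on the socket type, NIC and kernel.

```python
import socket
import tempfile


def serve_file(sock, path):
    """Send a whole file over a connected stream socket; returns bytes sent."""
    with open(path, "rb") as f:
        return sock.sendfile(f)


# Usage over a local stream socket pair.
a, b = socket.socketpair()
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 1000)
    path = tmp.name
sent = serve_file(a, path)
a.close()
received = b""
while True:
    part = b.recv(4096)
    if not part:
        break
    received += part
b.close()
```

The restriction mentioned above shows up immediately: the data has to exist as a file (or something the kernel can map), and the kernel, not the application, decides how it is segmented.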
Answered By - Andrey Turkin Answer Checked By - Senaida (WPSolving Volunteer)