Sendfile with io_uring (almost)

Inspired by the flurry of discussions around io_uring development, I wanted to try writing a basic efficient HTTP file server with liburing, and especially with sendfile functionality, since it was “left as an exercise for the reader”.


Blocking vs Non-Blocking

In brief (because no one has ever written a server with io_uring /s), dealing with I/O is usually synchronous (blocking) or asynchronous (non-blocking). ~Blocking from the perspective of the execution context (yielding coroutines notwithstanding).

In a blocking model, the application blocks inline on system calls, halting/waiting for the OS to return requested resources before proceeding.

The psuedo code for a blocking TCP server might look like:

while true:

  # accept connection
  fd = accept(listen.fd, ...

  # read request
  read(fd, ...
  # handle request

  # write response back
  write(fd, ...

As opposed to an asynchronous or non-blocking service, which wakes up on events like: “is readable”, “is writeable”, timed-out etc. The event based program services requests per event with something like a state machine -picking up where it last left off, before the OS told the program to try again later (EAGAIN/EWOULDBLOCK).

The general shape of an async TCP server might be:

while true:

  # wait forever for events
  events = select(...)
  for event in events

    if event.type == READABLE:
      if event.fd == listen.fd:
        fd = accept(listen.fd, ...
      else:
        read(event.fd, ...
        # handle request
        # start write response
        write(event.fd,...

    elif event.type == WRITEABLE:
      write(event.fd,...
...  

Submission Queues and Completion Queues

img Source: https://medium.com/nttlabs/rust-async-with-io-uring-db3fa2642dd4

io_uring (and especially liburing) feels like the asynchronous paradigm, of issuing non-blocking I/O calls, but the key difference being instead of making the calls directly, I/O syscalls are “submitted” to a “queue”, and the application can block pending “completion” of these requested syscalls.

A TCP server w/ io_uring (liburing) might look like

# prime the pump with a first submission
sqe = io_uring_get_sqe(&ring)
io_uring_prep_accept(sqe, ...
io_uring_submit(ring)

# wait forever for completion queue entries (cqe)
while (io_uring_wait_cqe(&ring, cqe,... 

  if cqe.type == ACCEPT:
    # submit read request
    sqe = io_uring_get_sqe(&ring)
    io_uring_prep_readv(sqe, ...
    io_uring_submit(ring)

    # resubmit accept
    sqe = io_uring_get_sqe(&ring)
    io_uring_prep_accept(sqe, ...
    io_uring_submit(ring)

  if cqe.type == READ:
    # handle request
    # submit write request (for response)
    sqe = io_uring_get_sqe(&ring)
    io_uring_prep_writev(sqe, ...
    io_uring_submit(ring)
  ...

  # mark entry as seen -return for reuse in ring buffer
  io_uring_cqe_seen(&ring, cqe);

The application chains submissions, and in some cases resubmits (for accept) to the submission queue and waits for any completed/failed calls from the completion queue.

Splicing w/ io_uring

According to Jens Axboe author of both splice(2) and io_uring (and liburing) with regard to sendfile(2) support for io_uring:

“As soon as the splice stuff is integrated, you’ll have just that. When I initially wrote splice, at the same time I turned sendfile() into a simple wrapper around it. So if you have splice, you have sendfile as well.”

splice “moves data between a file descriptor and a pipe without a round trip to user space”. In order to copy data between a file (file descriptor) and a client connection, we’ll need a pipe(2) in between.

The implementation in psuedo code could be:

# create pipe
pipe(&mypipe)

# splice from file to the "write end of the pipe" [1]
splice(file_fd, mypipe[1], file_size)

# splice from the "read end of the pipe" [0] to client connection
splice(mypipe[0], conn_fd, file_size)

I implemented a basic blocking version of this in my code, but it’s possible to get non-blocking behavior with splice with the flag SPLICE_F_NONBLOCK. I think this might require O_NONBLOCK to be specified on the pipe descriptors as well.

Coalescing and Chaining

One of the performance advantages from using io_uring comes from being able to submit multiple syscalls for a given single io_uring_enter, meaning the user application process makes fewer context switches.

For the HTTP file server, I’d like to, in a single submission, send the response headers and kick off the sending of the body data from a file on disk. The caveat with this approach of submitting multiple calls is that they’re not guaranteed by the API to run in order by default. This wouldn’t be great in this case to send body data prior to the response headers…

To enforce ordering between multiple calls set the IOSQE_IO_LINK flag in the submission queue entry per entry to chain submissions together until the last entry in the chain.

To submit 3 chained calls:

# first submission
sqe = io_uring_get_sqe(ring);
io_uring_prep_xxx(sqe, ...
sqe->flags |= IOSQE_IO_LINK;
...
# second submission
sqe = io_uring_get_sqe(ring);
io_uring_prep_xxx(sqe, ...
sqe->flags |= IOSQE_IO_LINK;
...
# third submission
sqe = io_uring_get_sqe(ring);
io_uring_prep_xxx(sqe, ...
# do NOT set flag here -since end of chain
...
# submit all previous entries
io_uring_submit(ring);

Running

Running the file server and curl ing:

# >curl localhost:12345/index.html -v
~/gproj/experiments/hignx>./hignx_uring --debug
hignx_uring.c:main.432: accept(5)
hignx_uring.c:main.611: read(89)
hignx_uring.c:main.478: statx(0)
hignx_uring.c:main.514: open(6)
hignx_uring.c:main.674: send(142)
hignx_uring.c:main.694: splice(361)
hignx_uring.c:main.694: splice(361)
hignx_uring.c:main.611: read(0)
hignx_uring.c:main.714: close(0)

Almost but not quite…

It’s almost possible to write an HTTP server with sendfile behavior using just io_uring with the current single exception of pipe2, to create a pipe between the file object and the client connection object.

# stracing server serving client requests
# >curl localhost:12345/index.html
>strace ./hignx_uring
...
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
pipe2([58, 59], 0)                      = 0
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
...
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
pipe2([61, 62], 0)                      = 0
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
io_uring_enter(4, 1, 0, 0, NULL, 8)     = 1
...

Aside from the direct pipe2 call, a server written like this could be pretty efficient, and cut down context switching if calls can be coalesced.

Proper high performance usage of io_uring (and liburing) appears to still be tricky, especially in the context of reactors and using timers/timeouts w/ io_uring_submit_and_wait_timeout.

See issues:


Future Directions

pipe2

The HTTP server example above almost avoids calling syscalls directly with the exception of pipe2 to create the pipe necessary to splice data between the file fd and the client connection fd. io_uring support for creating system IPC resources seems to be on it’s way with the recent addition of IORING_OP_SOCKET (ie socket(2) support). So pipe2 could be forthcoming, or better yet even, native sendfile support?..

kTLS

In reality, IPC with anything other than localhost would probably require a layer of security, ie ipsec, TLS etc. In the example of a network proxy or an HTTP file server, to be efficient about copying with fewer context switches, Kernel TLS Offload (kTLS) could be used to encrypt/decrypt directly in the kernel. The kLoop project is an example of a project using both io_uring and kTLS

Link to code: https://github.com/tinselcity/experiments/blob/master/hignx/hignx_uring.c

Thanks so much to Jacky Yin for their post clarifying splice usage with io_uring

References

Written on April 24, 2023