Seamless file descriptor transfer between processes with pidfd and pidfd_getfd

A while ago, I wrote about how file descriptors can be transferred over Unix Domain Sockets between processes when a parent-child relationship doesn’t exist between the two. One use case for file descriptor transfer between processes is the deployment of network proxies that handle ingress traffic. However, the APIs offered by the kernel for file descriptor transfer between processes have been awkward to use and riddled with gotchas.

On newer versions of Linux (5.6 and above), a far better API exists to achieve the aforementioned goal.

A Primer on Processes

A running instance of a program is called a process. Processes are referred to using a process ID (pid), which is an arbitrary number chosen by the kernel, usually limited to 32768, though on some distros the maximum is 2²². The init process (launchd on macOS, systemd on some of the more recent Linux distros) is assigned pid 1.
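To see which limit applies on a given machine, the value can be read from procfs. The snippet below is a minimal sketch, assuming a Linux system that exposes /proc/sys/kernel/pid_max; the helper name read_pid_max is made up for illustration.

```python
from pathlib import Path

# Linux-only procfs entry; the value is the point at which pids wrap around,
# i.e. one greater than the largest pid the kernel will hand out.
PID_MAX_PATH = Path("/proc/sys/kernel/pid_max")


def read_pid_max(path=PID_MAX_PATH):
    """Return the kernel's pid wrap-around value, or None if it can't be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None  # not Linux, procfs not mounted, or unexpected content


limit = read_pid_max()
if limit is not None:
    print(f"pids wrap around at {limit}")
```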

The Problem with PIDs

One of the problems with process IDs is that they aren’t unique over time: once a process is gone, the kernel is free to hand its pid to a new one.

Let’s assume there’s a process X with pid 19448. Let’s also assume that there exists another process in the system, process Y, that is communicating with process X by referring to its pid (such as signalling pid 19448).

If now process X were to terminate, the same pid 19448 might be reissued by the kernel to another, newer process Z. This is called pid recycling.

At this point, if process Y signals pid 19448, it’s process Z that’ll get the signal, not process X that was initially assigned pid 19448.
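The scenario above can be sketched in a few lines. This is only an illustration, assuming a Unix system with fork; pid_is_taken and spawn_and_reap are hypothetical helper names. The point is that a bare pid can only answer "does some process own this number right now?", never "is this still the process I meant?".

```python
import os


def pid_is_taken(pid):
    """Return True if *some* process currently owns `pid`.

    Signal 0 delivers nothing; the kernel only runs its existence and
    permission checks.
    """
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # nobody owns this pid at the moment
    except PermissionError:
        return True   # the pid exists but belongs to someone we may not signal
    return True


def spawn_and_reap():
    """Fork a child that exits at once, reap it, and return its now-free pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)      # child: exit immediately
    os.waitpid(pid, 0)   # parent: reap, after which the pid may be reissued
    return pid
```

Right after spawn_and_reap() returns, pid_is_taken() on the stale pid will usually report False, but nothing stops the kernel from reissuing that number later, at which point the same call reports True for an unrelated process.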

This problem isn’t limited to signals. It applies to any API or system call that works with pids. Common examples include kill, ptrace, and more.

It’s probably worth mentioning here that this is a solved problem in other operating systems, most notably FreeBSD with procdesc.

What is pidfd?

Unlike a process ID, which is just an integer assigned by the kernel, a pidfd is a persistent file descriptor that refers to another process. As with all file descriptors, pidfds are private to the process that requested them.

The pidfd_open system call allows process Y to get a file descriptor referring to process X. Another way to get this file descriptor is from /proc/<pid>, by opening that directory. Yet another way to get a pidfd is by setting the CLONE_PIDFD flag on the clone system call.
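A minimal sketch of the first two routes, using Python's os module as a stand-in for the raw system calls. It assumes Python 3.9+ (for os.pidfd_open) on Linux 5.3+; open_pidfd and open_proc_dirfd are illustrative names, and both simply return None when the route isn't available.

```python
import os


def open_pidfd(pid):
    """Return a pidfd for `pid` via pidfd_open(2), or None if unavailable."""
    if not hasattr(os, "pidfd_open"):  # Python older than 3.9, or not Linux
        return None
    try:
        return os.pidfd_open(pid)
    except OSError:
        return None  # e.g. no such process, invalid pid, or kernel too old


def open_proc_dirfd(pid):
    """Older route: a descriptor for the /proc/<pid> directory itself."""
    try:
        return os.open(f"/proc/{pid}", os.O_RDONLY | os.O_DIRECTORY)
    except OSError:
        return None  # procfs not mounted or the process is gone


fd = open_pidfd(os.getpid())
if fd is not None:
    print("pidfd for this process:", fd)
    os.close(fd)
```

The CLONE_PIDFD route isn't shown because the Python standard library doesn't expose clone directly.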

Once process Y has a pidfd referring to process X, it can use the pidfd_send_signal system call to send a signal to process X. If process X has already terminated, the call will fail with the error ESRCH.
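Here is a rough sketch of that behaviour, assuming Python 3.9+ on a kernel that supports both pidfd_open and pidfd_send_signal. The function names are invented for the example, and a forked child stands in for process X. In this sketch ESRCH shows up once the child has been reaped; Python surfaces it as ProcessLookupError.

```python
import os
import signal


def signal_via_pidfd(pidfd, sig):
    """Send `sig` through a pidfd; return False if the process is already gone."""
    try:
        signal.pidfd_send_signal(pidfd, sig)
    except ProcessLookupError:  # errno ESRCH
        return False
    return True


def demo():
    """Return (sent_while_alive, sent_after_reap), or None if unsupported."""
    if not (hasattr(os, "pidfd_open") and hasattr(signal, "pidfd_send_signal")):
        return None
    child = os.fork()
    if child == 0:
        try:
            while True:
                signal.pause()  # child plays "process X": idle until killed
        finally:
            os._exit(1)
    try:
        pidfd = os.pidfd_open(child)
    except OSError:
        os.kill(child, signal.SIGKILL)  # safe here: the child is not yet reaped
        os.waitpid(child, 0)
        return None
    try:
        first = signal_via_pidfd(pidfd, signal.SIGKILL)
        os.waitpid(child, 0)  # reap the child; it is now truly gone
        second = signal_via_pidfd(pidfd, signal.SIGKILL)
        return first, second
    finally:
        os.close(pidfd)
```

On a system where both calls are available, demo() should come back as (True, False): the second attempt can't hit a recycled pid, because the pidfd still refers to the original, now-dead process.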

pidfd_getfd

So, a process can get a file descriptor referring to another process with the pidfd_open system call. But this alone doesn’t solve the file descriptor transfer problem, where one process needs to transfer its file descriptors over to another process.

Now, while in theory /proc/<pid>/fd lists all the files a process has open, entries for pipes, sockets, and other objects that do not appear in the filesystem hierarchy are only placeholder links of the form type:[inode], and a socket, for one, cannot be reopened through such a link.

In 2020, with Linux 5.6, a new system call, pidfd_getfd, was added that enables a process to obtain a duplicate of a file descriptor of another process referred to by a pidfd. Both the file descriptor and its duplicate share the file status flags and file offset. This applies to all kinds of files, including sockets. Operations on the underlying socket, such as accepting connections or sending and receiving data, can be performed via the duplicate file descriptor.
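To make this concrete, here is a hedged sketch in Python. The standard library has no wrapper for pidfd_getfd, so the example goes through libc's generic syscall() with ctypes. The syscall number 438 is an assumption that holds on x86-64, arm64 and most other architectures, but not all. The demo and its helper names are illustrative: a forked child creates a socket pair the parent has never seen, and the parent pulls one end out of the child.

```python
import ctypes
import os
import signal
import socket

SYS_PIDFD_GETFD = 438  # assumed syscall number; check your architecture
_libc = ctypes.CDLL(None, use_errno=True)
_libc.syscall.restype = ctypes.c_long


def pidfd_getfd(pidfd, target_fd, flags=0):
    """Duplicate `target_fd` of the process behind `pidfd` into this process."""
    fd = _libc.syscall(ctypes.c_long(SYS_PIDFD_GETFD), ctypes.c_long(pidfd),
                       ctypes.c_long(target_fd), ctypes.c_long(flags))
    if fd < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return fd


def demo():
    """Read from a socket created inside a child; return the bytes, or None."""
    if not hasattr(os, "pidfd_open"):
        return None
    report_r, report_w = os.pipe()
    child = os.fork()
    if child == 0:
        try:
            os.close(report_r)
            a, b = socket.socketpair()  # created after fork: parent has no copy
            a.sendall(b"hello")         # queue data on the end the parent will take
            os.write(report_w, str(b.fileno()).encode())
            os.close(report_w)
            while True:
                signal.pause()          # keep the sockets open until killed
        finally:
            os._exit(1)
    os.close(report_w)
    try:
        remote_fd = int(os.read(report_r, 32))  # fd number as seen by the child
        pidfd = os.pidfd_open(child)
        try:
            local_fd = pidfd_getfd(pidfd, remote_fd)
        finally:
            os.close(pidfd)
        with socket.socket(fileno=local_fd) as dup:
            return dup.recv(5)          # same socket object, so same queued data
    except (OSError, ValueError):
        return None  # old kernel, seccomp filter, or permission check refused
    finally:
        os.close(report_r)
        os.kill(child, signal.SIGKILL)
        os.waitpid(child, 0)
```

Where the call is permitted, demo() should return the bytes the child queued, read through the parent's duplicate. The only thing exchanged between the two processes is the descriptor number; the descriptor itself never travels over a Unix Domain Socket.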

Effectively, this single system call obviates the incredibly unintuitive and error-prone APIs for file descriptor transfer between processes over Unix Domain Sockets as described in my previous post.

The calling process must have the ability to call ptrace (or to be more specific, pass the PTRACE_MODE_ATTACH_REALCREDS access mode check, which governs the permission to read from or write to another process) on the target process from which it wants to get duplicate copies of file descriptors.

For another, more security-focused use case of pidfd and pidfd_getfd, the post Seccomp Notify — New Frontiers in Unprivileged Container Development by Christian Brauner makes for really fun and informative reading.

@copyconstruct on Twitter. Views expressed on this blog are solely mine, not those of present or past employers.