Seamless file descriptor transfer between processes with pidfd and pidfd_getfd

A while ago, I wrote about how file descriptors can be transferred over Unix Domain Sockets between processes when no parent-child relationship exists between them. One use case for this kind of transfer is the deployment of network proxies that handle ingress traffic. However, the APIs the kernel offers for file descriptor transfer between processes have been awkward to use and riddled with gotchas.

On newer versions of Linux (5.6 and above), a far better API exists to achieve the aforementioned goal.

A Primer on Processes

The Problem with PIDs

Let’s assume there’s a process X with pid 19448. Let’s also assume that there exists another process in the system, process Y, that is communicating with process X by referring to its pid (such as signalling pid 19448).

If process X were now to terminate, the same pid 19448 might be reissued by the kernel to another, newer process Z. This is called pid recycling.

At this point, if process Y signals pid 19448, it’s process Z that’ll get the signal, not process X that was initially assigned pid 19448.

This problem isn’t limited to signals. It applies to any API or system call that identifies a process by its pid. Common examples include the kill system call and command-line tools built on it, such as pkill.

It’s probably worth mentioning here that this is a solved problem in other operating systems, most notably FreeBSD with procdesc.

What is pidfd?

The pidfd_open system call allows process Y to get a file descriptor, called a pidfd, that refers to process X. Another way to get such a file descriptor is to open the /proc/[pid] directory of process X. Yet another way is to set the CLONE_PIDFD flag on the clone(2) system call, which hands the parent a pidfd for the newly created child.

Once process Y has a pidfd referring to process X, it can use the pidfd_send_signal system call to send a signal to process X. If process X has already terminated, the pidfd_send_signal call will fail with the error ESRCH.

pidfd_getfd

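Now, while /proc/[pid]/fd lists the file descriptors a process has open, the entries for pipes (S_IFIFO), sockets (S_IFSOCK), and other objects that do not appear in the filesystem hierarchy are symbolic links of the form pipe:[inode] or socket:[inode] rather than filesystem paths. A socket in particular cannot be obtained by opening its entry.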

In 2020, Linux 5.6 added the pidfd_getfd system call, which enables a process to obtain a duplicate of a file descriptor of another process, with that other process referred to by a pidfd. The file descriptor and its duplicate refer to the same open file description, so they share the file status flags and file offset. This applies to all kinds of files, including sockets. Operations on the socket (such as bind(), recv(), sendmsg(), recvmsg()) can be performed via the duplicate file descriptor.

Effectively, this single system call obviates the incredibly unintuitive and error-prone APIs for file descriptor transfer between processes over Unix Domain Sockets as described in my previous post.

The calling process must be permitted to ptrace the target process from which it wants to duplicate file descriptors. To be more specific, it must pass the PTRACE_MODE_ATTACH_REALCREDS access mode check, which governs the permission to read from or write to another process.

For another, more security focused use case of pidfd and pidfd_getfd, the post Seccomp Notify — New Frontiers in Unprivileged Container Development by Christian Brauner makes for really fun and informative reading.

@copyconstruct on Twitter. Views expressed on this blog are solely mine, not those of present or past employers.