File Descriptor Transfer over Unix Domain Sockets
Update 12/31/2020: If you’re on a newer kernel (Linux 5.6+), much of this complexity has been obviated with the introduction of a new system call, pidfd_getfd. Please see the post Seamless File Descriptor Transfer Between Processes with pidfd and pidfd_getfd, published on 12/31/2020, for more details.
Yesterday, I read a phenomenal paper on how Facebook achieves disruption-free releases of services that speak different protocols and serve different types of requests (long-lived TCP/UDP sessions, requests involving huge chunks of data, etc.).
One of the techniques used by Facebook is what they call “Socket Takeover”.
Socket Takeover enables Zero Downtime Restarts for Proxygen by spinning up an updated instance in parallel that takes over the listening sockets, whereas the old instance goes into graceful draining phase. The new instance assumes the responsibility of serving the new connections and responding to health-check probes from the L4LB Katran. Old connections are served by the older instance until the end of draining period, after which other mechanism (e.g., Downstream Connection Reuse) kicks in.
As we pass an open FD from the old process to the newly spun one, both the passing and the receiving process share the same file table entry for the listening socket and handle separate accepted connections on which they serve connection level transactions. We leverage the following Linux kernel features to achieve this:
CMSG: A feature in sendmsg() allows sending control messages between local processes (commonly referred to as ancillary data). During the restart of L7LB processes, we use this mechanism to send the set of FDs for all active listening sockets for each VIP (Virtual IP of service) from the active instance to the newly spun instance. This data is exchanged using sendmsg and recvmsg over a UNIX domain socket.

SCM_RIGHTS: We set this option to send open FDs with the data portion containing an integer array of the open FDs. On the receiving side, these FDs behave as though they have been created with dup(2).
I got a number of responses on Twitter from folks expressing astonishment that this is even possible. Indeed, if you’re not very familiar with some of the features of Unix domain sockets, the aforementioned paragraph from the paper might be pretty inscrutable.
Transferring TCP sockets over a Unix domain socket is, actually, a tried and tested method to implement “hot restarts” or “zero downtime restarts”. Popular proxies like HAProxy and Envoy use very similar mechanisms to drain connections from one instance of the proxy to another without dropping any connections. However, many of these features are not very widely known.
In this post, I want to explore some of the features of Unix domain sockets that make it a suitable candidate for several of these use-cases, especially transferring a socket (or any file descriptor, for that matter) from one process to another where a parent-child relationship doesn’t necessarily exist between the two processes.
Unix Domain Sockets
It’s commonly known that Unix domain sockets allow communication between processes on the same host system. Unix domain sockets are used in many popular systems: HAProxy, Envoy, AWS’s Firecracker virtual machine monitor, Kubernetes, Docker and Istio to name a few.
UDS: A Brief Primer
Like network sockets, Unix domain sockets support both stream and datagram socket types. However, unlike network sockets, which take an IP address and a port as the address, a Unix domain socket address takes the form of a pathname. Also unlike network sockets, I/O across Unix domain sockets does not involve operations on the underlying device, which makes Unix domain sockets a lot faster than network sockets for performing IPC on the same host.
Binding a name to a Unix domain socket with bind(2) creates a socket file named pathname in the filesystem. However, this file is different from any normal file you might create.
A simple Go program to create an “echo server” listening on a Unix domain socket might look like the following (a minimal sketch, using the path /tmp/uds.sock seen in the shell session below):
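package main

import (
	"io"
	"log"
	"net"
	"os"
)

func main() {
	const sockPath = "/tmp/uds.sock"

	// bind(2) fails if the pathname already exists, so remove any
	// stale socket file left over from a previous run.
	os.Remove(sockPath)

	l, err := net.Listen("unix", sockPath)
	if err != nil {
		log.Fatal(err)
	}
	defer l.Close()

	for {
		conn, err := l.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// Echo everything the client sends back to it.
		go func(c net.Conn) {
			defer c.Close()
			io.Copy(c, c)
		}(conn)
	}
}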
If you build and run this program, a couple of interesting facts can be observed.
Socket Files != Normal Files
First, the socket file /tmp/uds.sock is marked as a socket. When stat() is applied to this pathname, it returns the value S_IFSOCK in the file-type component of the st_mode field of the stat structure.

When listed with ls -l, a Unix domain socket is shown with the type s in the first column, whereas ls -F appends an equals sign (=) to the socket pathname.
root@1fd53621847b:~/uds# ./uds
^C
root@1fd53621847b:~/uds# ls -ls /tmp
total 0
0 srwxr-xr-x 1 root root 0 Aug  5 01:45 uds.sock
root@1fd53621847b:~/uds# stat /tmp/uds.sock
  File: /tmp/uds.sock
  Size: 0          Blocks: 0          IO Block: 4096   socket
Device: 71h/113d   Inode: 1835567    Links: 1
Access: (0755/srwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-08-05 01:45:41.650709000 +0000
Modify: 2020-08-05 01:45:41.650709000 +0000
Change: 2020-08-05 01:45:41.650709000 +0000
 Birth: -
root@5247072fc542:~/uds# ls -F /tmp
uds.sock=
root@5247072fc542:~/uds#
Normal system calls that work on files don’t work on socket files: a socket pathname cannot be opened with open(), so generic file I/O calls like read() and write() cannot be applied to it. Instead, socket-specific system calls like socket(), bind(), recv(), sendmsg(), recvmsg(), etc. are used to work with Unix domain sockets.
Another interesting fact about the socket file is that it is not removed when the socket is closed; rather, it has to be removed explicitly by calling:
unlink(2) on MacOS
remove() or, more commonly, unlink(2) on Linux
On Linux, a Unix domain socket address is represented by the following structure:
struct sockaddr_un {
sa_family_t sun_family; /* Always AF_UNIX */
char sun_path[108]; /* Pathname */
};
On MacOS, the address structure is as follows:
struct sockaddr_un {
u_char sun_len;
u_char sun_family;
char sun_path[104];
};
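In Go, the golang.org/x/sys/unix package wraps this structure as unix.SockaddrUnix, where Name corresponds to sun_path. A minimal sketch of the socket → bind → listen sequence at this level (the path and the backlog of 128 are arbitrary choices):

package main

import "golang.org/x/sys/unix"

func main() {
	// unix.SockaddrUnix corresponds to struct sockaddr_un above.
	fd, err := unix.Socket(unix.AF_UNIX, unix.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	if err := unix.Bind(fd, &unix.SockaddrUnix{Name: "/tmp/uds.sock"}); err != nil {
		panic(err)
	}
	if err := unix.Listen(fd, 128); err != nil {
		panic(err)
	}
}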
bind(2) will fail when trying to bind to an existing path
The SO_REUSEPORT option allows multiple network sockets on a given host to bind to the same address and port. Every socket that wants to share the port, including the very first one to bind, must set the SO_REUSEPORT option before calling bind().

Support for SO_REUSEPORT was introduced in Linux 3.9. However, on Linux, all sockets that want to share the same address and port combination must belong to processes that share the same effective UID.
int sfd = socket(domain, socktype, 0);

/* Set SO_REUSEPORT before binding to the shared address/port. */
int optval = 1;
setsockopt(sfd, SOL_SOCKET, SO_REUSEPORT, &optval, sizeof(optval));

bind(sfd, (struct sockaddr *) &addr, addrlen);
However, it’s not possible for two Unix domain sockets to bind to the same path: the second bind(2) fails with EADDRINUSE.
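A quick way to see this from Go (a hedged sketch; the socket path is an arbitrary choice):

package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	const path = "/tmp/bind-demo.sock"
	os.Remove(path)

	l1, err := net.Listen("unix", path)
	if err != nil {
		panic(err)
	}
	defer l1.Close()

	// A second bind to the same pathname fails while the file exists.
	_, err = net.Listen("unix", path)
	fmt.Println(err) // e.g. "listen unix /tmp/bind-demo.sock: bind: address already in use"
}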
SOCKETPAIR(2)
The socketpair() function creates two sockets that are then connected together. In a manner of speaking, this is very similar to pipe, except that it supports bidirectional transfer of data.

socketpair only works with Unix domain sockets. It returns two file descriptors which are already connected to one another, so one doesn’t have to do the whole socket → bind → listen → accept dance to set up a listening socket, nor the socket → connect dance to create a client to the listening socket, before beginning to transfer data!
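Here’s a minimal Go sketch of socketpair in action, using the golang.org/x/sys/unix wrappers (the "ping" payload is arbitrary):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Two already-connected stream sockets; no bind/listen/accept
	// or connect dance required.
	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fds[0])
	defer unix.Close(fds[1])

	// Unlike a pipe, data flows in both directions.
	if _, err := unix.Write(fds[0], []byte("ping")); err != nil {
		panic(err)
	}
	buf := make([]byte, 4)
	n, err := unix.Read(fds[1], buf)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(buf[:n])) // prints "ping"
}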
Data Transfer over UDS
Now that we’ve established that a Unix domain socket allows communication between two processes on the same host, it’s time to explore what kind of data can be transferred over a Unix domain socket.
Since a Unix domain socket is similar to network sockets in many respects, any data that one might usually send over a network socket can be sent over a Unix domain socket.
Furthermore, the special system calls sendmsg and recvmsg allow sending a special message across the Unix domain socket. This message is handled specially by the kernel, which allows passing open file descriptions from the sender to the receiver.
File Descriptors vs File Description
Note that I mentioned file descripTION and not file descripTOR. The difference between the two is subtle and isn’t often well understood.
A file descriptor really is just a per-process pointer to an underlying kernel data structure called (confusingly) the file description. The kernel maintains a system-wide table of all open file descriptions, called the open file table. Two processes (A and B) can hold separate file descriptors that point to the same file description in the open file table; this happens when one descriptor is derived from the other, for example via fork(), dup(2), or the descriptor passing described below, rather than via two independent open(2) calls.

So “sending a file descriptor” from one Unix domain socket to another with sendmsg() really just means sending a reference to the file description. If process A were to send file descriptor 0 (fd0) to process B, the descriptor might very well be numbered 3 (fd3) in process B. Both will, however, refer to the same file description.
The sending process calls sendmsg to send the descriptor across the Unix domain socket. The receiving process calls recvmsg to receive the descriptor on the Unix domain socket.
Even if the sending process closes its file descriptor referencing the file description being passed via sendmsg before the receiving process calls recvmsg, the file description remains open for the receiving process. Sending a descriptor increments the description’s reference count by one; the kernel only removes a file description from its open file table when the reference count drops to 0.
sendmsg and recvmsg
The signature for the sendmsg system call on Linux is the following:
ssize_t sendmsg(
int socket,
const struct msghdr *message,
int flags
);
The counterpart of sendmsg is recvmsg:
ssize_t recvmsg(
int sockfd,
struct msghdr *msg,
int flags
);
The special “message” that one can transfer with sendmsg over a Unix domain socket is specified by the msghdr structure. The process that wishes to send a file description over to another process creates a msghdr structure containing the descriptors to be passed.
struct msghdr {
void *msg_name; /* optional address */
socklen_t msg_namelen; /* size of address */
struct iovec *msg_iov; /* scatter/gather array */
int msg_iovlen; /* # elements in msg_iov */
void *msg_control; /* ancillary data, see below */
socklen_t msg_controllen; /* ancillary data buffer len */
int msg_flags; /* flags on received message */
};
The msg_control member of the msghdr structure, which has length msg_controllen, points to a buffer of messages of the form:
struct cmsghdr {
    socklen_t cmsg_len;   /* data byte count, including header */
    int cmsg_level;       /* originating protocol */
    int cmsg_type;        /* protocol-specific type */
    /* followed by */
    unsigned char cmsg_data[];
};
In POSIX, a buffer of struct cmsghdr structures with appended data is called ancillary data. On Linux, the maximum buffer size allowed per socket can be set by modifying /proc/sys/net/core/optmem_max.
Ancillary Data Transfer
While there are a plethora of gotchas with such data transfer, when used correctly, it can be a pretty powerful mechanism to achieve a number of goals.
On Linux, there are three such types of “ancillary data” that can be shared between two Unix domain sockets:
SCM_RIGHTS
SCM_CREDENTIALS
SCM_SECURITY
All three forms of ancillary data should only be accessed using the macros described below and never directly.
struct cmsghdr *CMSG_FIRSTHDR(struct msghdr *msgh);
struct cmsghdr *CMSG_NXTHDR(struct msghdr *msgh, struct cmsghdr *cmsg);
size_t CMSG_ALIGN(size_t length);
size_t CMSG_SPACE(size_t length);
size_t CMSG_LEN(size_t length);
unsigned char *CMSG_DATA(struct cmsghdr *cmsg);
While I’ve never had a need to use the latter two, SCM_RIGHTS is what I hope to explore more in this post.
SCM_RIGHTS
SCM_RIGHTS allows a process to send or receive a set of open file descriptors from another process using sendmsg.
The cmsg_data component of the cmsghdr structure can contain an array of the file descriptors that a process wants to send to another.
struct cmsghdr {
    socklen_t cmsg_len;   /* data byte count, including header */
    int cmsg_level;       /* originating protocol */
    int cmsg_type;        /* protocol-specific type */
    /* followed by */
    unsigned char cmsg_data[];
};
The receiving process uses recvmsg to receive the data.
The book The Linux Programming Interface has a good programmatic guide on how to use sendmsg and recvmsg.
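To make this concrete, here’s a sketch in Go of both sides of the exchange, using the golang.org/x/sys/unix wrappers. The helper names sendFD and recvFD are mine, and sock is assumed to be a connected Unix domain socket descriptor:

package fdpass

import "golang.org/x/sys/unix"

// sendFD passes one open file descriptor over the connected Unix
// domain socket sock. A single byte of "real" data accompanies the
// ancillary payload (see the gotchas below).
func sendFD(sock int, fd int) error {
	rights := unix.UnixRights(fd) // builds the SCM_RIGHTS control message
	return unix.Sendmsg(sock, []byte{0}, rights, nil, 0)
}

// recvFD receives a single file descriptor from sock. The kernel
// installs it into this process's descriptor table, typically under
// a different number than it had in the sender.
func recvFD(sock int) (int, error) {
	buf := make([]byte, 1)                 // for the one byte of real data
	oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit descriptor
	_, oobn, _, _, err := unix.Recvmsg(sock, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	return fds[0], nil
}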
SCM_RIGHTS Gotchas
As mentioned, there are a number of gotchas when trying to pass ancillary data over Unix domain sockets.
Need to send some “real” data along with the ancillary message
On Linux, at least one byte of “real data” is required to successfully send ancillary data over a Unix domain stream socket.
However, when sending ancillary data over a Unix domain datagram socket on Linux, it is not necessary to send any accompanying real data. That said, portable applications should also include at least one byte of real data when sending ancillary data over a datagram socket.
File Descriptors can be dropped
If the control-message buffer (msg_control) used to receive the ancillary data containing the file descriptors is too small (or is absent), then the ancillary data is truncated (or discarded) and the excess file descriptors are automatically closed in the receiving process.

If the number of file descriptors received in the ancillary data would cause the process to exceed its RLIMIT_NOFILE resource limit, the excess file descriptors are automatically closed in the receiving process. One cannot split the list over multiple recvmsg calls.
recvmsg quirks
sendmsg and recvmsg act similarly to the send and recv system calls, in that there isn’t a 1:1 mapping between every send call and every recv call.
A single recvmsg call can read data from multiple sendmsg calls. Likewise, it can take multiple recvmsg calls to consume the data sent over a single sendmsg call. This has serious and surprising implications, some of which have been reported here.
Limit on the number of File Descriptors
The kernel constant SCM_MAX_FD (253, or 255 in kernels before 2.6.38) defines a limit on the number of file descriptors in the array. Attempting to send an array larger than this limit causes sendmsg to fail with the error EINVAL.
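If you need to move more descriptors than that, one approach is to chunk the array across multiple sendmsg calls. A hedged sketch, reusing the golang.org/x/sys/unix wrappers from the earlier example (sendFDs and the scmMaxFD constant are illustrative names of mine):

import "golang.org/x/sys/unix"

// SCM_MAX_FD on modern kernels; descriptors are sent in batches of
// at most this many per sendmsg call.
const scmMaxFD = 253

// sendFDs sends an arbitrarily long list of descriptors over sock.
func sendFDs(sock int, fds []int) error {
	for len(fds) > 0 {
		n := len(fds)
		if n > scmMaxFD {
			n = scmMaxFD
		}
		rights := unix.UnixRights(fds[:n]...)
		if err := unix.Sendmsg(sock, []byte{0}, rights, nil, 0); err != nil {
			return err
		}
		fds = fds[n:]
	}
	return nil
}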
When is it useful to transfer file descriptors?
A very concrete real-world use case for this technique is zero-downtime proxy reloads.
Anyone who’s ever had to work with HAProxy can attest that “zero downtime config reloads” wasn’t really a thing for a long time. Often, a plethora of Rube Goldberg-esque hacks were used to achieve this.
In late 2017, HAProxy 1.8 shipped with support for hitless reloads achieved by transferring the listening socket file descriptors from the old HAProxy process to the new one. Envoy uses a similar mechanism for hot restarts where file descriptors are passed over a Unix domain socket.
In late 2018, Cloudflare blogged about its use of transferring file descriptors from nginx to a Go TLS 1.3 proxy.
The paper on how Facebook achieves zero downtime releases that prompted me to write this entire blog post uses the selfsame CMSG + SCM_RIGHTS trick to pass live file descriptors from the draining process to the newly released process.
Conclusion
Transferring file descriptors over a Unix domain socket can prove to be very powerful if used correctly. I hope this post gave you a slightly better understanding of Unix domain sockets and the features they enable.
References:
- https://www.man7.org/linux/man-pages/man7/unix.7.html
- https://blog.cloudflare.com/know-your-scm_rights/
- LWN.net has an interesting article on creating cycles when passing file descriptions over a Unix domain socket and implications for the fabulous new io_uring kernel API. https://lwn.net/Articles/779472/
- The Linux Programming Interface https://learning.oreilly.com/library/view/the-linux-programming/9781593272203/
- UNIX Network Programming: The Sockets Networking API https://learning.oreilly.com/library/view/the-sockets-networking/0131411551/