The method to epoll’s madness

The syntax of epoll

Unlike poll, epoll itself is not a system call. It’s a kernel data structure that allows a process to multiplex I/O on multiple file descriptors.

1) epoll_create

An epoll instance is created with the epoll_create system call (or its newer variant, epoll_create1), which returns a file descriptor referring to the instance. The signatures are as follows:

#include <sys/epoll.h>
int epoll_create(int size);
int epoll_create1(int flags);
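As a minimal sketch of the two calls above: the helper names open_epoll_legacy and open_epoll are hypothetical, chosen only for illustration. Both return the new epoll file descriptor, or -1 with errno set.

```c
#include <assert.h>
#include <sys/epoll.h>

/* Legacy form. The size argument has been ignored since Linux 2.6.8,
   but it must still be greater than zero or the call fails with EINVAL. */
int open_epoll_legacy(int size_hint)
{
    assert(size_hint > 0);
    return epoll_create(size_hint);
}

/* Newer form. EPOLL_CLOEXEC sets the close-on-exec flag on the returned
   descriptor; passing 0 instead behaves like epoll_create. */
int open_epoll(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}
```

Either descriptor is released with close() like any other file descriptor once it is no longer needed.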

2) epoll_ctl

A process can add file descriptors it wants monitored to the epoll instance by calling epoll_ctl. All the file descriptors registered with an epoll instance are collectively called an epoll set or the interest list.

#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
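A rough sketch of registering and unregistering a descriptor might look like the following. The helpers watch_fd, unwatch_fd and demo_register are hypothetical names, and a pipe stands in for whatever descriptor a real program would monitor.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Add fd to the interest list of epfd for the given event mask. */
int watch_fd(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev = {0};

    assert(epfd >= 0 && fd >= 0);
    ev.events = events;   /* e.g. EPOLLIN, EPOLLOUT */
    ev.data.fd = fd;      /* handed back unchanged by epoll_wait */
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Remove fd from the interest list; the event argument is ignored for DEL. */
int unwatch_fd(int epfd, int fd)
{
    return epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
}

/* Registers the read end of a pipe, checks that a second ADD of the same
   descriptor is rejected with EEXIST, then removes it again.
   Returns 0 if every step behaved that way, -1 otherwise. */
int demo_register(void)
{
    int p[2];
    int epfd, ok;

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    ok = watch_fd(epfd, p[0], EPOLLIN) == 0;
    ok = ok && watch_fd(epfd, p[0], EPOLLIN) == -1 && errno == EEXIST;
    ok = ok && unwatch_fd(epfd, p[0]) == 0;

    close(epfd);
    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```

Storing the descriptor in ev.data.fd is one convention among several; data is a union, so a program can just as well store a pointer to its own per-connection state there.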

3) epoll_wait

A thread can be notified of events that happened on the interest list of an epoll instance by calling the epoll_wait system call, which blocks until any of the monitored descriptors becomes ready for I/O.

#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *evlist, int maxevents, int timeout);
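To make the call concrete, here is a small sketch under some assumptions: a pipe plays the role of the monitored descriptor, and demo_wait and MAX_EVENTS are hypothetical names. It waits once while nothing is readable, then again after one byte has been written.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

#define MAX_EVENTS 8

/* Returns 0 if epoll_wait timed out before the write and reported the
   pipe's read end as readable after it, -1 otherwise. */
int demo_wait(int timeout_ms)
{
    int p[2];
    int epfd, before, after, ok = 0;
    struct epoll_event ev = {0};
    struct epoll_event evlist[MAX_EVENTS];

    assert(timeout_ms >= 0);   /* -1 would block indefinitely in this demo */

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        /* Nothing has been written yet, so this call returns 0
           once the timeout expires. */
        before = epoll_wait(epfd, evlist, MAX_EVENTS, timeout_ms);

        if (write(p[1], "x", 1) == 1) {
            /* Now the read end is ready; evlist[0] describes it. */
            after = epoll_wait(epfd, evlist, MAX_EVENTS, timeout_ms);
            ok = before == 0 && after == 1 &&
                 evlist[0].data.fd == p[0] &&
                 (evlist[0].events & EPOLLIN) != 0;
        }
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return ok ? 0 : -1;
}
```

The return value of epoll_wait is the number of entries filled in evlist, so a caller only ever loops over descriptors that are actually ready. A timeout of -1 blocks until an event arrives, and 0 returns immediately.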

The gotchas of epoll

To fully understand the nuance behind epoll, it’s important to understand how file descriptors really work. This was explored in my previous post, but it’s worth restating here.

The bowels of epoll

Let us assume a process A has two open file descriptors, fd0 and fd1, each referring to its own open file description in the open file table. Let us also assume that these two file descriptions point to different inodes.

Why epoll is more performant than select and poll

As stated in the previous post, the cost of select/poll is O(N) in the number of descriptors being monitored. When N is very large (think of a web server handling tens of thousands of mostly sleepy clients), the kernel still has to scan every descriptor in the list on every call to select/poll, even if only a small number of events actually occurred.

int poll(struct pollfd *fds, nfds_t nfds, int timeout);
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);
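The difference can be seen from user space with a small experiment. This is only a sketch: demo_scan_cost and NPIPES are hypothetical names, pipes stand in for client sockets, and the counters measure how many array entries the caller has to look at, which mirrors (but does not measure) the scan the kernel performs inside poll.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 32

/* Monitors NPIPES pipes with both poll and epoll while only the last one
   is readable. Stores how many entries the caller inspected with each API.
   Returns 0 on success, -1 on any setup failure. */
int demo_scan_cost(int *poll_scanned, int *epoll_scanned)
{
    int pipes[NPIPES][2];
    struct pollfd pfds[NPIPES];
    struct epoll_event ev = {0}, evlist[NPIPES];
    int created = 0, found = 0, nready, rc = -1;
    int epfd;

    assert(poll_scanned != NULL && epoll_scanned != NULL);
    epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    for (int i = 0; i < NPIPES; i++) {
        if (pipe(pipes[i]) == -1)
            goto out;
        created++;
        pfds[i].fd = pipes[i][0];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
        ev.events = EPOLLIN;
        ev.data.fd = pipes[i][0];
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[i][0], &ev) == -1)
            goto out;
    }

    /* Only the last descriptor becomes readable. */
    if (write(pipes[NPIPES - 1][1], "x", 1) != 1)
        goto out;

    /* poll: the whole array is passed in on every call, and the caller
       walks it looking for entries whose revents field was set. */
    nready = poll(pfds, NPIPES, 0);
    if (nready != 1)
        goto out;
    *poll_scanned = 0;
    for (int i = 0; i < NPIPES && found < nready; i++) {
        (*poll_scanned)++;
        if (pfds[i].revents & POLLIN)
            found++;
    }

    /* epoll_wait: only the ready descriptors are copied back. */
    nready = epoll_wait(epfd, evlist, NPIPES, 0);
    if (nready < 0)
        goto out;
    *epoll_scanned = nready;
    rc = 0;

out:
    for (int i = 0; i < created; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    close(epfd);
    return rc;
}
```

With the ready descriptor placed last, the poll loop inspects all 32 entries while epoll_wait hands back a single one. The interest list was registered once up front with epoll_ctl rather than being passed in again on every wait.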

Edge triggered epoll

By default, epoll provides level-triggered notifications. Every call to epoll_wait only returns the subset of file descriptors belonging to the interest list that are ready.

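function Poller:register(fd, r, w)
  local ev = self.ev[0]
  -- Edge-triggered. EPOLLERR and EPOLLHUP are always reported by
  -- epoll_wait, so listing them here is optional.
  ev.events = bit.bor(C.EPOLLET, C.EPOLLERR, C.EPOLLHUP)
  if r then
    -- interested in readability
    ev.events = bit.bor(ev.events, C.EPOLLIN)
  end
  if w then
    -- interested in writability
    ev.events = bit.bor(ev.events, C.EPOLLOUT)
  end
  -- stash the descriptor so the event can be matched to it later
  ev.data.u64 = fd
  local rc = C.epoll_ctl(self.fd, C.EPOLL_CTL_ADD, fd, ev)
  if rc < 0 then errors.get(rc):abort() end
end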

At time t0, input arrives on the socket.
At time t4, the process calls epoll_wait.
At time t6, the process calls epoll_wait again.
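The timeline above can be replayed in a few lines of C. This is a sketch under assumptions: a pipe stands in for the socket, count_wakeups is a hypothetical helper, and the process deliberately does not read the pending input between the two epoll_wait calls.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers a pipe's read end with EPOLLIN plus trigger_flag (0 for the
   default level-triggered mode, EPOLLET for edge-triggered), makes input
   arrive once, then calls epoll_wait twice without reading.
   Returns how many of the two calls reported the descriptor, or -1 on
   a setup failure. */
int count_wakeups(uint32_t trigger_flag)
{
    int p[2];
    int epfd, first, second, total = -1;
    struct epoll_event ev = {0}, out;

    /* This sketch only expects one of the two modes. */
    assert(trigger_flag == 0 || trigger_flag == (uint32_t)EPOLLET);

    if (pipe(p) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    ev.events = EPOLLIN | trigger_flag;
    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {             /* input arrives */
        first = epoll_wait(epfd, &out, 1, 0);   /* first epoll_wait */
        second = epoll_wait(epfd, &out, 1, 0);  /* second, nothing read yet */
        if (first >= 0 && second >= 0)
            total = first + second;
    }

    close(epfd);
    close(p[0]);
    close(p[1]);
    return total;
}
```

Calling count_wakeups(0) reports the descriptor on both calls, because in level-triggered mode it stays ready for as long as unread data remains. Calling count_wakeups(EPOLLET) reports it only on the first call, because no new input arrived in between. This is why edge-triggered descriptors are usually made non-blocking and drained until read returns EAGAIN.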

Conclusion

This post aimed to capture the “method” part. In order to understand the “madness” wreaked by these semantics of epoll, a good reference would be the following two blog posts:

@copyconstruct on Twitter. Views expressed on this blog are solely mine, not those of present or past employers.

Cindy Sridharan
