What is the difference between epoll, poll and threadpool?

Can someone explain what the difference is between epoll, poll and a threadpool?

  • What are the pros and cons of each?
  • Any suggestions for frameworks?
  • Any suggestions for simple/basic tutorials?
  • It seems that epoll and poll are Linux-specific... Is there an equivalent alternative for Windows?
1 Answer

Threadpool does not really fit into the same category as poll and epoll, so I will assume you are referring to a threadpool as in "a threadpool to handle many connections with one thread per connection".

Pros and cons

  • threadpool
    • Reasonably efficient for small and medium concurrency, can even outperform other techniques.
    • Makes use of multiple cores.
    • Does not scale well beyond "several hundreds" even though some systems (e.g. Linux) can in principle schedule 100,000s of threads just fine.
    • Naive implementation exhibits "thundering herd" problem.
    • Apart from context switching and thundering herd, one must consider memory. Each thread has a stack (typically at least a megabyte). A thousand threads therefore take a gigabyte of RAM just for stack. Even if that memory is not committed, it still takes away considerable address space under a 32 bit OS (not really an issue under 64 bits).
    • Threads can actually use epoll, though the obvious way (all threads block on epoll_wait) is of no use, because epoll will wake up every thread waiting on it, so it will still have the same issues.
      • Optimal solution: single thread listens on epoll, does the input multiplexing, and hands complete requests to a threadpool (a rough sketch of this pattern follows right after this list).
      • futex is your friend here, in combination with e.g. a fast forward queue per thread. Although badly documented and unwieldy, futex offers exactly what's needed. epoll may return several events at a time, and futex lets you efficiently and in a precisely controlled manner wake N blocked threads at a time (N being min(num_cpu, num_events) ideally), and in the best case it does not involve an extra syscall/context switch at all.
      • Not trivial to implement, takes some care.
  • fork (a.k.a. old-fashioned threadpool)
    • Reasonably efficient for small and medium concurrency.
    • Does not scale well beyond "few hundreds".
    • Context switches are much more expensive (different address spaces!).
    • Scales significantly worse on older systems where fork is much more expensive (deep copy of all pages). Even on modern systems fork is not "free", although the overhead is mostly coalesced by the copy-on-write mechanism. On large datasets which are also modified, a considerable number of page faults following fork may negatively impact performance.
    • However, proven to work reliably for over 30 years.
    • Ridiculously easy to implement and rock solid: If any of the processes crash, the world does not end. There is (almost) nothing you can do wrong.
    • Very prone to "thundering herd".
  • poll / select
    • Two flavours (BSD vs. System V) of more or less the same thing.
    • Somewhat old and slow, somewhat awkward usage, but there is virtually no platform that does not support them.
    • Waits until "something happens" on a set of descriptors
      • Allows one thread/process to handle many requests at a time.
      • No multi-core usage.
    • Needs to copy the list of descriptors from user to kernel space every time you wait, and needs to perform a linear search over the descriptors afterwards. This limits its effectiveness (a minimal poll loop illustrating this is sketched in the tutorial section below).
    • Does not scale well to "thousands" (in fact, hard limit around 1024 on most systems, or as low as 64 on some).
    • Use it because it's portable if you only deal with a dozen descriptors anyway (no performance issues there), or if you must support platforms that don't have anything better. Don't use otherwise.
    • Conceptually, a server becomes a little more complicated than a forked one, since you now need to maintain many connections and a state machine for each connection, and you must multiplex between requests as they come in, assemble partial requests, etc. A simple forked server just knows about a single socket (well, two, counting the listening socket), reads until it has what it wants or until the connection is half-closed, and then writes whatever it wants. It doesn't worry about blocking or readiness or starvation, nor about some unrelated data coming in, that's some other process's problem.
  • epoll
    • Linux only.
    • Concept of expensive modifications vs. efficient waits:
      • Copies information about descriptors to kernel space when descriptors are added (epoll_ctl)
        • This is usually something that happens rarely.
      • Does not need to copy data to kernel space when waiting for events (epoll_wait)
        • This is usually something that happens very often.
      • Adds the waiter (or rather its epoll structure) to descriptors' wait queues
        • Descriptor therefore knows who is listening and directly signals waiters when appropriate rather than waiters searching a list of descriptors
        • Opposite way of how poll works
        • O(1) with small k (very fast) in respect of the number of descriptors, instead of O(n)
    • Works very well with timerfd and eventfd (stunning timer resolution and accuracy, too).
    • Works nicely with signalfd, eliminating the awkward handling of signals, making them part of the normal control flow in a very elegant manner.
    • An epoll instance can host other epoll instances recursively
    • Assumptions made by this programming model:
      • Most descriptors are idle most of the time, few things (e.g. "data received", "connection closed") actually happen on few descriptors.
      • Most of the time, you don't want to add/remove descriptors from the set.
      • Most of the time, you're waiting on something to happen.
    • Some minor pitfalls:
      • A level-triggered epoll wakes all threads waiting on it (this is "works as intended"), therefore the naive way of using epoll with a threadpool is useless. At least for a TCP server, it is no big issue since partial requests would have to be assembled first anyway, so a naive multithreaded implementation won't do either way.
      • Does not work as one would expect with file read/writes ("always ready").
      • Could not be used with AIO until recently, now possible via eventfd, but requires a (to date) undocumented function.
      • If the above assumptions are not true, epoll can be inefficient, and poll may perform equally or better.
      • epoll cannot do "magic", i.e. it is still necessarily O(N) in respect to the number of events that occur.
      • However, epoll plays well with the new recvmmsg syscall, since it returns several readiness notifications at a time (as many as are available, up to whatever you specify as maxevents). This makes it possible to receive e.g. 15 EPOLLIN notifications with one syscall on a busy server, and read the corresponding 15 messages with a second syscall (a 93% reduction in syscalls!). Unfortunately, all operations on one recvmmsg invocation refer to the same socket, so it is mostly useful for UDP based services (for TCP, there would have to be a kind of recvmmsmsg syscall which also takes a socket descriptor per item!). A sketch combining epoll_wait with recvmmsg appears in the tutorial section below.
      • Descriptors should always be set to nonblocking and one should check for EAGAIN even when using epoll because there are exceptional situations where epoll reports readiness and a subsequent read (or write) will still block. This is also the case for poll/select on some kernels (though it has presumably been fixed).
      • With a naive implementation, starvation of slow senders is possible. When blindly reading until EAGAIN is returned upon receiving a notification, it is possible to indefinitely read new incoming data from a fast sender while completely starving a slow sender (as long as data keeps coming in fast enough, you might not see EAGAIN for quite a while!). Applies to poll/select in the same manner.
      • Edge-triggered mode has some quirks and unexpected behaviour in some situations, since the documentation (both man pages and TLPI) is vague ("probably", "should", "might") and sometimes misleading about its operation.
        The documentation states that several threads waiting on one epoll are all signalled. It further states that a notification tells you whether IO activity has happened since the last call to epoll_wait (or since the descriptor was opened, if there was no previous call).
        The true, observable behaviour in edge-triggered mode is much closer to "wakes the first thread that has called epoll_wait, signalling that IO activity has happened since anyone last called either epoll_wait or a read/write function on the descriptor, and thereafter only reports readiness again to the next thread calling or already blocked in epoll_wait, for any operations happening after anyone called a read (or write) function on the descriptor". It kind of makes sense, too... it just isn't exactly what the documentation suggests.
  • kqueue
    • BSD analogue to epoll, different usage, similar effect.
    • Also works on Mac OS X
    • Rumoured to be faster (I've never used it, so cannot tell if that is true).
    • Registers events and returns a result set in a single syscall.
  • IO Completion ports
    • Epoll for Windows, or rather epoll on steroids.
    • Works seamlessly with everything that is waitable or alertable in some way (sockets, waitable timers, file operations, threads, processes)
    • If Microsoft got one thing right in Windows, it is completion ports:
      • Works worry-free out of the box with any number of threads
      • No thundering herd
      • Wakes threads one by one in a LIFO order
      • Keeps caches warm and minimizes context switches
      • Respects number of processors on machine or delivers the desired number of workers
    • Allows the application to post events, which lends itself to a very easy, failsafe, and efficient parallel work queue implementation (schedules upwards of 500,000 tasks per second on my system; a small sketch follows the IOCP mini-tutorial below).
    • Minor disadvantage: Does not easily remove file descriptors once added (must close and re-open).
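
A rough sketch of the "single epoll thread feeding a worker pool" pattern mentioned in the threadpool section above. This is my illustration rather than code from the original answer: it uses a plain pthread mutex/condvar queue instead of the futex-and-per-thread-queue scheme suggested there, handle_request() and the accept() logic are placeholders, and error checking is omitted.

#include <pthread.h>
#include <sys/epoll.h>

#define MAX_EVENTS  64
#define NUM_WORKERS  4
#define QUEUE_SIZE 1024

static int job_fds[QUEUE_SIZE];                 /* tiny ring buffer of "ready" descriptors */
static unsigned job_head, job_tail;
static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  job_cond = PTHREAD_COND_INITIALIZER;

static void handle_request(int fd) { (void)fd; /* placeholder: read, parse, reply */ }

static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&job_lock);
        while (job_head == job_tail)
            pthread_cond_wait(&job_cond, &job_lock);
        int fd = job_fds[job_tail++ % QUEUE_SIZE];
        pthread_mutex_unlock(&job_lock);
        handle_request(fd);                      /* the worker, not the epoll thread, does the work */
    }
    return NULL;
}

/* The epoll thread does nothing but multiplex and hand off. Connection
   descriptors are assumed to be added elsewhere with epoll_ctl(ep, EPOLL_CTL_ADD, ...);
   a real server would also use EPOLLONESHOT (or remove the fd) while a worker owns it. */
void epoll_loop(int ep)
{
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
    }

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        int n = epoll_wait(ep, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            pthread_mutex_lock(&job_lock);
            job_fds[job_head++ % QUEUE_SIZE] = events[i].data.fd;
            pthread_cond_signal(&job_cond);      /* wake exactly one worker per event, no herd */
            pthread_mutex_unlock(&job_lock);
        }
    }
}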

Frameworks

libevent -- the 2.0 version also supports completion ports under Windows.

ASIO -- if you use Boost in your project, you already have boost-asio available.

Any suggestions for simple/basic tutorials?

The frameworks listed above come with extensive documentation. The Linux man pages and MSDN explain epoll and completion ports in detail.
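
Not from the original answer, but for comparison with the two mini-tutorials that follow, here is a similarly stripped-down poll loop (needs <poll.h>; some_socket_fd, NUM_FDS and do_something are placeholders). Note how the entire pollfd array is handed to the kernel on every call and scanned linearly afterwards, which is exactly the per-wait overhead criticized above:

struct pollfd fds[NUM_FDS];
fds[0].fd = some_socket_fd;
fds[0].events = POLLIN;
// ... fill in the remaining entries ...

for(...)
    if(poll(fds, NUM_FDS, -1) > 0)            // whole array is copied to kernel space each call
        for(int i = 0; i < NUM_FDS; i++)      // and scanned linearly in user space afterwards
            if(fds[i].revents & POLLIN)
                do_something(fds[i].fd);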

Mini-tutorial for using epoll:

int my_epoll = epoll_create1(0);  // the older epoll_create(size) also works; its argument is ignored nowadays (but must be > 0)

struct epoll_event e;
e.events = EPOLLIN;
e.data.fd = some_socket_fd; // the data member is a union, so this can in fact be anything you like

epoll_ctl(my_epoll, EPOLL_CTL_ADD, some_socket_fd, &e);

...
struct epoll_event evt[10]; // or whatever number
for(...)
    if((num = epoll_wait(my_epoll, evt, 10, -1)) > 0)
        do_something();
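
As mentioned in the pitfalls list above, epoll combines nicely with recvmmsg for UDP services. A sketch of that combination (my addition, not part of the original answer), assuming udp_fd is a nonblocking, bound UDP socket that was added to my_epoll, and handle_datagram is a placeholder; needs _GNU_SOURCE, <sys/socket.h> and <string.h>:

#define BATCH 16
char bufs[BATCH][1500];
struct iovec iov[BATCH];
struct mmsghdr msgs[BATCH];

memset(msgs, 0, sizeof msgs);
for(int i = 0; i < BATCH; i++) {
    iov[i].iov_base = bufs[i];
    iov[i].iov_len  = sizeof bufs[i];
    msgs[i].msg_hdr.msg_iov    = &iov[i];
    msgs[i].msg_hdr.msg_iovlen = 1;
}

if((num = epoll_wait(my_epoll, evt, 10, -1)) > 0) {
    // one EPOLLIN notification, one syscall, up to 16 datagrams
    // (a real server would check evt[i].data.fd first)
    int got = recvmmsg(udp_fd, msgs, BATCH, MSG_DONTWAIT, NULL);
    for(int i = 0; i < got; i++)
        handle_datagram(bufs[i], msgs[i].msg_len);
}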

Mini-tutorial for IO completion ports (note that CreateIoCompletionPort is called twice with different parameters):

HANDLE iocp = CreateIoCompletionPort(INVALID_HANDLE_VALUE, 0, 0, 0); // equals epoll_create
CreateIoCompletionPort(mySocketHandle, iocp, 0, 0); // equals epoll_ctl(EPOLL_CTL_ADD)

DWORD number_bytes; ULONG_PTR key;
OVERLAPPED* o;
for(...)
    if(GetQueuedCompletionStatus(iocp, &number_bytes, &key, &o, INFINITE)) // equals epoll_wait()
        do_something();
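
The pros/cons list above mentions that an application can post its own events to a completion port, which is what makes it usable as a parallel work queue. A minimal sketch of that (my addition; my_task, my_task_t and do_work are placeholders):

// Producer side: hand a task to the port, using the completion key to carry the pointer
PostQueuedCompletionStatus(iocp, 0, (ULONG_PTR)my_task, NULL);

// Consumer side: the same GetQueuedCompletionStatus loop as above picks it up
if(GetQueuedCompletionStatus(iocp, &number_bytes, &key, &o, INFINITE))
    do_work((my_task_t*)key);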

(These mini-tutorials omit all error checking, and hopefully I haven't made any typos, but they should for the most part be enough to give you some idea.)
EDIT:
Note that completion ports (Windows) conceptually work the other way around from epoll (or kqueue). As their name suggests, they signal completion, not readiness. That is, you fire off an asynchronous request and forget about it until some time later you are told that it has completed (either successfully or not so successfully, and there is the special case of "completed immediately", too).
With epoll, you block until you are notified that either "some data" (possibly as little as one byte) has arrived and is available, or that there is sufficient buffer space so you can do a write operation without blocking. Only then do you start the actual operation, which will then hopefully not block (there is no strict guarantee beyond what you would expect, which is why it is a good idea to set descriptors to nonblocking and check for EAGAIN [and EWOULDBLOCK for sockets, because the standard allows two different error values]).
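
To make that last point concrete, here is a small sketch (my addition) of putting a descriptor into nonblocking mode and reading after an epoll notification; fd, buf, consume() and MAX_CHUNKS_PER_WAKEUP are placeholders, and the per-wakeup cap is there to avoid the slow-sender starvation pitfall mentioned above (needs <fcntl.h>, <errno.h>, <unistd.h>):

fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);  // done once, when the descriptor is created

// ... later, after epoll_wait reported EPOLLIN on fd ...
for(int chunk = 0; chunk < MAX_CHUNKS_PER_WAKEUP; chunk++) {
    ssize_t n = read(fd, buf, sizeof buf);
    if(n > 0) {
        consume(buf, n);                      // hand the data to your protocol state machine
    } else if(n == 0) {
        // peer closed the connection
        break;
    } else if(errno == EAGAIN || errno == EWOULDBLOCK) {
        break;                                // drained (or spurious readiness): go back to epoll_wait
    } else {
        // real error: handle/close
        break;
    }
}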

I disagree with your statement about I/O completion ports being something Microsoft got right. Glad you noted their reversed design in your edit! - hookenz
Actually, my view on this has changed somewhat. After working with RDMA, the IOCP API fits that model more cleanly, and the performance is potentially better. In practice I'm not so sure. Anyway... I wouldn't call it outdated anymore, just different and harder to wrap your head around. - hookenz
I like all the detail you provide. I think EPOLLET still wakes all threads. fs/eventpoll.c: ep_send_events_proc() is the only function that uses that flag, and only to determine whether the descriptor should be inserted back into the ready list. - Ant Manelope
I'm curious why signalfd() is presented here as an advantage when used with epoll, rather than working just as well with poll() and select(). - Pavel Šimerda
@PavelŠimerda: signalfd can be used with poll and select, and I never said otherwise. However, those APIs scale linearly with the number of descriptors, and adding a signalfd to the set means yet another descriptor (only one, but it still adds per-descriptor overhead). epoll is only linear in the number of events that actually occur, and since signals are rare, a signalfd will be "idle most of the time", which makes it a perfect match for epoll. As described above, epoll has no notable advantage for watching a few busy file descriptors, but it does for "many, mostly idle" ones. - Damon
