A large number of multiprocessing.Process instances causes a deadlock

8

Background

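I need to run a multiprocessing.Process inside a multiprocessing.pool.ThreadPool. That may seem odd, but it is the only way I have found to cope with the segfaults that can occur in a C++ shared library I use. If a segfault happens, the process is killed, and I can check the process's exitcode and handle it.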
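
As a minimal sketch of the exit-code check described above: on Unix, a child that was killed by a signal reports a negative exitcode equal to minus the signal number. The helper names run_isolated, describe, and crash_like_native_code are hypothetical, and the fork start method is assumed, so this is Unix-only.

```python
import multiprocessing
import os
import signal

# Assumption: use the fork start method explicitly (Unix only).
_ctx = multiprocessing.get_context("fork")


def crash_like_native_code():
    # Hypothetical stand-in for a native library call that segfaults.
    os.kill(os.getpid(), signal.SIGSEGV)


def run_isolated(target, *args):
    """Run target in a child process and return the child's exit code."""
    process = _ctx.Process(target=target, args=args)
    process.start()
    process.join()
    return process.exitcode


def describe(exitcode):
    """Turn a Process.exitcode value into a short log message."""
    if exitcode == 0:
        return "ok"
    if exitcode < 0:
        # A negative value -N means the child was terminated by signal N.
        return "killed by " + signal.Signals(-exitcode).name
    return "exited with status %d" % exitcode
```

A worker could then call describe(run_isolated(crash_like_native_code)) and log the resulting message instead of crashing itself.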

Problem

After a while, a deadlock occurs when I try to join the process.

Here is a simplified version of my code:

import sys, time, multiprocessing
from multiprocessing.pool import ThreadPool

def main():
    # Launch 8 workers
    pool = ThreadPool(8)
    it = pool.imap(run, range(500))
    while True:
        try:
            it.next()
        except StopIteration:
            break

def run(value):
    # Each worker launches its own Process
    process = multiprocessing.Process(target=run_and_might_segfault, args=(value,))
    process.start()

    while process.is_alive():
        sys.stdout.write('.')
        sys.stdout.flush()
        time.sleep(0.1)

    # Will never join after a while, because of a mystery deadlock
    process.join()

    # Deals with process.exitcode to log errors

def run_and_might_segfault(value):
    # Load a shared library and do stuff (could throw c++ exception, segfault ...)
    print(value)

if __name__ == '__main__':
    main()

Here is a possible output:

➜  ~ python m.py
..0
1
........8
.9
.......10
......11
........12
13
........14
........16
........................................................................................

As you can see, after a few iterations process.is_alive() stays true forever and the process never joins.

If I press CTRL-C, I get the following stack trace:

Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 680, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "m.py", line 30, in <module>
    main()
  File "m.py", line 9, in main
    it.next()
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 684, in next
    self._cond.wait(timeout)
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/threading.py", line 293, in wait
    waiter.acquire()
KeyboardInterrupt

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.1/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/popen_fork.py", line 29, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt

PS: I am using Python 3.5.2 on macOS.

Any kind of help is much appreciated.

Edit

I tried with Python 2.7 and it works fine. Could this be an issue specific to Python 3.5?

1 Answer

10

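The problem also reproduces on the latest CPython build: Python 3.7.0a0 (default:4e2cce65e522, Oct 13 2016, 21:55:44)

If you attach gdb to one of the hung processes, you will see that it is trying to acquire a lock in the sys.stdout.flush() call: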

(gdb) py-list
 263                import traceback
 264                sys.stderr.write('Process %s:\n' % self.name)
 265                traceback.print_exc()
 266            finally:
 267                util.info('process exiting with exitcode %d' % exitcode)
>268                sys.stdout.flush()
 269                sys.stderr.flush()
 270
 271            return exitcode

The Python-level backtrace looks like this:

 (gdb) py-bt
 Traceback (most recent call first):
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/process.py", line 268, in _bootstrap
     sys.stdout.flush()
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/popen_fork.py", line 74, in _launch
     code = process_obj._bootstrap()
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/popen_fork.py", line 20, in __init__
     self._launch(process_obj)
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/context.py", line 277, in _Popen
     return Popen(process_obj)
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/context.py", line 223, in _Popen
     return _default_context.get_context().Process._Popen(process_obj)
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/process.py", line 105, in start
     self._popen = self._Popen(self)
   File "deadlock.py", line 17, in run
     process.start()
   File "/home/rpodolyaka/src/cpython/Lib/multiprocessing/pool.py", line 119, in worker
     result = (True, func(*args, **kwds))
   File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 864, in run
     self._target(*self._args, **self._kwargs)
   File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 916, in _bootstrap_inner
     self.run()
   File "/home/rpodolyaka/src/cpython/Lib/threading.py", line 884, in _bootstrap
     self._bootstrap_inner()

At the interpreter level, it looks like this:
(gdb) frame 6

(gdb) list
287        return 0;
288    }
289    relax_locking = (_Py_Finalizing != NULL);
290    Py_BEGIN_ALLOW_THREADS
291    if (!relax_locking)
292        st = PyThread_acquire_lock(self->lock, 1);
293    else {
294        /* When finalizing, we don't want a deadlock to happen with daemon
295         * threads abruptly shut down while they owned the lock.
296         * Therefore, only wait for a grace period (1 s.). ... */

(gdb) p /x self->lock
$1 = 0xd25ce0

(gdb) p /x self->owner
$2 = 0x7f9bb2128700

Note that, from the point of view of this particular child process, the lock is still owned by a thread of the parent process (LWP 1105):
(gdb) info threads
  Id   Target Id         Frame
* 1    Thread 0x7f9bb5559440 (LWP 1102) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0xe4d340) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  2    Thread 0x7f9bb312a700 (LWP 1103) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  3    Thread 0x7f9bb2929700 (LWP 1104) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  4    Thread 0x7f9bb2128700 (LWP 1105) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  5    Thread 0x7f9bb1927700 (LWP 1106) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  6    Thread 0x7f9bb1126700 (LWP 1107) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  7    Thread 0x7f9bb0925700 (LWP 1108) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  8    Thread 0x7f9b9bfff700 (LWP 1109) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  9    Thread 0x7f9b9b7fe700 (LWP 1110) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  10   Thread 0x7f9b9affd700 (LWP 1111) "python" 0x00007f9bb4780253 in select () at ../sysdeps/unix/syscall-template.S:84
  11   Thread 0x7f9b9a7fc700 (LWP 1112) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0x7f9b80001ed0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
  12   Thread 0x7f9b99ffb700 (LWP 1113) "python" 0x00007f9bb5157577 in futex_abstimed_wait_cancelable (private=0, abstime=0x0, expected=0, 
    futex_word=0x7f9b84001bb0) at ../sysdeps/unix/sysv/linux/futex-internal.h:205
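
The state seen in the debugger, a lock whose owning thread does not exist in the child, can be reproduced in isolation. The following is only a sketch under assumptions: it uses a plain threading.Lock instead of the stdout buffer lock, the function name lock_state_after_fork is hypothetical, and it needs a Unix system with os.fork().

```python
import os
import threading


def lock_state_after_fork(timeout=0.5):
    """Fork while another thread holds a lock.

    Returns (child_acquired, parent_acquired).
    """
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            held.set()      # tell the main thread the lock is taken
            release.wait()  # keep holding it until told to let go

    thread = threading.Thread(target=holder)
    thread.start()
    held.wait()

    pid = os.fork()  # the child receives a copy of the lock in its locked state
    if pid == 0:
        # Child: the holder thread does not exist here, so nothing will
        # ever release this copy of the lock.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    # Parent: releasing the lock here does not affect the child's copy.
    release.set()
    thread.join()
    _, status = os.waitpid(pid, 0)
    child_acquired = os.WEXITSTATUS(status) == 0
    parent_acquired = lock.acquire(timeout=timeout)
    return child_acquired, parent_acquired
```

The child's acquire times out even though the parent releases its own copy right after the fork, which is the same situation the hung child is in with the stdout lock, except that the child there waits without a timeout.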

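Because the original process writes to and flushes sys.stdout from several threads at once while also creating child processes, a deadlock is indeed possible.

Due to the nature of the fork(2) system call, a child inherits the parent's memory, including any locks that are currently held. Here, fork() must have been called while another thread was holding the sys.stdout lock. Even though the parent eventually releases the lock, the child never sees that: each process has its own memory space, copied on write, and the thread that held the lock does not exist in the child.

So you must be very careful when mixing multithreading with multiprocessing, and make sure that every lock the children will use has been released before fork(). This is very similar to what is described in http://bugs.python.org/issue6721. Note that if you remove the interaction with sys.stdout from your snippet, it works properly.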
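
One way to apply this to the snippet from the question, sketched under assumptions: put every sys.stdout access and every process.start() behind a single lock, so that no thread can be inside a write or flush at the moment another thread forks. The names _fork_guard, log, child_task, and run_all are hypothetical, the fork start method is assumed (Unix only), and the guard covers sys.stdout only; locks taken by other libraries would need the same treatment.

```python
import multiprocessing
import sys
import threading
import time
from multiprocessing.pool import ThreadPool

# Assumption: fork start method, as in the question (Unix only).
_ctx = multiprocessing.get_context("fork")

# Hypothetical guard: every stdout access and every fork goes through it.
_fork_guard = threading.Lock()


def log(text):
    with _fork_guard:
        sys.stdout.write(text)
        sys.stdout.flush()


def child_task(value):
    # Stand-in for the native call. It must not use _fork_guard, because
    # the child's copy of that lock is held forever after the fork.
    print(value)


def run(value):
    process = _ctx.Process(target=child_task, args=(value,))
    with _fork_guard:
        # No other thread can be writing to stdout while we fork.
        process.start()
    while process.is_alive():
        log(".")
        time.sleep(0.01)
    process.join()
    return process.exitcode


def run_all(count, workers=8):
    with ThreadPool(workers) as pool:
        return pool.map(run, range(count))
```

run_all(16, workers=4) returns one exit code per task, and each is 0 when the corresponding child exited cleanly.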
