Finding the cause of a BrokenProcessPool in Python's concurrent.futures


In a nutshell

While parallelizing code with concurrent.futures, I ran into a BrokenProcessPool exception with no further error message. I would like to find the cause of the error, and I am asking for ideas on how to track it down.

Full question

I am using concurrent.futures to parallelize some code.

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor() as pool:
    mapObj = pool.map(myMethod, args)

Eventually I got the following exception (and nothing else):

concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore

Unfortunately, the program is complex and the error only appears after the program has been running for about 30 minutes. Therefore, I cannot provide a nice minimal example.

To find the cause of the problem, I wrapped the method that I run in parallel in a try-except block:

def myMethod(*args):
    try:
        ...
    except Exception as e:
        print(e)

The problem remained, and the except block was never entered. I concluded that the exception does not come from my code.
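That conclusion can be confirmed with a hypothetical minimal sketch (not the asker's program): a worker that dies abruptly never reaches its own except block, because the BrokenProcessPool exception is raised in the parent process, not in the worker.

```python
# Sketch: os._exit simulates an abrupt worker death (e.g. a segfault).
# It bypasses Python-level exception handling entirely, so the worker's
# try-except cannot fire; only the parent sees BrokenProcessPool.
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def crash(_):
    try:
        os._exit(1)  # abrupt termination, no exception is ever raised here
    except Exception as e:
        print("never reached:", e)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        try:
            list(pool.map(crash, range(4)))
        except BrokenProcessPool:
            print("caught in the parent, not in the worker")
```

This is why wrapping the worker function in try-except cannot reveal anything about this particular failure mode.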

My next step was to write a custom ProcessPoolExecutor class, a child of the original ProcessPoolExecutor, that allows me to replace some of its methods with customized ones. I copy-pasted the original code of the method _process_worker and added some print statements.

def _process_worker(call_queue, result_queue):
    """Evaluates calls from call_queue and places the results in result_queue.
        ...
    """
    while True:
        call_item = call_queue.get(block=True)
        if call_item is None:
            # Wake up queue management thread
            result_queue.put(os.getpid())
            return
        try:
            r = call_item.fn(*call_item.args, **call_item.kwargs)
        except BaseException as e:
            print("??? Exception ???")                 # newly added
            print(e)                                   # newly added
            exc = _ExceptionWithTraceback(e, e.__traceback__)
            result_queue.put(_ResultItem(call_item.work_id, exception=exc))
        else:
            result_queue.put(_ResultItem(call_item.work_id,
                                         result=r))

Again, the except block was never entered. This was to be expected, since I had already made sure that my code does not raise exceptions (and if everything worked well, the exception would be passed to the main process anyway).

Now I am lacking ideas on how I could find the error. The exception is raised here:

def submit(self, fn, *args, **kwargs):
    with self._shutdown_lock:
        if self._broken:
            raise BrokenProcessPool('A child process terminated '
                'abruptly, the process pool is not usable anymore')
        if self._shutdown_thread:
            raise RuntimeError('cannot schedule new futures after shutdown')

        f = _base.Future()
        w = _WorkItem(f, fn, args, kwargs)

        self._pending_work_items[self._queue_count] = w
        self._work_ids.put(self._queue_count)
        self._queue_count += 1
        # Wake up queue management thread
        self._result_queue.put(None)

        self._start_queue_management_thread()
        return f

And this is where the process pool is declared broken:

def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
        ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:                               #THIS BLOCK IS ENTERED WHEN THE ERROR OCCURS
            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...

So it seems to be a fact that a process terminates, but I do not know why. Are my thoughts correct so far? What are possible causes that make a process terminate without a message? (Is this even possible?) Where can I apply further diagnostics? Which questions should I ask myself in order to get closer to a solution?

I am using Python 3.5 on 64-bit Linux.


I ran into this error, and this post solved my problem. https://dev59.com/52Uo5IYBdhLWcg3wnwj2 - kmh
I got the same error; the exit code of the multiprocessing run was -11, while the same function works fine with multithreading. - WeiChing 林煒清
2 Answers


I got as far as I could get:

I changed the _queue_management_worker method in my altered ProcessPoolExecutor module such that the exit code of the failed process is printed:

def _queue_management_worker(executor_reference,
                             processes,
                             pending_work_items,
                             work_ids_queue,
                             call_queue,
                             result_queue):
    """Manages the communication between this process and the worker processes.
        ...
    """
    executor = None

    def shutting_down():
        return _shutdown or executor is None or executor._shutdown_thread

    def shutdown_worker():
        ...

    reader = result_queue._reader

    while True:
        _add_call_item_to_queue(pending_work_items,
                                work_ids_queue,
                                call_queue)

        sentinels = [p.sentinel for p in processes.values()]
        assert sentinels
        ready = wait([reader] + sentinels)
        if reader in ready:
            result_item = reader.recv()
        else:                               

            # BLOCK INSERTED FOR DIAGNOSIS ONLY ---------
            vals = list(processes.values())
            for s in ready:
                j = sentinels.index(s)
                print("is_alive()", vals[j].is_alive())
                print("exitcode", vals[j].exitcode)
            # -------------------------------------------


            # Mark the process pool broken so that submits fail right now.
            executor = executor_reference()
            if executor is not None:
                executor._broken = True
                executor._shutdown_thread = True
                executor = None
            # All futures in flight must be marked failed
            for work_id, work_item in pending_work_items.items():
                work_item.future.set_exception(
                    BrokenProcessPool(
                        "A process in the process pool was "
                        "terminated abruptly while the future was "
                        "running or pending."
                    ))
                # Delete references to object. See issue16284
                del work_item
            pending_work_items.clear()
            # Terminate remaining workers forcibly: the queues or their
            # locks may be in a dirty state and block forever.
            for p in processes.values():
                p.terminate()
            shutdown_worker()
            return
        ...

Afterwards I looked up the meaning of the exit code:

from multiprocessing.process import _exitcode_to_name
print(_exitcode_to_name[my_exit_code])

where my_exit_code is the exit code that was printed in the block I inserted into _queue_management_worker. In my case the code was -11, which means that I ran into a segmentation fault. Finding the cause of this will be a huge task, but it goes beyond the scope of this question.
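Since a segmentation fault kills the process before any Python except clause can run, one further diagnostic (a sketch, assuming the crash happens inside a C extension called from the worker) is to arm the standard-library faulthandler module at the top of the worker function, so that the Python-level traceback is dumped to stderr at the moment of the crash:

```python
import faulthandler
from concurrent.futures import ProcessPoolExecutor

def traced_worker(x):
    # Arm faulthandler first thing in the worker: on SIGSEGV it prints the
    # current Python traceback to stderr before the process dies, pointing
    # at the call that crashed.
    faulthandler.enable()
    return x * x  # placeholder for the real work (hypothetical)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        print(list(pool.map(traced_worker, range(3))))  # → [0, 1, 4]
```

(Python 3.7 later added an initializer= parameter to ProcessPoolExecutor that can do this once per worker process; on the 3.5 used in the question, the call has to go into the worker function itself.)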



If you are on macOS, there is a known issue with how some releases of macOS use fork, which Python considers not fork-safe in certain scenarios. The workaround that worked for me was to use the no_proxy environment variable.

Edit ~/.bash_profile and include the following (it would be better to specify a list of domains or subnets here instead of *):

no_proxy='*'

Then refresh the current shell context:

source ~/.bash_profile

The local versions where I ran into and solved this problem: Python 3.6.0 on macOS 10.14.1 and 10.13.x.

Sources: Issue 30388, Issue 27126
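Another workaround sometimes suggested for fork-safety problems on macOS (a sketch, not part of the answer above) is to avoid fork entirely and run the pool with the "spawn" start method, which launches fresh interpreter processes instead of forking the parent:

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

def square(x):
    return x * x

if __name__ == "__main__":
    # The mp_context= parameter requires Python 3.7+; on older versions,
    # calling multiprocessing.set_start_method("spawn") before creating
    # the pool has a similar effect.
    ctx = mp.get_context("spawn")
    with ProcessPoolExecutor(mp_context=ctx) as pool:
        print(list(pool.map(square, range(4))))  # → [0, 1, 4, 9]
```

Spawned workers re-import the main module, so the worker function must be importable (and the pool creation guarded by `if __name__ == "__main__":`), but no state from fork-unsafe frameworks is inherited.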


The same problem appears on macOS 10.14.6 (18G87) with Python 3.7.2. - Ojasvi Monga
