Dask：如何避免任务超时？

Question

Dask：如何避免任务超时？

4

在我基于dask的应用程序中（使用distributed调度程序），我看到从以下错误文本开始的故障：

tornado.application - ERROR - Exception in Future <Future cancelled> after timeout
Traceback (most recent call last):
  File "/miniconda/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 970, in error_callback
    future.result()
concurrent.futures._base.CancelledError

接下来是第二个回溯，它（我认为）指示了当超时发生时我的任务正在运行的哪一行。（distributed 究竟是如何做到这一点的并不清楚 - 也许是通过信号？）

以下是第二个回溯中与 dask 相关的部分：

  ... my code...

  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 156, in compute
    (result,) = compute(self, traverse=False, **kwargs)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/base.py", line 397, in compute
    results = schedule(dsk, keys, **kwargs)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 2308, in get
    direct=direct)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1647, in gather
    asynchronous=asynchronous)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 665, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/distributed/client.py", line 1492, in _gather
    traceback)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1562, in reify
    seq = list(seq)
  File "/groups/flyem/proj/cluster/miniforge/envs/flyem/lib/python3.6/site-packages/dask/bag/core.py", line 1722, in map_chunk
    yield f(*a)

  ... my code ....

“after timeout”是否表示任务执行时间过长，或者是其他超时（如保姆或心跳）触发了取消操作？（据我所知，在dask中没有明确规定任务长度的超时时间，但我可能有些困惑。）
我看到任务被取消了。但我想知道为什么。有没有简单的方法可以找出是哪一行代码（在dask或distributed中）取消了我的任务，以及为什么取消？
我预计这些任务需要很长时间 - 它们正在将大缓冲区上传到云存储。如何增加dask中特定任务的超时时间？

- Stuart Berg

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- MRocklin · Accepted Answer

Dask默认不对任务设置超时。你看到的取消future不是Dask future，而是Tornado future（Tornado是Dask用于网络通信的库）。因此，很遗憾，所有这些都只是说明某些事情失败了。随后的回溯希望包含有关失败代码的确切信息。理想情况下，这将指向您的函数中出现故障的行。也许这有所帮助？一般来说，我们建议在调试通过Dask运行的代码时执行以下步骤：http://docs.dask.org/en/latest/debugging.html