Celery: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

Question

Tags: django, amazon-ec2, celery, amazon-elastic-beanstalk, supervisord

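I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks, and I daemonized it using Supervisor. The problem is that one of the periodic tasks I defined has started failing, after a week of working properly. The error I get is: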

[01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
  File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
    human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).

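All the processes managed by supervisor are up and running properly (supervisorctl status reports RUNNING).

I tried reading several logs on my EC2 instance, but none of them helped me find the cause of the SIGKILL. What should I do? How can I investigate?

These are my Celery settings: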

CELERY_TIMEZONE = 'UTC'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
BROKER_URL = os.environ['RABBITMQ_URL']
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = False
CELERYD_HIJACK_ROOT_LOGGER = False

And this is my supervisord.conf:

[program:celery_worker]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
startsecs=10
stopwaitsecs=60
stopasgroup=true
killasgroup=true
autostart=true
autorestart=true
stdout_logfile=/opt/python/log/celery_worker.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_worker.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

[program:celery_beat]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
startsecs=10
stopwaitsecs=300
stopasgroup=true
killasgroup=true
autostart=false
autorestart=true
stdout_logfile=/opt/python/log/celery_beat.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_beat.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1

Edit 1

After restarting celery beat, the problem remains.

Edit 2

Changed killasgroup=true to killasgroup=false, and the problem remains.

- daveoncode

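Hint: most probably this is due to low memory (RAM) on your server. If you are running containers through the docker command, you can use docker stats to see the memory consumption of each container. - Krishna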

2 Answers

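This type of error arises when your asynchronous Celery task, or the script it runs, holds a large amount of data in memory and effectively leaks it.

In my case, I was fetching data from another system and keeping it in a variable so that I could export all of it (into a Django model / Excel file) once the process finished.

Here is the catch. My script was gathering 10 million records, and it leaked memory as it collected them. This resulted in the exception being raised.

To overcome the issue, I divided the 10 million records into 20 parts (half a million in each). Every time the collected data reached 500,000 items, I stored it in my preferred local file / Django model. I repeated this for every batch of 500k items.

The number of partitions does not have to be exactly the same. The idea is to solve a complex problem by splitting it into multiple subproblems and solving them one by one :D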
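
The batching idea described above can be sketched roughly as follows. This is a minimal illustration, not the code from this answer: the function name flush_in_batches and the save_batch callback are hypothetical, and records is assumed to be any iterable that yields items one at a time, so the full data set is never held in memory at once.

```python
def flush_in_batches(records, save_batch, batch_size=500000):
    """Hand records to save_batch in chunks of at most batch_size items.

    `records` may be any iterable (ideally a generator) and `save_batch`
    is a hypothetical callback that persists one list of items.
    Returns the total number of items passed to save_batch.
    """
    batch = []
    total = 0
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            save_batch(batch)
            total += len(batch)
            # Rebind to a fresh list so the saved batch can be freed,
            # provided save_batch did not keep a reference to it.
            batch = []
    if batch:
        # Persist the final, possibly smaller, batch.
        save_batch(batch)
        total += len(batch)
    return total
```

In a Django project, save_batch might wrap a call such as bulk_create on a model manager, or append rows to a file. Peak memory then depends on batch_size rather than on the total number of records.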

- Farid Chowdhury

- Nino Walker · Accepted Answer

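The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and killasgroup only affects a kill initiated by supervisor itself (for example via supervisorctl or a plugin). Without that setting, the signal would have been sent to the dispatcher anyway, not to the child.

Most likely you have a memory leak, and the operating system's OOM killer is terminating your process for bad behavior.

Run grep oom /var/log/messages. If you see matching messages, that is your problem.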
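
If you would rather script the check than grep by hand, something along these lines may work. It is only a sketch: the helper names find_oom_lines and scan_log are made up for illustration, the list of phrases is an assumption about what the kernel typically logs (not an exhaustive one), and the log path differs between distributions.

```python
import re

# Phrases the Linux kernel commonly logs when the OOM killer acts.
# Assumed list for illustration; it is deliberately broad, like `grep oom`.
OOM_PATTERN = re.compile(r"oom|out of memory|killed process", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines (without trailing newline) that look OOM-related."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan a syslog-style file; the default path is an assumption."""
    with open(path, errors="replace") as handle:
        return find_oom_lines(handle)
```

Calling scan_log() on the instance (usually as root, since the log is often not world-readable) returns the suspicious lines. An empty list suggests the OOM killer was not involved, or that the relevant log has already been rotated.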

If you don't find anything, try running the periodic task manually in a shell:

MyPeriodicTask().run()

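and see what happens. I would monitor system and process metrics with top in another terminal if you don't have good instrumentation such as Cacti or Ganglia for this host.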
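
As a complement to watching top, the manual run can also report its own peak memory. The sketch below uses the standard-library tracemalloc module, which assumes Python 3.4 or later (the traceback in the question comes from Python 2.7, where it is not available). The wrapper name run_with_peak_memory is hypothetical. tracemalloc only counts memory allocated through Python's allocator, so the figure can be lower than the resident size the OOM killer looks at.

```python
import tracemalloc


def run_with_peak_memory(func, *args, **kwargs):
    """Call func and return (result, peak_bytes) as seen by tracemalloc."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        # Always stop tracing, even if the task raised an exception.
        tracemalloc.stop()
    return result, peak
```

For example, result, peak = run_with_peak_memory(MyPeriodicTask().run) followed by print(peak / 1024 / 1024) gives an approximate peak in MiB. A value close to the instance's available RAM would point toward the OOM-killer explanation.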