RQ (Redis Queue) work-horse terminated unexpectedly: any suggestions for debugging?

Question

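I am using RQ workers to process a large number of jobs, and I am running into a problem.

Observations:

- The job fails with the error work-horse terminated unexpectedly; waitpid returned None.
- The job connects to a database and runs a few SQL statements, such as simple inserts or deletes.
- The error appears almost immediately, within a few seconds of the job starting.
- Sometimes the job runs fine, with no problems.
- In one of the jobs, I could see that it performed an insert but then returned the error.
- On the RQ worker, I see the following log entries.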

{"message": "my_queue: my_job() (dcf797c4-1434-4b77-a344-5bbb1f775113)"}
{"message": "Killed horse pid 8451"}
{"message": "Moving job to FailedJobRegistry (work-horse terminated unexpectedly; waitpid returned None)"}

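Digging into the rq code (https://github.com/rq/rq), the "Killed horse pid..." line indicates that RQ is intentionally killing the work-horse itself. The only place where the horse gets killed is the following snippet. To reach the self.kill_horse() line, a HorseMonitorTimeoutException must be raised and utcnow() - job.started_at must exceed job.timeout by more than one second (and the timeout is very large).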

        while True:
            try:
                with UnixSignalDeathPenalty(self.job_monitoring_interval, HorseMonitorTimeoutException):
                    retpid, ret_val = os.waitpid(self._horse_pid, 0)
                break
            except HorseMonitorTimeoutException:
                # Horse has not exited yet and is still running.
                # Send a heartbeat to keep the worker alive.
                self.heartbeat(self.job_monitoring_interval + 5)

                # Kill the job from this side if something is really wrong (interpreter lock/etc).
                if job.timeout != -1 and (utcnow() - job.started_at).total_seconds() > (job.timeout + 1):
                    self.kill_horse()
                    break

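Sometimes these jobs sat in the queue for a long time before a worker actually picked them up. I had assumed that started_at would be reset at that point, but that assumption may be wrong.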
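To make the started_at hypothesis concrete, here is a minimal sketch that mirrors the condition from the worker snippet above. The helper name should_kill_horse, the timestamps, and the 6-hour timeout are made up for illustration and are not part of RQ. It assumes (unconfirmed) that a recurring job could keep the started_at value from an earlier run.

```python
from datetime import datetime


def should_kill_horse(started_at: datetime, now: datetime, timeout: int) -> bool:
    """Hypothetical helper: same test as the worker snippet, outside of RQ."""
    return timeout != -1 and (now - started_at).total_seconds() > (timeout + 1)


# First monitoring tick, 10 seconds into tonight's 11 pm run (made-up times).
now = datetime(2020, 5, 2, 23, 0, 10)
timeout = 6 * 60 * 60  # stand-in for a "very large" timeout: 6 hours

fresh_start = datetime(2020, 5, 2, 23, 0, 0)  # started_at reset for this run
stale_start = datetime(2020, 5, 1, 23, 0, 0)  # started_at left over from last night

fresh_kill = should_kill_horse(fresh_start, now, timeout)
stale_kill = should_kill_horse(stale_start, now, timeout)
print(fresh_kill, stale_kill)
```

Under that assumption, a stale started_at makes the elapsed time about 24 hours on the very first tick, so even a very large timeout is exceeded within seconds. That would match the "fails almost immediately" observation. Logging job.started_at next to the job's enqueue time would be one way to check whether this is what actually happens.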

The jobs are created with rq_scheduler and triggered periodically using a cron string (every night at 11 pm, and so on).

What should my next steps be?

- mj_

1 Answer

- mj_ · Accepted Answer

I think the latest release of RQ (https://github.com/rq/rq/releases/tag/v1.4.0) already fixes this issue.

Fixed a bug that may cause early termination of scheduled or requeued jobs. Thanks @rmartin48!
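A quick way to see whether a given environment already has that release is to compare the installed rq version against 1.4.0. The sketch below is only an illustration: parse_version and has_fix are hypothetical helpers, and the parsing is deliberately simplified (a pre-release suffix such as "rc1" is ignored).

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_version(text: str) -> tuple:
    """Turn '1.4.0' into (1, 4, 0), padding short versions with zeros."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple((parts + [0, 0, 0])[:3])


def has_fix(installed: str, fixed_in: str = "1.4.0") -> bool:
    """True if the installed version is at least the release named in the answer."""
    return parse_version(installed) >= parse_version(fixed_in)


try:
    installed = version("rq")  # reads the installed package metadata
except PackageNotFoundError:
    installed = None

if installed is None:
    print("rq is not installed in this environment")
else:
    print(installed, "has fix:", has_fix(installed))
```

If this reports a version below 1.4.0, upgrading rq on the workers is probably the cheapest first step before debugging further.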