Fool-proof cross-platform process kill daemon

Question

Tags: python, linux, process, watchdog

9

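I have some Python automation that spawns telnet sessions, which I log with the Linux script command; each logging session has two script process IDs (a parent and a child).

I need to solve a problem: if the Python automation script dies, the script sessions never close on their own. For some reason this is much harder than it should be.

So far I have implemented watchdog.py (see the bottom of the question), which daemonizes itself and polls the Python automation script's PID in a loop. When it sees the Python automation PID disappear from the server's process table, it attempts to kill the script sessions.

My problem is:

script sessions always spawn two separate processes; one script process is the parent of the other.
watchdog.py will not kill the child script processes if I start the script sessions from the automation script (see AUTOMATION EXAMPLE below).

AUTOMATION EXAMPLE (`reproduce_bug.py`)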

import pexpect as px
from subprocess import Popen
import code
import time
import sys
import os

def read_pid_and_telnet(_child, addr):
    time.sleep(0.1) # Give the OS time to write the PIDFILE
    # Read the PID in the PIDFILE
    fh = open('PIDFILE', 'r')
    pid = int(''.join(fh.readlines()))
    fh.close()
    time.sleep(0.1)
    # Clean up the PIDFILE
    os.remove('PIDFILE')
    _child.expect(['#', '\$'], timeout=3)
    _child.sendline('telnet %s' % addr)
    return str(pid)

pidlist = list()
child1 = px.spawn("""bash -c 'echo $$ > PIDFILE """
    """&& exec /usr/bin/script -f LOGFILE1.txt'""")
pidlist.append(read_pid_and_telnet(child1, '10.1.1.1'))

child2 = px.spawn("""bash -c 'echo $$ > PIDFILE """
    """&& exec /usr/bin/script -f LOGFILE2.txt'""")
pidlist.append(read_pid_and_telnet(child2, '10.1.1.2'))

cmd = "python watchdog.py -o %s -k %s" % (os.getpid(), ','.join(pidlist))
Popen(cmd.split(' '))
print "I started the watchdog with:\n   %s" % cmd

time.sleep(0.5)
raise RuntimeError, "Simulated script crash.  Note that script child sessions are hung"

Now, when I run the automation example above, this is what happens... Note that PID 30017 spawns 30018 and PID 30020 spawns 30021. All of these PIDs are script sessions.

[mpenning@Hotcoffee Network]$ python reproduce_bug.py 
I started the watchdog with:
   python watchdog.py -o 30016 -k 30017,30020
Traceback (most recent call last):
  File "reproduce_bug.py", line 35, in <module>
    raise RuntimeError, "Simulated script crash.  Note that script child sessions are hung"
RuntimeError: Simulated script crash.  Note that script child sessions are hung
[mpenning@Hotcoffee Network]$

After running the automation above, all of the child script sessions are still running.

[mpenning@Hotcoffee Models]$ ps auxw | grep script
mpenning 30018  0.0  0.0  15832   508 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE1.txt
mpenning 30021  0.0  0.0  15832   516 ?        S    12:08   0:00 /usr/bin/script -f LOGFILE2.txt
mpenning 30050  0.0  0.0   7548   880 pts/8    S+   12:08   0:00 grep script
[mpenning@Hotcoffee Models]$

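I am running the automation under Python 2.6.6 on a Debian Squeeze Linux system (uname -a: Linux Hotcoffee 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64 GNU/Linux).

QUESTION:

It seems that the daemon does not survive the crash of the process that spawned it. How can I fix watchdog.py so that it closes all of the script sessions when the automation dies (as in the example above)?

A watchdog.py log illustrates the problem (sadly, the PIDs do not match the ones in the original question)...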

[mpenning@Hotcoffee ~]$ cat watchdog.log 
2012-02-22,15:17:20.356313 Start watchdog.watch_process
2012-02-22,15:17:20.356541     observe pid = 31339
2012-02-22,15:17:20.356643     kill pids = 31352,31356
2012-02-22,15:17:20.356730     seconds = 2
[mpenning@Hotcoffee ~]$

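SOLUTION

The problem was essentially a race condition. When I tried to kill the "parent" script processes, they had already died at the same time as the automation script...

To solve the problem... first, the watchdog daemon needed to identify the entire list of children to be killed before polling the observed PID (my original script attempted to identify the children after the observed PID crashed). Next, I had to modify my watchdog daemon to allow for the possibility that some script processes might die along with the observed PID.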

watchdog.py:

#!/usr/bin/python
"""
Implement a cross-platform watchdog daemon, which observes a PID and kills 
other PIDs if the observed PID dies.

Example:
--------

watchdog.py -o 29322 -k 29345,29346,29348 -s 2

The command checks PID 29322 every 2 seconds and kills PIDs 29345, 29346, 29348 
and their children, if PID 29322 dies.

Requires:
----------

 * https://github.com/giampaolo/psutil
 * http://pypi.python.org/pypi/python-daemon
"""
from optparse import OptionParser
import datetime as dt
import signal
import daemon
import logging
import psutil
import time
import sys
import os

class MyFormatter(logging.Formatter):
    converter=dt.datetime.fromtimestamp
    def formatTime(self, record, datefmt=None):
        ct = self.converter(record.created)
        if datefmt:
            s = ct.strftime(datefmt)
        else:
            t = ct.strftime("%Y-%m-%d %H:%M:%S")
            s = "%s,%03d" % (t, record.msecs)
        return s

def check_pid(pid):        
    """ Check For the existence of a unix / windows pid."""
    try:
        os.kill(pid, 0)   # Kill 0 raises OSError, if pid isn't there...
    except OSError:
        return False
    else:
        return True

def kill_process(logger, pid):
    try:
        psu_proc = psutil.Process(pid)
    except Exception, e:
        logger.debug('Caught Exception ["%s"] while looking up PID %s' % (e, pid))
        return False

    logger.debug('Sending SIGTERM to %s' % repr(psu_proc))
    psu_proc.send_signal(signal.SIGTERM)
    psu_proc.wait(timeout=None)
    return True

def watch_process(observe, kill, seconds=2):
    """Kill the process IDs listed in 'kill', when 'observe' dies."""
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    logfile = logging.FileHandler('%s/watchdog.log' % os.getcwd())
    logger.addHandler(logfile)
    formatter = MyFormatter(fmt='%(asctime)s %(message)s',datefmt='%Y-%m-%d,%H:%M:%S.%f')
    logfile.setFormatter(formatter)


    logger.debug('Start watchdog.watch_process')
    logger.debug('    observe pid = %s' % observe)
    logger.debug('    kill pids = %s' % kill)
    logger.debug('    seconds = %s' % seconds)
    children = list()

    # Get PIDs of all child processes...
    for childpid in kill.split(','):
        children.append(childpid)
        p = psutil.Process(int(childpid))
        for subpsu in p.get_children():
            children.append(str(subpsu.pid))

    # Poll observed PID...
    while check_pid(int(observe)):
        logger.debug('Poll PID: %s is alive.' % observe)
        time.sleep(seconds)
    logger.debug('Poll PID: %s is *dead*, starting kills of %s' % (observe, ', '.join(children)))

    for pid in children:
        # kill all child processes...
        kill_process(logger, int(pid))
    sys.exit(0) # Exit gracefully

def run(observe, kill, seconds):

    with daemon.DaemonContext(detach_process=True, 
        stdout=sys.stdout,
        working_directory=os.getcwd()):
        watch_process(observe=observe, kill=kill, seconds=seconds)

if __name__=='__main__':
    parser = OptionParser()
    parser.add_option("-o", "--observe", dest="observe", type="int",
                      help="PID to be observed", metavar="INT")
    parser.add_option("-k", "--kill", dest="kill",
                      help="Comma separated list of PIDs to be killed", 
                      metavar="TEXT")
    parser.add_option("-s", "--seconds", dest="seconds", default=2, type="int",
                      help="Seconds to wait between observations (default = 2)", 
                      metavar="INT")
    (options, args) = parser.parse_args()
    run(options.observe, options.kill, options.seconds)
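
The second part of the fix, tolerating PIDs that vanish together with the observed process, can also be sketched with the standard library alone. The following is only an illustration and not part of the watchdog.py listing above: it is written for Python 3 (the listing is Python 2), the helper names pid_exists and terminate_if_alive are made up for this example, and it is POSIX only. On Windows, os.kill() with a signal value such as 0 terminates the target process instead of probing it, so this probe is not portable there.

```python
import os
import signal


def pid_exists(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)            # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:     # ESRCH: no such process
        return False
    except PermissionError:        # EPERM: the process exists but is owned by another user
        return True
    return True


def terminate_if_alive(pid, sig=signal.SIGTERM):
    """Send sig to pid; return False instead of raising if it is already gone."""
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        return False
    return True
```

A watchdog built on these helpers could record every PID to kill before it starts polling, then call terminate_if_alive() on each one and simply skip those that return False. Note that an exited process that has not yet been reaped by its parent (a zombie) still counts as existing for this probe.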

- Mike Pennington

Could you provide the watchdog.py logs? - François Févotte

5 Answers

1

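You could try to kill the complete process group, which contains the parent script, the child script, the bash spawned by script, and perhaps even the telnet process.

The kill(2) manual says:

If pid is less than -1, then sig is sent to every process in the process group whose ID is -pid.

So the equivalent of kill -TERM -$PID will do the job.

Oh, and the pid you need is the one of the parent script.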
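
As a rough Python 3 illustration of the kill -TERM -$PID idea (a sketch, not code from this thread): os.killpg(pgid, sig) signals every process in a group, which on POSIX is what os.kill(-pgid, sig) does. The helper name signal_process_group and the sh/sleep demo command are invented for this example, and it assumes the PID passed in is a process-group leader, which the demo arranges with start_new_session=True.

```python
import os
import signal
import subprocess
import time


def signal_process_group(leader_pid, sig=signal.SIGTERM):
    """Send sig to every process in the group led by leader_pid.

    Returns False if no such process group exists any more.
    """
    try:
        os.killpg(leader_pid, sig)   # same effect as os.kill(-leader_pid, sig)
    except ProcessLookupError:
        return False
    return True


if __name__ == "__main__":
    # start_new_session=True makes the shell a session and group leader,
    # so its PID doubles as the process group ID of what it spawns.
    shell = subprocess.Popen(["sh", "-c", "sleep 60 & sleep 60"],
                             start_new_session=True)
    time.sleep(0.2)                  # give the shell time to start its children
    print(signal_process_group(shell.pid))
    shell.wait()
```

Whether this reaches every descendant depends on the descendants staying in the leader's group; a process that calls setsid() or setpgid() on its own leaves the group and has to be signalled separately.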

Edit

Process group killing seems to work for me if I adapt the following two functions in watchdog.py:

def kill_process_group(log, pid):
    log.debug('killing %s' % -pid)
    os.kill(-pid, 15)

    return True

def watch_process(observe, kill, seconds=2):
    """Kill the process IDs listed in 'kill', when 'observe' dies."""
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    logfile = logging.FileHandler('%s/watchdog.log' % os.getcwd())
    logger.addHandler(logfile)
    formatter = MyFormatter(fmt='%(asctime)s %(message)s',datefmt='%Y-%m-%d,%H:%M:%S.%f')
    logfile.setFormatter(formatter)

    logger.debug('Start watchdog.watch_process')
    logger.debug('    observe pid = %s' % observe)
    logger.debug('    kill pids = %s' % kill)
    logger.debug('    seconds = %s' % seconds)

    while check_pid(int(observe)):
        logger.debug('PID: %s is alive.' % observe)
        time.sleep(seconds)
    logger.debug('PID: %s is *dead*, starting kills' % observe)

    for pid in kill.split(','):
        # Kill the children...
        kill_process_group(logger, int(pid))
    sys.exit(0) # Exit gracefully

- A.H.

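Actually, I found that my problem is that I can't get the daemon to persist long enough to kill said "script" sessions. Any ideas? - Mike Pennington

Yes and no: by manually restarting the watchdog under strace, I discovered that watchdog.py tried to write some error to /dev/null: 'UnboundLocalError: local variable 'pid' referenced. You want to redirect the daemon's streams to some log file :-) - A.H.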

0

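On inspection, it seems psu_proc.kill() (actually send_signal()) should raise OSError on failure, but just in case - have you tried checking for termination before setting the flag? For example: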

if not psu_proc.is_running():
  finished = True

- Eduardo Ivanec

0

Maybe you could use os.system() and run the killall command from your watchdog to kill all instances of /usr/bin/script.

- mikhail

1

"killall" is not an acceptable solution, because multiple automation scripts run at the same time. - Mike Pennington

0

The problem was essentially a race condition. When I tried to kill the "parent" script processes, they had already died at the same time as the automation script...

To solve the problem... first, the watchdog daemon needed to identify the entire list of children to be killed before polling the observed PID (my original script attempted to identify the children after the observed PID crashed). Next, I had to modify my watchdog daemon to allow for the possibility that some script processes might die along with the observed PID.

The revised watchdog.py is identical to the listing at the end of the question.

- Mike Pennington

Haha, 11.5 years between the question and the self-answer, that must be some kind of record =D


- Andrey Nikishaev · Accepted Answer

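Your problem is that script is not detached from the automation script after it is spawned, so it runs as a child process, and once the parent dies it can no longer be managed.

To handle the exit of the Python script you can use the atexit module. To monitor the exit of child processes you can use os.wait or handle the SIGCHLD signal.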
