PostgreSQL触发器函数中的死锁问题

4
使用postgres 9.3,我有一个名为regression_runs的表,该表存储一些计数器。当此表中的行被更新、插入或删除时,会调用触发器函数以更新nightly_runs表中的一行,以便针对具有给定ID的所有regression_runs保持这些计数器的运行总数。我采取的方法相当广泛地记录在案。然而,我的问题是,在多个进程同时尝试使用相同的nightly_run_id在regression_runs表中插入新行时,我遇到了死锁问题。
regression_runs表如下所示:
regression=> \d regression_runs
                                      Table "public.regression_runs"
     Column      |           Type           |                          Modifiers                           
-----------------+--------------------------+--------------------------------------------------------------
 id              | integer                  | not null default nextval('regression_runs_id_seq'::regclass)
 username        | character varying(16)    | not null
 nightly_run_id  | integer                  | 
 nightly_run_pid | integer                  | 
 passes          | integer                  | not null default 0
 failures        | integer                  | not null default 0
 errors          | integer                  | not null default 0
 skips           | integer                  | not null default 0
Indexes:
    "regression_runs_pkey" PRIMARY KEY, btree (id)
    "regression_runs_nightly_run_id_idx" btree (nightly_run_id)
Foreign-key constraints:
    "regression_runs_nightly_run_id_fkey" FOREIGN KEY (nightly_run_id) REFERENCES nightly_runs(id) ON UPDATE CASCADE ON DELETE CASCADE
Triggers:
    regression_run_update_trigger AFTER INSERT OR DELETE OR UPDATE ON regression_runs FOR EACH ROW EXECUTE PROCEDURE regression_run_update()

夜间运行表看起来像这样:

regression=> \d nightly_runs
                                    Table "public.nightly_runs"
   Column   |           Type           |                         Modifiers                         
------------+--------------------------+-----------------------------------------------------------
 id         | integer                  | not null default nextval('nightly_runs_id_seq'::regclass)
 passes     | integer                  | not null default 0
 failures   | integer                  | not null default 0
 errors     | integer                  | not null default 0
 skips      | integer                  | not null default 0
Indexes:
    "nightly_runs_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "regression_runs" CONSTRAINT "regression_runs_nightly_run_id_fkey" FOREIGN KEY (nightly_run_id) REFERENCES nightly_runs(id) ON UPDATE CASCADE ON DELETE CASCADE

触发器函数regression_run_update如下所示:
CREATE OR REPLACE FUNCTION regression_run_update() RETURNS "trigger"
    AS $$
        BEGIN
        IF TG_OP = 'UPDATE' THEN
                IF (NEW.nightly_run_id IS NOT NULL) and (NEW.nightly_run_id = OLD.nightly_run_id) THEN
                        UPDATE nightly_runs SET passes = passes + (NEW.passes - OLD.passes), failures = failures + (NEW.failures - OLD.failures), errors = errors + (NEW.errors - OLD.errors), skips = skips + (NEW.skips - OLD.skips) WHERE id = NEW.nightly_run_id;
                ELSE
                        IF NEW.nightly_run_id IS NOT NULL THEN
                                UPDATE nightly_runs SET passes = passes + NEW.passes, failures = failures + NEW.failures, errors = errors + NEW.errors, skips = skips + NEW.skips WHERE id = NEW.nightly_run_id;
                        END IF;
                        IF OLD.nightly_run_id IS NOT NULL THEN
                                UPDATE nightly_runs SET passes = passes - OLD.passes, failures = failures - OLD.failures, errors = errors - OLD.errors, skips = skips - OLD.skips WHERE id = OLD.nightly_run_id;
                        END IF;
                END IF;
        ELSIF TG_OP = 'INSERT' THEN
                IF NEW.nightly_run_id IS NOT NULL THEN
                        UPDATE nightly_runs SET passes = passes + NEW.passes, failures = failures + NEW.failures, errors = errors + NEW.errors, skips = skips + NEW.skips WHERE id = NEW.nightly_run_id;
                END IF;
        ELSIF TG_OP = 'DELETE' THEN
                IF OLD.nightly_run_id IS NOT NULL THEN
                        UPDATE nightly_runs SET passes = passes - OLD.passes, failures = failures - OLD.failures, errors = errors - OLD.errors, skips = skips - OLD.skips WHERE id = OLD.nightly_run_id;
                END IF;
        END IF;
        RETURN NEW;
        END;
$$  
    LANGUAGE plpgsql;

我在PostgreSQL日志文件中看到的内容如下:

ERROR:  deadlock detected
DETAIL:  Process 20266 waits for ShareLock on transaction 7520; blocked by process 20263.
        Process 20263 waits for ExclusiveLock on tuple (1,70) of relation 18469 of database 18354; blocked by process 20266.
        Process 20266: insert into regression_runs (username, nightly_run_id, nightly_run_pid) values ('tbeadle', 135, 20262);
        Process 20263: insert into regression_runs (username, nightly_run_id, nightly_run_pid) values ('tbeadle', 135, 20260);
HINT:  See server log for query details.
CONTEXT:  SQL statement "UPDATE nightly_runs SET passes = passes + NEW.passes, failures = failures + NEW.failures, errors = errors + NEW.errors, skips = skips + NEW.skips WHERE id = NEW.nightly_run_id"
        PL/pgSQL function regression_run_update() line 16 at SQL statement
STATEMENT:  insert into regression_runs (username, nightly_run_id, nightly_run_pid) values ('tbeadle', 135, 20262);

我可以使用这个脚本重现问题:
#!/usr/bin/env python

import os
import multiprocessing
import psycopg2

class Foo(object):
    def child(self):
        pid = os.getpid()
        conn = psycopg2.connect(
            'dbname=regression host=localhost user=regression')
        cur = conn.cursor()
        for i in xrange(100):
            cur.execute(
                "insert into regression_runs "
                "(username, nightly_run_id, nightly_run_pid) "
                "values "
                "('tbeadle', %s, %s);", (self.nid, pid))
            conn.commit()
        return

    def start(self):
        conn = psycopg2.connect(
            'dbname=regression host=localhost user=regression')
        cur = conn.cursor()
        cur.execute('insert into nightly_runs default values returning id;')
        row = cur.fetchone()
        conn.commit()
        self.nid = row[0]
        procs = []
        for child in xrange(5):
            procs.append(multiprocessing.Process(target=self.child))
        for proc in procs:
            proc.start()
        for proc in procs:
            proc.join()

Foo().start()

我无法弄清楚死锁是如何发生的,也不知道该怎么做才能解决它。请帮忙!


1
在我看来,触发器内更新字段是一个不好的主意。因为触发器触发者经常试图写入一行数据,这导致死锁。也许需要进行模式更改。对于难处理的情况,我会创建缓冲队列表并通过存储过程进行派遣。当然,我们使用外部工具来调节队列。 - corvinusz
1
@corvinusz:胡说八道。触发器是OP正在做的事情的理想工具。他只是不知道一些需要注意的地方。 - Denis de Bernardy
1个回答

5

往往情况下,死锁发生是因为与OLD和NEW相关的更新没有按照一致的顺序执行。例如:

IF TG_OP = 'UPDATE' THEN
  IF (NEW.nightly_run_id IS NOT NULL) AND (NEW.nightly_run_id = OLD.nightly_run_id) THEN
    -- stuff that seems fine
  ELSE
    IF NEW.nightly_run_id IS NOT NULL THEN
      UPDATE nightly_runs ... WHERE id = NEW.nightly_run_id; -- lock
    END IF;
    IF OLD.nightly_run_id IS NOT NULL THEN
      UPDATE nightly_runs ... WHERE id = OLD.nightly_run_id; -- lock
    END IF;

想象一下有两个事务:

  • T1 获取了 new.nightly_run_id = 1 的锁,并等待 old.nightly_run_id = 2 的锁
  • T2 获取了 new.nightly_run_id = 2 的锁,并等待 old.nightly_run_id = 1 的锁

死锁产生了...

调整执行顺序来避免这种情况:

IF OLD.nightly_run_id = NEW.nightly_run_id THEN
  -- stuff that seems fine
ELSIF OLD.nightly_run_id < NEW.nightly_run_id THEN
  UPDATE nightly_runs ... WHERE id = OLD.nightly_run_id;
  UPDATE nightly_runs ... WHERE id = NEW.nightly_run_id;
ELSEIF NEW.nightly_run_id < OLD.nightly_run_id THEN
  UPDATE nightly_runs ... WHERE id = NEW.nightly_run_id;
  UPDATE nightly_runs ... WHERE id = OLD.nightly_run_id;
ELSEIF OLD.nightly_run_id IS NOT NULL THEN
  UPDATE nightly_runs ... WHERE id = OLD.nightly_run_id;
ELSEIF NEW.nightly_run_id IS NOT NULL THEN
  UPDATE nightly_runs ... WHERE id = NEW.nightly_run_id;
END IF;

如果适用,您的其他触发器也应该进行类似的更改。除了代码中的其他病理情况外,死锁问题应该得到解决。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接