Finding spanning trees (using WITH RECURSIVE, PostgreSQL 9.5)

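I have a table of identities (i.e. aliases) for an arbitrary number of people. Each row has a previous name and a new name. In production there are about 1 million rows. For example: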

id, old, new
---
1, 'Albert', 'Bob'
2, 'Bob', 'Charles'
3, 'Mary', 'Nancy'
4, 'Charles', 'Albert'
5, 'Lydia', 'Nancy'
6, 'Zoe', 'Zoe'

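I want to generate a list of "users", each with references to its identities. This is similar to finding all the nodes in each connected component of the identity graph, or to finding a spanning forest.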
User 1: Albert, Bob, Charles (identities: 1,2,4)
User 2: Mary, Nancy, Lydia (identities: 3,5)
User 3: Zoe (identities: 6)
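
To make the expected grouping concrete, here is a minimal Python sketch (not SQL, and not part of the original question) that computes the connected components of the sample rows with a union-find structure. The function name `group_identities` and the in-memory `ROWS` list are illustrative assumptions; it only shows the target output, not how to get it in PostgreSQL.

```python
from collections import defaultdict

# Sample rows from the question: (id, old, new).
ROWS = [
    (1, "Albert", "Bob"),
    (2, "Bob", "Charles"),
    (3, "Mary", "Nancy"),
    (4, "Charles", "Albert"),
    (5, "Lydia", "Nancy"),
    (6, "Zoe", "Zoe"),
]


def group_identities(rows):
    """Return a sorted list of identity-id lists, one per connected component."""
    parent = {}

    def find(name):
        # Register unseen names as their own root.
        parent.setdefault(name, name)
        # Walk up to the root, shortening the chain along the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every row links its old name to its new name.
    for _row_id, old, new in rows:
        union(old, new)

    # Collect row ids under the root of the component they belong to.
    groups = defaultdict(list)
    for row_id, old, _new in rows:
        groups[find(old)].append(row_id)
    return sorted(sorted(ids) for ids in groups.values())


if __name__ == "__main__":
    print(group_identities(ROWS))  # prints [[1, 2, 4], [3, 5], [6]]
```

Each inner list corresponds to one "user" above: identities 1, 2, 4 for Albert/Bob/Charles, 3 and 5 for Mary/Nancy/Lydia, and 6 for Zoe.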

I have been trying PostgreSQL's WITH RECURSIVE, but it produces every set as well as its subsets. For example:

1,2,4 <-- spanning tree: good
2     <-- subset: discard
3,5   <-- spanning tree: good
4     <-- subset: discard
5     <-- subset: discard
6     <-- spanning tree: good

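What do I need to do to produce only the complete set of identities (i.e. the spanning tree) for each user?
SQLFiddle: http://sqlfiddle.com/#!15/9eaed/4 is my latest attempt. Here is the code: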
WITH RECURSIVE search_graph AS (
   SELECT id
    , id AS min_id
    , ARRAY[id] AS path
    , ARRAY[old,new] AS emails
   FROM   identities

   UNION 

   SELECT identities.id
    , LEAST(identities.id, sg.min_id)
    , (sg.path || identities.id)
    , (sg.emails || identities.old || identities.new)

   FROM search_graph sg
   JOIN identities ON (identities.old = ANY(sg.emails) OR identities.new = ANY(sg.emails))
   WHERE  identities.id <> ALL(sg.path)
)
SELECT array_agg(DISTINCT(p)) from search_graph, unnest(path) p GROUP BY min_id;

And the result is:

1,2,4
2
3,5
4
5
6

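I have a feeling that the subsets show up in the result because they are not exact duplicates of other intermediate results, so they are not eliminated. That would be because I am storing redundant information in "search_graph" and not sorting the contents of "path". - David Carney
"distinct" is not a function. - user330315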
1 Answer

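I answered a similar question before: How to find all connected subgraphs of an undirected graph. In that question I used SQL Server. See that answer for a detailed explanation of the intermediate CTEs. I adapted that query to Postgres.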

It could be written more efficiently using Postgres array features instead of concatenating the path into a text column.

WITH RECURSIVE
CTE_Idents
AS
(
    SELECT old AS Ident
    FROM identities

    UNION

    SELECT new AS Ident
    FROM identities
)
,CTE_Pairs
AS
(
    SELECT old AS Ident1, new AS Ident2
    FROM identities
    WHERE old <> new

    UNION

    SELECT new AS Ident1, old AS Ident2
    FROM identities
    WHERE old <> new
)
,CTE_Recursive
AS
(
    SELECT
        CTE_Idents.Ident AS AnchorIdent 
        , Ident1
        , Ident2
        , ',' || Ident1 || ',' || Ident2 || ',' AS IdentPath
        , 1 AS Lvl
    FROM 
        CTE_Pairs
        INNER JOIN CTE_Idents ON CTE_Idents.Ident = CTE_Pairs.Ident1

    UNION ALL

    SELECT 
        CTE_Recursive.AnchorIdent 
        , CTE_Pairs.Ident1
        , CTE_Pairs.Ident2
        , CTE_Recursive.IdentPath || CTE_Pairs.Ident2 || ',' AS IdentPath
        , CTE_Recursive.Lvl + 1 AS Lvl
    FROM
        CTE_Pairs
        INNER JOIN CTE_Recursive ON CTE_Recursive.Ident2 = CTE_Pairs.Ident1
    WHERE
        CTE_Recursive.IdentPath NOT LIKE ('%,' || CTE_Pairs.Ident2 || ',%')
)
,CTE_RecursionResult
AS
(
    SELECT AnchorIdent, Ident1, Ident2
    FROM CTE_Recursive
)
,CTE_CleanResult
AS
(
    SELECT AnchorIdent, Ident1 AS Ident
    FROM CTE_RecursionResult

    UNION

    SELECT AnchorIdent, Ident2 AS Ident
    FROM CTE_RecursionResult
)
,CTE_Groups
AS
(
  SELECT
    CTE_Idents.Ident
    ,array_agg(COALESCE(CTE_CleanResult.Ident, CTE_Idents.Ident) 
        ORDER BY COALESCE(CTE_CleanResult.Ident, CTE_Idents.Ident)) AS AllIdents
  FROM
    CTE_Idents
    LEFT JOIN CTE_CleanResult ON CTE_CleanResult.AnchorIdent = CTE_Idents.Ident
  GROUP BY CTE_Idents.Ident
)
SELECT AllIdents
FROM CTE_Groups
GROUP BY AllIdents
;

I added one row (7, X, Y) to your sample data.

SQL Fiddle

Result

|          allidents |
|--------------------|
|   Lydia,Mary,Nancy |
| Albert,Bob,Charles |
|                X,Y |
|                Zoe |
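
As a cross-check of the result above, the following Python sketch mirrors the overall shape of the query under some simplifying assumptions. It is not a translation of the SQL: the recursive CTE enumerates paths and rejects repeats with `NOT LIKE`, whereas this sketch swaps in a plain depth-first traversal with a visited set. The function name `connected_name_groups` and the `ROWS` list are hypothetical.

```python
# Sample rows from the question plus the extra row (7, X, Y) from the answer.
ROWS = [
    (1, "Albert", "Bob"),
    (2, "Bob", "Charles"),
    (3, "Mary", "Nancy"),
    (4, "Charles", "Albert"),
    (5, "Lydia", "Nancy"),
    (6, "Zoe", "Zoe"),
    (7, "X", "Y"),
]


def connected_name_groups(rows):
    """Return the distinct groups of connected names as sorted tuples."""
    # Roughly CTE_Idents: every distinct name, old or new.
    idents = {name for _row_id, old, new in rows for name in (old, new)}

    # Roughly CTE_Pairs: each edge in both directions, self-loops dropped.
    neighbours = {name: set() for name in idents}
    for _row_id, old, new in rows:
        if old != new:
            neighbours[old].add(new)
            neighbours[new].add(old)

    groups = set()
    for anchor in idents:
        # Stand-in for CTE_Recursive: walk outward from the anchor,
        # never visiting the same name twice.
        seen = {anchor}
        stack = [anchor]
        while stack:
            current = stack.pop()
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # Roughly CTE_Groups plus the final GROUP BY: a sorted group per
        # anchor, with identical groups collapsed by the set.
        groups.add(tuple(sorted(seen)))
    return sorted(groups)


if __name__ == "__main__":
    for group in connected_name_groups(ROWS):
        print(",".join(group))
```

Names with no edges other than a self-loop, such as Zoe, still form their own group because every name is used as an anchor, which is the role the `LEFT JOIN` with `COALESCE` plays in the query.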

This is great (and works perfectly)! Thank you very much for the help. Very enlightening. - David Carney
