在Python中查找所有后代节点

Question

在Python中查找所有后代节点

10

我需要获取所有以side_a - side_b表示的链接的子代坐标（在一个数据帧中），直到达到每个side_a的end_point为止（在另一个数据帧中）。因此：

df1:
side_a   side_b
  a        b
  b        c
  c        d
  k        l
  l        m
  l        n
  p        q
  q        r
  r        s

df2:
side_a    end_point
  a          c
  b          c
  c          c
  k          m
  k          n
  l          m
  l          n
  p          s
  q          s
  r          s

目标是从df2中获取对于每个side_a值的所有点，直到达到end_point。如果它具有两个end_point值（例如"k"），则应该有两个列表。

我有一些代码，但它没有按照这种方法编写，它会删除所有df1中的行，如果df1['side_a'] == df2['end_points']，这会导致某些问题。但是如果有人想要我发布代码，当然可以。

期望的输出类型可能类似于这样:

side_a    end_point
  a          [b, c]
  b          [c]
  c          [c]
  k          [l, m]
  k          [l, n]
  l          [m]
  l          [n]
  p          [q, r, s]
  q          [r, s]
  r          [s]

还有一点需要注意，如果两边相同的话，那么这个点根本不需要列出来，我可以随后再添加，无论哪种方式更容易。

import pandas as pd
import numpy as np
import itertools

def get_child_list(df, parent_id):
    list_of_children = []
    list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
    for c_, r_ in df[df['side_a'] == parent_id].iterrows():
        if r_['side_b'] != parent_id:
            list_of_children.append(get_child_list(df, r_['side_b']))

    # to flatten the list 
    list_of_children =  [item for sublist in list_of_children for item in sublist]
    return list_of_children

new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
    temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
    temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
    temp_df['side_a'] = row['side_a']

    new_df = new_df.append(temp_df)

所以，这段代码的问题在于，如果我删除df2中side_a等于end_point的行，则可以运行。我不知道如何实现条件，即如果在side_b列中捕获df2，则停止，不再继续。任何帮助或提示在这里都受到欢迎，真的谢谢你们。提前致谢。

- jovicbg

1

你当然记得这不是一个“请帮我写代码”的网站吧？你能展示一下你的工作吗？你的代码出了什么问题？ - rsm

@rsm 正如我所说，我可以发布我的代码，但这会使帖子变得非常庞大，而且我不认为任何助手会使用它。你可以写一个评论，让我添加我的代码，我会添加的，只是不想显得傲慢。 - jovicbg

能否请您展示一下您的工作，并添加任何相关(!)的代码？并解释一下您遇到的问题？如果您希望我们为您的问题提出算法和实现 - 这不是本网站的工作方式。 - rsm

@jovicbg：df2 中有一个拼写错误：q 的 end_point 应该是 r，而不是 s。我可能有一个简单的解决方案来解决你的问题，但它在大型数据框上的性能不佳。你的数据框大约有多大？ - Qusai Alothman

也许我漏掉了什么，但 d 不是 a、b 和 c 的终点（或“后代”）吗？ - JohnE

显示剩余3条评论

3个回答

3

你的规则不一致，定义也不够清晰，因此可能需要在这里和那里添加一些约束条件，因为不清楚你究竟要求什么。通过将数据结构组织适应问题并构建更强大的遍历函数（如下所示），可以更容易地根据需要添加/编辑约束条件 - 并完全解决问题。

将 df 转换为 dict 以更好地表示树形结构

如果将数据结构转换为更符合问题直觉的形式，而不是试图在当前结构的背景下解决问题，则该问题会变得简单得多。

## Example dataframe
df = pd.DataFrame({'side_a':['a','b','c','k','l','l','p','q','r'],'side_b':['b','c','d','l','m','n','q','r','s']})

## Instantiate blank tree with every item
all_items = set(list(df['side_a']) + list(df['side_b']))
tree = {ii : set() for ii in all_items}

## Populate the tree with each row
for idx, row in df.iterrows():
    tree[row['side_a']] =  set(list(tree[row['side_a']]) + list(row['side_b']))

遍历树

现在数据结构很直观，所以这变得更加简单了。任何标准的深度优先搜索算法（带路径保存）都可以解决问题。我修改了链接中的算法来适应这个例子。

编辑：再次阅读时，看起来您在endpoint中有一个搜索终止条件（您需要在问题中更清楚地说明输入和输出是什么）。您可以调整dfs_path(tree,**target**, root)并更改终止条件，以仅返回正确的路径。

## Standard DFS pathfinder
def dfs_paths(tree, root):
    stack = [(root, [root])]
    while stack:
        (node, path) = stack.pop()
        for nextNode in tree[node] - set(path):
            # Termination condition. 
            ### I set it to terminate search at the end of each path.
            ### You can edit the termination condition to fit the 
            ### constraints of your goal
            if not tree[nextNode]:
                yield set(list(path) + list(nextNode)) - set(root)
            else:
                stack.append((nextNode, path + [nextNode]))

从我们生成的内容构建数据框

如果您对生成器不是非常熟悉，可以将DFS遍历结构化，使其输出为列表而不是生成器。

set_a = []
end_points = []
gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items]
for gen in gen_dict:
    for row in list(gen.values()).pop():
        set_a.append(list(gen.keys()).pop())
        end_points.append(row)
                      
## To dataframe
df_2 = pd.DataFrame({'set_a':set_a,'end_points':end_points}).sort_values('set_a')

输出

df_2[['set_a','end_points']]


set_a   end_points
a       {b, c, d}
b       {c, d}
c       {d}
k       {n, l}
k       {m, l}
l       {n}
l       {m}
p       {s, r, q}
q       {s, r}
r       {s}

- Brendan Frick

我会尽快尝试这个的。df2中的列end_point表示每个side_a的迭代应该停止的位置。因此，A的end_point不应包含D。在df2中，对于A来说，end_point是C。也许会有些困惑，因为我也将输出列命名为end_point。对此我很抱歉。 - jovicbg

我添加的注释应该可以解决这个问题。只需为端点添加一个参数，并使终止条件为nextNode == endPoint。 - Brendan Frick

这能用于数字数据类型，而不是字符串吗？ - jovicbg

我遇到了一个错误：KeyErrorTraceback（最近的调用最后） <ipython-input-45-29d3775fa7ea> in <module>() 3 gen_dict = [{ii:dfs_paths(tree,ii)} for ii in all_items] 4 for gen in gen_dict: ----> 5 for row in list(gen.values()).pop(): 6 set_a.append(list(gen.keys()).pop()) 7 end_points.append(row)KeyError：'1' - jovicbg

我刚刚运行了发布的完全相同的代码，并输出了正确的值。你是否在某个时候命名了一个变量为 list？ - Brendan Frick

不，我没有，现在我只是得到了这行代码的错误 '0' ----> for row in list(gen.values()).pop():。 - jovicbg

2

如果您可以接受一个额外的导入，那么这可以被视为在图上的路径问题，并且可以使用NetworkX在几行代码中解决：

import networkx

g = networkx.DiGraph(zip(df1.side_a, df1.side_b))

outdf = df2.apply(lambda row: [row.side_a, 
                               set().union(*networkx.all_simple_paths(g, row.side_a, row.end_point)) - {row.side_a}], 
                  axis=1)

outdf 的格式如下。请注意，该格式包含集合而不是列表，与您期望的输出格式不同 - 这样可以以简单的方式组合所有路径。

  side_a  end_point
0      a     {c, b}
1      b        {c}
2      c         {}
3      k     {l, m}
4      k     {l, n}
5      l        {m}
6      l        {n}
7      p  {r, q, s}
8      q     {r, s}
9      r        {s}

- chthonicdaemon

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Scott Boston · Accepted Answer

您可以使用networkx库和图形：

import networkx as nx
G = nx.from_pandas_edgelist(df, source='side_a',target='side_b')
df2.apply(lambda x: [nx.shortest_path(G, x.side_a,x.end_point)[0],
                     nx.shortest_path(G, x.side_a,x.end_point)[1:]], axis=1)

输出：

  side_a  end_point
0      a     [b, c]
1      b        [c]
2      c         []
3      k     [l, m]
4      k     [l, n]
5      l        [m]
6      l        [n]
7      p  [q, r, s]
8      q     [r, s]
9      r        [s]