How to draw a family tree from a Pandas DataFrame?

7

I have a table in which I store information about my ancestors. As a showcase, I created a similar table inspired by the Godfather movies.

  |--------+---+-------------+-----------+------+------+--------+--------+----------------+----------------|
  | ID     | S | First name  | Last name |  DoB |  DoD | FID    | MID    | Place of birth | Job            |
  |--------+---+-------------+-----------+------+------+--------+--------+----------------+----------------|
  | AnAn   | M | Antonio     | Andolini  |      | 1901 |        |        | Corleone       |                |
  | SiAn   | F | Signora     | Andolini  |      | 1901 |        |        | Corleone       | housewife      |
  | PaAn87 | M | Paolo       | Andolini  | 1887 | 1901 | AnAn   | SiAn   |                |                |
  | ViCo92 | M | Vito        | Corleone  | 1892 | 1954 | AnAn   | SiAn   | Corleone       | godfather      |
  | CaCo97 | F | Carmella    | Corleone  | 1897 | 1959 |        |        |                |                |
  | ToHa10 | M | Tom         | Hagen     | 1910 | 1970 | ViCo92 | CaCo97 | New York       | Consigliere    |
  | SaCo16 | M | Santino     | Corleone  | 1916 | 1948 | ViCo92 | CaCo97 | New York       | gangster       |
  | SaCo17 | F | Sandra      | Colombo   | 1917 |      |        |        | Messina        |                |
  | FrCo19 | M | Frederico   | Corleone  | 1919 | 1959 | ViCo92 | CaCo97 | New York       | Casino Manager |
  | MiCo20 | M | Michael     | Corleone  | 1920 | 1997 | ViCo92 | CaCo97 | New York       | godfather      |
  | ThHa20 | F | Theresa     | Hagen     | 1920 |      |        |        | New Jersey     | Art expert     |
  | LuMa23 | F | Lucy        | Mancini   | 1923 |      |        |        |                | Hotel employee |
  | KaAd24 | F | Kay         | Adams     | 1934 |      |        |        |                |                |
  | FrCo37 | F | Francessa   | Corleone  | 1937 |      | SaCo16 | SaCo17 |                |                |
  | KaCo37 | F | Kathryn     | Corleone  | 1937 |      | SaCo16 | SaCo17 |                |                |
  | FrCo40 | F | Frank       | Corleone  | 1940 |      | SaCo16 | SaCo17 |                |                |
  | SaCo45 | M | Santino Jr. | Corleone  | 1945 |      | SaCo16 | SaCo17 |                |                |
  | FrHa   | M | Frank       | Hagen     | 1940 |      | ToHa10 | Th20   |                |                |
  | AnHa42 | M | Andrew      | Hagen     | 1942 |      | ToHa10 | Th20   |                | Priest         |
  | ViMa   | M | Vincent     | Mancini   | 1948 |      | SaCo16 | LuMa23 | New York       | Godfather      |
  | GiHa58 | F | Gianna      | Hagen     | 1948 |      | ToHa10 | Th20   |                |                |
  | AnCo51 | M | Anthony     | Corleone  | 1951 |      | MiCo20 | KaAd24 | New York       | Singer         |
  | MaCo53 | F | Mary        | Corleone  | 1953 | 1979 | MiCo20 | KaAd24 | New York       | Student        |
  | ChHa54 | F | Christina   | Hagen     | 1954 |      | ToHa10 | Th20   |                |                |
  | CoCo27 | F | Constanzia  | Corleone  | 1927 |      | ViCo92 | CaCo97 | New York       | rentier        |
  | CaRi20 | M | Carlo       | Rizzi     | 1920 | 1955 |        |        | Nevada         | Bookmaker      |
  | ViRi49 | M | Victor      | Rizzi     | 1949 |      | CaRi20 | CoCo27 | New York       |                |
  | MiRi   | M | Michael     | Rizzi     | 1955 |      | CaRi20 | CoCo27 |                |                |
  |--------+---+-------------+-----------+------+------+--------+--------+----------------+----------------|

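Here, the relationships between individuals can be understood as a directed acyclic graph (DAG). My goal is to visualize this table as a family tree using graph drawing.

First, I transform the table into an edge list, with ID as the start and ParentID as the end: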

import pandas as pd
rawdf = pd.read_csv('corleone.csv')
el1 = rawdf[['ID','MID']]
el2 = rawdf[['ID','FID']]
el1.columns = ['Child', 'ParentID']
el2.columns = el1.columns
el = pd.concat([el1, el2])
el = el.dropna()
df = el.merge(rawdf, left_index=True, right_index=True, how='left')
df['name'] = df[df.columns[4:6]].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)
df = df.drop(['Child','FID', 'MID', 'First name', 'Last name'], axis=1)
df = df[['ID', 'name', 'S', 'DoB', 'DoD', 'Place of birth', 'Job', 'ParentID']]

Here is the resulting data frame:

|--------+----------------------+---+--------+--------+----------------+----------------+----------|
| ID     | name                 | S |    DoB |    DoD | Place of birth | Job            | ParentID |
|--------+----------------------+---+--------+--------+----------------+----------------+----------|
| PaAn87 | Paolo Andolini       | M | 1887.0 | 1901.0 | NaN            | NaN            | SiAn     |
| PaAn87 | Paolo Andolini       | M | 1887.0 | 1901.0 | NaN            | NaN            | AnAn     |
| ViCo92 | Vito Corleone        | M | 1892.0 | 1954.0 | Corleone       | godfather      | SiAn     |
| ViCo92 | Vito Corleone        | M | 1892.0 | 1954.0 | Corleone       | godfather      | AnAn     |
| ToHa10 | Tom Hagen            | M | 1910.0 | 1970.0 | New York       | Consigliere    | CaCo97   |
| ToHa10 | Tom Hagen            | M | 1910.0 | 1970.0 | New York       | Consigliere    | ViCo92   |
| SaCo16 | Santino Corleone     | M | 1916.0 | 1948.0 | New York       | gangster       | CaCo97   |
| SaCo16 | Santino Corleone     | M | 1916.0 | 1948.0 | New York       | gangster       | ViCo92   |
| FrCo19 | Frederico Corleone   | M | 1919.0 | 1959.0 | New York       | Casino Manager | CaCo97   |
| FrCo19 | Frederico Corleone   | M | 1919.0 | 1959.0 | New York       | Casino Manager | ViCo92   |
| MiCo20 | Michael Corleone     | M | 1920.0 | 1997.0 | New York       | godfather      | CaCo97   |
| MiCo20 | Michael Corleone     | M | 1920.0 | 1997.0 | New York       | godfather      | ViCo92   |
| FrCo37 | Francessa Corleone   | F | 1937.0 |    NaN | NaN            | NaN            | SaCo17   |
| FrCo37 | Francessa Corleone   | F | 1937.0 |    NaN | NaN            | NaN            | SaCo16   |
| KaCo37 | Kathryn Corleone     | F | 1937.0 |    NaN | NaN            | NaN            | SaCo17   |
| KaCo37 | Kathryn Corleone     | F | 1937.0 |    NaN | NaN            | NaN            | SaCo16   |
| FrCo40 | Frank Corleone       | F | 1940.0 |    NaN | NaN            | NaN            | SaCo17   |
| FrCo40 | Frank Corleone       | F | 1940.0 |    NaN | NaN            | NaN            | SaCo16   |
| SaCo45 | Santino Jr. Corleone | M | 1945.0 |    NaN | NaN            | NaN            | SaCo17   |
| SaCo45 | Santino Jr. Corleone | M | 1945.0 |    NaN | NaN            | NaN            | SaCo16   |
| FrHa   | Frank Hagen          | M | 1940.0 |    NaN | NaN            | NaN            | Th20     |
| FrHa   | Frank Hagen          | M | 1940.0 |    NaN | NaN            | NaN            | ToHa10   |
| AnHa42 | Andrew Hagen         | M | 1942.0 |    NaN | NaN            | Priest         | Th20     |
| AnHa42 | Andrew Hagen         | M | 1942.0 |    NaN | NaN            | Priest         | ToHa10   |
| ViMa   | Vincent Mancini      | M | 1948.0 |    NaN | New York       | Godfather      | LuMa23   |
| ViMa   | Vincent Mancini      | M | 1948.0 |    NaN | New York       | Godfather      | SaCo16   |
| GiHa58 | Gianna Hagen         | F | 1948.0 |    NaN | NaN            | NaN            | Th20     |
| GiHa58 | Gianna Hagen         | F | 1948.0 |    NaN | NaN            | NaN            | ToHa10   |
| AnCo51 | Anthony Corleone     | M | 1951.0 |    NaN | New York       | Singer         | KaAd24   |
| AnCo51 | Anthony Corleone     | M | 1951.0 |    NaN | New York       | Singer         | MiCo20   |
| MaCo53 | Mary Corleone        | F | 1953.0 | 1979.0 | New York       | Student        | KaAd24   |
| MaCo53 | Mary Corleone        | F | 1953.0 | 1979.0 | New York       | Student        | MiCo20   |
| ChHa54 | Christina Hagen      | F | 1954.0 |    NaN | NaN            | NaN            | Th20     |
| ChHa54 | Christina Hagen      | F | 1954.0 |    NaN | NaN            | NaN            | ToHa10   |
| CoCo27 | Constanzia Corleone  | F | 1927.0 |    NaN | New York       | rentier        | CaCo97   |
| CoCo27 | Constanzia Corleone  | F | 1927.0 |    NaN | New York       | rentier        | ViCo92   |
| ViRi49 | Victor Rizzi         | M | 1949.0 |    NaN | New York       | NaN            | CoCo27   |
| ViRi49 | Victor Rizzi         | M | 1949.0 |    NaN | New York       | NaN            | CaRi20   |
| MiRi   | Michael Rizzi        | M | 1955.0 |    NaN | NaN            | NaN            | CoCo27   |
| MiRi   | Michael Rizzi        | M | 1955.0 |    NaN | NaN            | NaN            | CaRi20   |
|--------+----------------------+---+--------+--------+----------------+----------------+----------|
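One detail in the table above is worth checking before drawing: the ParentID value Th20 does not appear in the ID column (the mother's row is ThHa20), and Graphviz creates a node implicitly for any name used in an edge, so Th20 would show up as a separate, unlabeled box. Below is a minimal sketch of a check for such dangling references. It assumes the rows are plain dicts with ID, FID and MID keys and that missing parents are empty strings (for example, a CSV read with keep_default_na=False); the helper name dangling_parents and the sample rows are only illustrative.

```python
# Hypothetical sample: ID/FID/MID values copied from a few rows of the table.
rows = [
    {"ID": "AnAn", "FID": "", "MID": ""},
    {"ID": "SiAn", "FID": "", "MID": ""},
    {"ID": "ViCo92", "FID": "AnAn", "MID": "SiAn"},
    {"ID": "CaCo97", "FID": "", "MID": ""},
    {"ID": "ToHa10", "FID": "ViCo92", "MID": "CaCo97"},
    {"ID": "ThHa20", "FID": "", "MID": ""},
    {"ID": "FrHa", "FID": "ToHa10", "MID": "Th20"},
]


def dangling_parents(rows):
    """Return the sorted parent IDs that are referenced but have no row of their own."""
    known = {r["ID"] for r in rows}
    # Collect every non-empty father/mother reference.
    referenced = {r[col] for r in rows for col in ("FID", "MID") if r[col]}
    return sorted(referenced - known)


missing = dangling_parents(rows)
print(missing)
```

With a DataFrame, the same function could be fed rawdf.to_dict('records'), provided empty cells are strings rather than NaN.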

Then I use graphviz to generate a directed acyclic graph (DAG):

from graphviz import Digraph
f = Digraph('neato', format='pdf', encoding='utf8', filename='corleone', node_attr={'color': 'lightblue2', 'style': 'filled'})
f.attr('node', shape='box')
for index, row in df.iterrows():
    f.edge(str(row["ParentID"]), str(row["ID"]), label='')
f.view()

This is what it looks like: [image]

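The problem I face is that many aspects need to be changed, for example:

  • Use different colors for males and females
  • Show names instead of IDs
  • Make the arrows look like the connectors in a family tree
  • Be able to add other information to each box, such as date of birth, date of death, etc.

I don't know whether these changes can be made with Graphviz (I couldn't find them in the documentation). If not, I would be interested in ideas on how to achieve them.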


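Somehow, f.view() throws an exception for me... - Quang Hoang
I edited the code to add the missing indentation after the for loop. Does it work now? If not, what is the exception message? - crocefisso
The exception says it expects a string or bytes object. Also, you can use f.node... to add the nodes first. - Quang Hoang
When I run the code I get no error; I'm using Python 3.8.5. What do you mean by adding nodes first with f.node? - crocefisso
I see that f.node lets you add nodes with attributes and labels. You can iterate over the data and add the nodes to the graph before adding the edges, but that doesn't let you customize the nodes the way you asked. - Quang Hoang
Sorry, I don't understand... Could you post some code? - crocefisso
2 Answers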

4

I improved the plot, but it still isn't what I expected. So here is the code, with some explanation of the changes.

  • Keep empty cells empty instead of NaN
    • keep_default_na=False
  • Replace each empty value in ParentID with a placeholder string:
    • el.replace('', np.nan, regex=True, inplace = True)
    • t = pd.DataFrame({'tmp':['no_entry'+str(i) for i in range(el.shape[0])]})
    • el['ParentID'].fillna(t['tmp'], inplace=True)
import pandas as pd
import numpy as np
rawdf = pd.read_csv('corleone.csv', keep_default_na=False)
el1 = rawdf[['ID','MID']]
el2 = rawdf[['ID','FID']]
el1.columns = ['Child', 'ParentID']
el2.columns = el1.columns
el = pd.concat([el1, el2])
el.replace('', np.nan, regex=True, inplace = True)
t = pd.DataFrame({'tmp':['no_entry'+str(i) for i in range(el.shape[0])]})
el['ParentID'].fillna(t['tmp'], inplace=True)
df = el.merge(rawdf, left_index=True, right_index=True, how='left')
df['name'] = df[df.columns[4:6]].apply(lambda x: ' '.join(x.dropna().astype(str)),axis=1)
df = df.drop(['Child','FID', 'MID', 'First name', 'Last name'], axis=1)
df = df[['ID', 'name', 'S', 'DoB', 'DoD', 'Place of birth', 'Job', 'ParentID']]
  • Merge edges that share the same start and end nodes, and draw the edges with right angles
    • graph_attr={"concentrate": "true", "splines":"ortho"})
  • Show node information: name, job, DoB, place of birth, DoD
    • label=...
  • Set the node color according to sex
    • _attributes={'color':'lightpink' if row['S']=='F' else 'lightblue'if row['S']=='M' else 'lightgray'}
from graphviz import Digraph
f = Digraph('neato', format='jpg', encoding='utf8', filename='corleone', node_attr={'style': 'filled'},  graph_attr={"concentrate": "true", "splines":"ortho"})
f.attr('node', shape='box')
for index, row in df.iterrows():
    f.node(row['ID'],
           label=
             row['name']
              + '\n' + 
             row['Job'] 
             + '\n'+ 
             row['DoB'] 
             + '\n' + 
             row['Place of birth']
             + '\n†' + 
             row['DoD'],
           _attributes={'color':'lightpink' if row['S']=='F' else 'lightblue'if row['S']=='M' else 'lightgray'})
for index, row in df.iterrows():
    f.edge(str(row["ParentID"]), str(row["ID"]), label='')  
f.view()

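The result looks like this: [image: Famiglia Corleone]. This looks much better. However, two major flaws remain:
  1. The edges between parents and children are all drawn separately; they should look like this: [image]
  2. I couldn't get rid of the unnecessary line breaks and death symbols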
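For the second flaw, one possible approach is to build the label from only the fields that are actually filled in, instead of concatenating every field unconditionally. The sketch below assumes, as in the code above, that missing values are empty strings (keep_default_na=False); build_label is a hypothetical helper name, and the two sample rows are copied from the table in the question.

```python
def build_label(row):
    """Join the non-empty fields with line breaks; add the dagger only if DoD is set."""
    fields = [row["name"], row["Job"], row["DoB"], row["Place of birth"]]
    # Skip empty fields so they do not leave blank lines in the box.
    lines = [str(value) for value in fields if value != ""]
    if row["DoD"] != "":
        lines.append("†" + str(row["DoD"]))
    return "\n".join(lines)


# Sample rows with values taken from the question's table.
vito = {"name": "Vito Corleone", "Job": "godfather", "DoB": "1892",
        "Place of birth": "Corleone", "DoD": "1954"}
kay = {"name": "Kay Adams", "Job": "", "DoB": "1934",
       "Place of birth": "", "DoD": ""}

print(build_label(vito))
print(build_label(kay))
```

In the node loop above, label=build_label(row) would then take the place of the chained string concatenation.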

1

Here's what I mean:

from graphviz import Digraph

f = Digraph('neato', format='pdf', encoding='utf8',
            filename='corleone', node_attr={'color': 'lightblue2', 'style': 'filled'})
f.attr('node', shape='box')

# create all the possible nodes first
# you can modify the `label`
# iterate over rawdf (one row per person), since `el` has no ID/name columns
for index, row in rawdf.iterrows():
    f.node(row['ID'],label=row['First name'] + ' '+ row['Last name'], 
           _attributes={'color':'red' if row['S']=='M' else 'lightblue2'}
          )

for index, row in df.iterrows():
    f.edge(str(row["ParentID"]), str(row["ID"]), label='')

    
f.view()

I was able to get something like this. You can modify it further:

[image]
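One further modification that might give family-tree style connectors is to route each couple's edges through a small shared "union" node, so that both parents join at one point and the children branch from it. The sketch below only generates DOT text with the standard library. The function name family_dot, the union0, union1, ... node names and the sample rows are illustrative assumptions, and the resulting layout has not been checked against the full table. It expects one dict per person (unique IDs, empty strings for unknown parents).

```python
from collections import defaultdict


def family_dot(people):
    """Return DOT source in which each parent pair shares one 'union' point node."""
    children = defaultdict(list)
    for p in people:
        if p["FID"] or p["MID"]:
            children[(p["FID"], p["MID"])].append(p["ID"])

    lines = ["digraph family {",
             "  splines=ortho;",
             "  node [shape=box, style=filled];",
             "  edge [dir=none];"]  # plain lines instead of arrows
    for p in people:
        color = {"M": "lightblue", "F": "lightpink"}.get(p["S"], "lightgray")
        lines.append(f'  "{p["ID"]}" [label="{p["name"]}", fillcolor={color}];')
    for i, ((father, mother), kids) in enumerate(children.items()):
        union = f"union{i}"
        lines.append(f'  "{union}" [shape=point, width=0.05, label=""];')
        for parent in (father, mother):
            if parent:  # skip an unknown parent
                lines.append(f'  "{parent}" -> "{union}";')
        for kid in kids:
            lines.append(f'  "{union}" -> "{kid}";')
    lines.append("}")
    return "\n".join(lines)


# Sample rows with values taken from the question's table.
people = [
    {"ID": "ViCo92", "name": "Vito Corleone", "S": "M", "FID": "", "MID": ""},
    {"ID": "CaCo97", "name": "Carmella Corleone", "S": "F", "FID": "", "MID": ""},
    {"ID": "SaCo16", "name": "Santino Corleone", "S": "M", "FID": "ViCo92", "MID": "CaCo97"},
    {"ID": "MiCo20", "name": "Michael Corleone", "S": "M", "FID": "ViCo92", "MID": "CaCo97"},
]
dot_text = family_dot(people)
print(dot_text)
```

The text can be saved to a file and rendered with the dot command line tool (for example dot -Tpdf family.dot -o family.pdf), or passed to graphviz.Source in the Python package.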


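Thanks, by modifying your code I was able to specify the colors and display most of the names (the names of the root nodes are still displayed as IDs). See https://i.ibb.co/B4mwZ6R/corleone.jpg. The code: for index, row in df.iterrows(): f.node(row['ID'],label=row['name'], _attributes={'color':'lightpink' if row['S']=='F' else 'lightblue'}). - crocefisso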
