Python - Creating a hierarchical file (finding the paths from root to leaves in a tree represented as a table)

Question

6

Given the following unordered tab-delimited file:

Asia    Srilanka
Srilanka    Colombo
Continents  Europe
India   Mumbai
India   Pune
Continents  Asia
Earth   Continents
Asia    India

The goal is to generate the following (tab-delimited) output:

Earth   Continents  Asia    India   Mumbai
Earth   Continents  Asia    India   Pune
Earth   Continents  Asia    Srilanka    Colombo
Earth   Continents  Europe

I have created the following script to achieve this:

root = {}  # will finally contain only the ROOT member from which all the nodes emanate
link = {}  # maps each node to a dict of its immediate children
# `f` is assumed to be the already opened input file
for line in f:
    line = line.rstrip('\r\n')
    line = line.strip()
    cols = line.split('\t')
    parent = cols[0]
    child = cols[1]
    if parent not in link:
        root[parent] = 1
    if child in root:
        del root[child]
    if child not in link:
        link[child] = {}
    if parent not in link:
        link[parent] = {}
    link[parent][child] = 1

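I now intend to use the two dictionaries created above (root and link) to print the desired output. I am not sure how to do this in Python, but I know the following Perl code would achieve the result: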

print_links($_) for sort keys %root;

sub print_links
{
  my @path = @_;

  my %children = %{$link{$path[-1]}};
  if (%children)
  {
    print_links(@path, $_) for sort keys %children;
  } 
  else 
  {
    say join "\t", @path;
  }
}

Can you help me produce the desired output in Python 3.x?

- Sachin S

3 Answers

1

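We can accomplish this task in a few simple steps:

Step 1: Convert the data into a DataFrame.
Step 2: Take the unique elements of the first column that do not appear in the second column.
Step 3: Put those unique elements into a DataFrame of their own.
Step 4: Merge the DataFrames with pd.merge(), using the DataFrame of unique elements as the left frame and the main data from step 1 as the right frame.
Step 5: Drop duplicates across all columns.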

- sharuk khan

0

Prerequisites:

The data should be in the form of a DataFrame,
and it must have exactly two columns.


# import library
import pandas as pd


# now we are going to create the function
def root_to_leaves(data):
    # Take the names of first and second columns.
    first_column_name = data.columns[0]
    second_column_name = data.columns[1]
    # Take the unique elements of column 1 which are not in column 2.
    # We use the set difference operation.
    A = set(data[first_column_name])
    B = set(data[second_column_name])
    C = list(A - B)
    # m0 is just a variable name; it holds the root elements.
    m0 = pd.DataFrame({'stage_1': C})
    # first merge data
    data = data.rename(columns ={first_column_name:'stage_1',second_column_name:'stage_2'})
    m1 = pd.merge(m0, data , on = 'stage_1', how = 'left')
    data = data.rename(columns = {'stage_1':'stage_2','stage_2':'stage_3'})
    # count of nan
    count_of_nan = 0
    i = 0
    while (count_of_nan != m1.shape[0]):
        on_variable = "stage_"+str(i+2)
        m2 = pd.merge(m1, data , on = on_variable, how = 'left')
        data = data.rename(columns = {'stage_'+str(i+2)+'':'stage_'+str(i+3)+'','stage_'+str(i+3)+'':'stage_'+str(i+4)+''})
        m1 = m2
        i = i + 1
        count_of_nan = m1.iloc[:,-1].isnull().sum()
    final_data = m1.iloc[:,:-1]
    return final_data

# you can find the result in the data_result
data_result = root_to_leaves(data)

- sharuk khan

- Azat Ibrakov · Accepted Answer

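I see the following sub-problems here:

reading the relations from a file;
building the hierarchy from the relations;
writing the hierarchy to a file.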

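Assuming the height of the hierarchy tree is less than the default recursion limit (equal to 1000 in most cases), let's define utility functions for these separate tasks.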

Parsing of relations can be done with

def parse_relations(lines):
    relations = {}
    splitted_lines = (line.split() for line in lines)
    for parent, child in splitted_lines:
        relations.setdefault(parent, []).append(child)
    return relations
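
One caveat about parse_relations as written: line.split() splits on any whitespace, so a name containing a space (say 'Sri Lanka') would yield three fields and break the two-name unpacking, and a blank line would fail the same way. A minimal sketch of a tab-only variant follows; the name parse_relations_tsv and the sample lines are made up for illustration, and it assumes every non-blank line holds exactly two tab-separated fields.

```python
def parse_relations_tsv(lines):
    """Map each parent to the list of its children (tab-separated input)."""
    relations = {}
    for line in lines:
        line = line.rstrip('\r\n')
        if not line.strip():
            continue  # skip blank lines
        # split on tabs only, so names containing spaces stay intact
        parent, child = (field.strip() for field in line.split('\t'))
        relations.setdefault(parent, []).append(child)
    return relations


# hypothetical sample input, including a blank line and names with spaces
sample = ["Asia\tSri Lanka\n", "\n", "Sri Lanka\tColombo\n"]
relations = parse_relations_tsv(sample)
```

A line with more or fewer than two fields still raises ValueError here, which is arguably preferable to silently building a wrong tree.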

Building hierarchy can be done with

Python >=3.5

def flatten_hierarchy(relations, parent='Earth'):
    try:
        children = relations[parent]
        for child in children:
            sub_hierarchy = flatten_hierarchy(relations, child)
            for element in sub_hierarchy:
                try:
                    yield (parent, *element)
                except TypeError:
                    # we've tried to unpack `None` value,
                    # it means that no successors left
                    yield (parent, child)
    except KeyError:
        # we've reached end of hierarchy
        yield None

Python <3.5: the unpacking syntax used above, (parent, *element), was added in Python 3.5 by PEP 448, but it can be replaced with itertools.chain like

import itertools


def flatten_hierarchy(relations, parent='Earth'):
    try:
        children = relations[parent]
        for child in children:
            sub_hierarchy = flatten_hierarchy(relations, child)
            for element in sub_hierarchy:
                try:
                    yield tuple(itertools.chain([parent], element))
                except TypeError:
                    # we've tried to unpack `None` value,
                    # it means that no successors left
                    yield (parent, child)
    except KeyError:
        # we've reached end of hierarchy
        yield None
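
Both generator variants above recurse once per level of the tree, which is why the recursion limit assumption is needed. If the tree might be deeper than that, one possible alternative is to keep the pending paths on an explicit stack. The sketch below is not part of the original answer; flatten_hierarchy_iterative is a hypothetical name, and it assumes the relations contain no cycles (a cycle would make it loop forever).

```python
def flatten_hierarchy_iterative(relations, root='Earth'):
    """Yield every root-to-leaf path as a tuple, without recursion."""
    stack = [(root,)]  # each entry is the path from the root to some node
    while stack:
        path = stack.pop()
        children = relations.get(path[-1], [])
        if not children:
            # nothing hangs below this node, so the path is complete
            yield path
            continue
        # push in reverse so children are visited in their original order
        for child in reversed(children):
            stack.append(path + (child,))


relations = {
    'Asia': ['Srilanka', 'India'],
    'Srilanka': ['Colombo'],
    'Continents': ['Europe', 'Asia'],
    'India': ['Mumbai', 'Pune'],
    'Earth': ['Continents'],
}
paths = list(flatten_hierarchy_iterative(relations))
```

Unlike the recursive versions, a root with no children yields the one-element tuple (root,) here instead of None.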

Hierarchy export to file can be done with

def write_hierarchy(hierarchy, path, delimiter='\t'):
    with open(path, mode='w') as file:
        for row in hierarchy:
            file.write(delimiter.join(row) + '\n')

Usage

Assuming the file path is 'relations.txt':

with open('relations.txt') as file:
    relations = parse_relations(file)

gives us

>>> relations
{'Asia': ['Srilanka', 'India'],
 'Srilanka': ['Colombo'],
 'Continents': ['Europe', 'Asia'],
 'India': ['Mumbai', 'Pune'],
 'Earth': ['Continents']}

Our hierarchy is

>>> list(flatten_hierarchy(relations))
[('Earth', 'Continents', 'Europe'),
 ('Earth', 'Continents', 'Asia', 'Srilanka', 'Colombo'),
 ('Earth', 'Continents', 'Asia', 'India', 'Mumbai'),
 ('Earth', 'Continents', 'Asia', 'India', 'Pune')]

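Finally, we export it to a file named 'hierarchy.txt':

>>> hierarchy = list(flatten_hierarchy(relations))
>>> write_hierarchy(sorted(hierarchy), 'hierarchy.txt')

We use sorted so that the rows come out in the same order as in the desired output file.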

P. S.

If you are not familiar with Python generators, we can define the flatten_hierarchy function as follows:

Python >= 3.5

def flatten_hierarchy(relations, parent='Earth'):
    try:
        children = relations[parent]
    except KeyError:
        # we've reached end of hierarchy
        return None
    result = []
    for child in children:
        sub_hierarchy = flatten_hierarchy(relations, child)
        try:
            for element in sub_hierarchy:
                result.append((parent, *element))
        except TypeError:
            # we've tried to iterate through `None` value,
            # it means that no successors left
            result.append((parent, child))
    return result

Python < 3.5

import itertools


def flatten_hierarchy(relations, parent='Earth'):
    try:
        children = relations[parent]
    except KeyError:
        # we've reached end of hierarchy
        return None
    result = []
    for child in children:
        sub_hierarchy = flatten_hierarchy(relations, child)
        try:
            for element in sub_hierarchy:
                result.append(tuple(itertools.chain([parent], element)))
        except TypeError:
            # we've tried to iterate through `None` value,
            # it means that no successors left
            result.append((parent, child))
    return result
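
For comparison, the Perl routine from the question can also be ported fairly directly. The sketch below is an illustration rather than part of the original answer: build_links rebuilds the question's root and link dictionaries, and collect_links (a hypothetical stand-in for print_links that gathers rows in a list instead of printing them) walks them with children in sorted order. It assumes tab-separated input with exactly two fields per line and a tree shallow enough for recursion.

```python
def build_links(lines):
    """Rebuild the question's `root` and `link` dictionaries."""
    root, link = {}, {}
    for line in lines:
        parent, child = line.strip().split('\t')
        if parent not in link:
            root[parent] = 1       # not seen before: a root candidate
        root.pop(child, None)      # a node that has a parent is not a root
        link.setdefault(child, {})
        link.setdefault(parent, {})[child] = 1
    return root, link


def collect_links(link, path, out):
    """Append tab-joined root-to-leaf paths to `out`, children sorted."""
    children = link[path[-1]]
    if children:
        for child in sorted(children):
            collect_links(link, path + [child], out)
    else:
        out.append('\t'.join(path))


lines = [
    "Asia\tSrilanka", "Srilanka\tColombo", "Continents\tEurope",
    "India\tMumbai", "India\tPune", "Continents\tAsia",
    "Earth\tContinents", "Asia\tIndia",
]
root, link = build_links(lines)
rows = []
for name in sorted(root):
    collect_links(link, [name], rows)
print('\n'.join(rows))
```

Because the children are visited in sorted order at every level, the rows come out in the order of the desired output without a separate sorting step.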