Python - Creating a hierarchy file (finding root-to-leaf paths in a tree represented as a table)

6

Given the following unordered tab-separated file:

Asia    Srilanka
Srilanka    Colombo
Continents  Europe
India   Mumbai
India   Pune
Continents  Asia
Earth   Continents
Asia    India

The goal is to produce the following (tab-separated) output:
Earth   Continents  Asia    India   Mumbai
Earth   Continents  Asia    India   Pune
Earth   Continents  Asia    Srilanka    Colombo
Earth   Continents  Europe

I have written the following script toward that goal:
root = {}  # will finally contain only the ROOT member(s) from which all the nodes emanate
link = {}  # maps each node to a dict of its immediate children
# `f` is assumed to be the already-opened tab-separated input file
for line in f:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    cols = line.split('\t')
    parent = cols[0]
    child = cols[1]
    if parent not in link:
        root[parent] = 1
    if child in root:
        del root[child]
    if child not in link:
        link[child] = {}
    if parent not in link:
        link[parent] = {}
    link[parent][child] = 1

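Now I intend to use the two dictionaries created above (root and link) to print the desired output. I'm not sure how to do this in Python, but I know that the following code achieves the result in Perl: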

print_links($_) for sort keys %root;

sub print_links
{
  my @path = @_;

  my %children = %{$link{$path[-1]}};
  if (%children)
  {
    print_links(@path, $_) for sort keys %children;
  } 
  else 
  {
    say join "\t", @path;
  }
}

Can you help me produce the required output in Python 3.x?

3 Answers

7
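I see the following subtasks here:
  • reading relations from a file;
  • building a hierarchy from the relations;
  • writing the hierarchy to a file.
Assuming that the height of the hierarchy tree is less than the default recursion limit (equal to 1000 in most cases), let's define utility functions for these separate tasks.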
  1. Parsing of relations can be done with

    def parse_relations(lines):
        relations = {}
        splitted_lines = (line.split() for line in lines)
        for parent, child in splitted_lines:
            relations.setdefault(parent, []).append(child)
        return relations
    
  2. Building hierarchy can be done with

    • Python >=3.5

      def flatten_hierarchy(relations, parent='Earth'):
          try:
              children = relations[parent]
              for child in children:
                  sub_hierarchy = flatten_hierarchy(relations, child)
                  for element in sub_hierarchy:
                      try:
                          yield (parent, *element)
                      except TypeError:
                          # we've tried to unpack `None` value,
                          # it means that no successors left
                          yield (parent, child)
          except KeyError:
              # we've reached end of hierarchy
              yield None
      
    • Python <3.5: unpacking inside a tuple display (the `*element` syntax used above) was added with PEP 448, but it can be replaced with itertools.chain like

      import itertools
      
      
      def flatten_hierarchy(relations, parent='Earth'):
          try:
              children = relations[parent]
              for child in children:
                  sub_hierarchy = flatten_hierarchy(relations, child)
                  for element in sub_hierarchy:
                      try:
                          yield tuple(itertools.chain([parent], element))
                      except TypeError:
                          # we've tried to unpack `None` value,
                          # it means that no successors left
                          yield (parent, child)
          except KeyError:
              # we've reached end of hierarchy
              yield None
      
  3. Hierarchy export to file can be done with

    def write_hierarchy(hierarchy, path, delimiter='\t'):
        with open(path, mode='w') as file:
            for row in hierarchy:
                file.write(delimiter.join(row) + '\n')
    

Usage

Assuming the file path is 'relations.txt':

with open('relations.txt') as file:
    relations = parse_relations(file)

gives us

>>> relations
{'Asia': ['Srilanka', 'India'],
 'Srilanka': ['Colombo'],
 'Continents': ['Europe', 'Asia'],
 'India': ['Mumbai', 'Pune'],
 'Earth': ['Continents']}

Our hierarchy is

>>> list(flatten_hierarchy(relations))
[('Earth', 'Continents', 'Europe'),
 ('Earth', 'Continents', 'Asia', 'Srilanka', 'Colombo'),
 ('Earth', 'Continents', 'Asia', 'India', 'Mumbai'),
 ('Earth', 'Continents', 'Asia', 'India', 'Pune')]

Finally, export it to a file named 'hierarchy.txt':
>>> hierarchy = flatten_hierarchy(relations)
>>> write_hierarchy(sorted(hierarchy), 'hierarchy.txt')

We use sorted to get the rows in the same order as in the desired output file.
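
For comparison, the Perl print_links routine from the question can be transliterated fairly directly. The following is only a minimal sketch: the name iter_paths is made up for illustration, and the root and link dictionaries are hard-coded here in the shape the question's loop is assumed to produce for the sample data, rather than being read from a file.

```python
def iter_paths(link, path):
    """Yield every root-to-leaf path that starts with `path` (a list of names)."""
    children = link.get(path[-1], {})
    if children:
        # like `sort keys %children` in the Perl version
        for child in sorted(children):
            yield from iter_paths(link, path + [child])
    else:
        # no children: `path` is a complete root-to-leaf path
        yield path


# Assumed sample data, shaped like the question's `root` and `link` dicts.
root = {'Earth': 1}
link = {
    'Earth': {'Continents': 1},
    'Continents': {'Europe': 1, 'Asia': 1},
    'Asia': {'Srilanka': 1, 'India': 1},
    'India': {'Mumbai': 1, 'Pune': 1},
    'Srilanka': {'Colombo': 1},
    'Europe': {}, 'Mumbai': {}, 'Pune': {}, 'Colombo': {},
}

paths = [p for start in sorted(root) for p in iter_paths(link, [start])]
for p in paths:
    print('\t'.join(p))
```

Because the children are visited in sorted order at every level, the rows come out already ordered and no final sorted call is needed. Note that yield from requires Python 3.3 or later.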

P. S.

If you are not familiar with Python generators, we can define the flatten_hierarchy function as follows:

  • Python >= 3.5

    def flatten_hierarchy(relations, parent='Earth'):
        try:
            children = relations[parent]
        except KeyError:
            # we've reached end of hierarchy
            return None
        result = []
        for child in children:
            sub_hierarchy = flatten_hierarchy(relations, child)
            try:
                for element in sub_hierarchy:
                    result.append((parent, *element))
            except TypeError:
                # we've tried to iterate through `None` value,
                # it means that no successors left
                result.append((parent, child))
        return result
    
  • Python < 3.5

    import itertools
    
    
    def flatten_hierarchy(relations, parent='Earth'):
        try:
            children = relations[parent]
        except KeyError:
            # we've reached end of hierarchy
            return None
        result = []
        for child in children:
            sub_hierarchy = flatten_hierarchy(relations, child)
            try:
                for element in sub_hierarchy:
                    result.append(tuple(itertools.chain([parent], element)))
            except TypeError:
                # we've tried to iterate through `None` value,
                # it means that no successors left
                result.append((parent, child))
        return result
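
If the recursion-limit assumption stated earlier might not hold for a very deep tree, one possible alternative is to walk the tree with an explicit stack instead of recursive calls. This is a sketch only, not part of the original answer: the name flatten_hierarchy_iterative is hypothetical, and relations is hard-coded to the sample dictionary shown in the usage section.

```python
def flatten_hierarchy_iterative(relations, root='Earth'):
    """Return all root-to-leaf paths as tuples without using recursion."""
    paths = []
    stack = [(root,)]  # each entry is a partial path starting at the root
    while stack:
        path = stack.pop()
        children = relations.get(path[-1], [])
        if not children:
            # no successors left: the path is complete
            paths.append(path)
            continue
        # push in reverse so children are popped in their stored order
        for child in reversed(children):
            stack.append(path + (child,))
    return paths


# Sample data copied from the usage section.
relations = {'Asia': ['Srilanka', 'India'],
             'Srilanka': ['Colombo'],
             'Continents': ['Europe', 'Asia'],
             'India': ['Mumbai', 'Pune'],
             'Earth': ['Continents']}

hierarchy = flatten_hierarchy_iterative(relations)
for row in sorted(hierarchy):
    print('\t'.join(row))
```

One behavioural difference from the recursive versions: a root that has no children at all is returned here as a one-element path rather than as None.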
    

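Thanks. I found that this line raises an error: result.append((parent, *element)) SyntaxError: can use starred expression only as assignment target. - Sachin S
Excellent solution. For a Python newbie like me, your solution offers a lot to learn from and pay attention to. - Sachin S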

1

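This task can be done in a few simple steps:

  • Step 1: convert the data into a DataFrame.
  • Step 2: take the unique elements of the first column that do not appear in the second column.
  • Step 3: convert those unique elements into a DataFrame of their own.
  • Step 4: merge the DataFrames with pd.merge(), using the DataFrame of unique first-column elements as the left frame and the main data from step 1 as the right frame.
  • Step 5: drop duplicates across all columns.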
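
Step 2 is what identifies the root of the tree: a name that occurs as a parent but never as a child. As a rough, dependency-free illustration of that one step, it can be sketched with plain Python sets instead of a DataFrame (a substitution made here for brevity). The pairs list is sample data copied from the question.

```python
# (parent, child) pairs from the question's sample file
pairs = [
    ('Asia', 'Srilanka'),
    ('Srilanka', 'Colombo'),
    ('Continents', 'Europe'),
    ('India', 'Mumbai'),
    ('India', 'Pune'),
    ('Continents', 'Asia'),
    ('Earth', 'Continents'),
    ('Asia', 'India'),
]

parents = {parent for parent, _ in pairs}    # unique first-column values
children = {child for _, child in pairs}     # unique second-column values
roots = sorted(parents - children)           # parents that are never a child
print(roots)
```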

0

Prerequisites:

  1. The data must be in the form of a DataFrame.
  2. It must have exactly two columns.

# now we are going to create the function 
def root_to_leaves(data):
    # import library
    import pandas as pd
    # Take the names of first and second columns.
    first_column_name = data.columns[0]
    second_column_name = data.columns[1]
    #XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    # Take a unique element from column 1 which is not in column 2.
    # We use set difference operation.
    A = set(data[first_column_name])
    B = set(data[second_column_name])
    C = list(A - B)
    # m0 means nothing but variable name.
    m0 = pd.DataFrame({'stage_1': C})
    #XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    # first merge data
    data = data.rename(columns ={first_column_name:'stage_1',second_column_name:'stage_2'})
    m1 = pd.merge(m0, data , on = 'stage_1', how = 'left')
    data = data.rename(columns = {'stage_1':'stage_2','stage_2':'stage_3'})
    # count of nan
    count_of_nan = 0
    i = 0
    while (count_of_nan != m1.shape[0]):
        on_variable = "stage_"+str(i+2)
        m2 = pd.merge(m1, data , on = on_variable, how = 'left')
        data = data.rename(columns = {'stage_'+str(i+2)+'':'stage_'+str(i+3)+'','stage_'+str(i+3)+'':'stage_'+str(i+4)+''})
        m1 = m2
        i = i + 1
        count_of_nan = m1.iloc[:,-1].isnull().sum()
    final_data = m1.iloc[:,:-1]
    return final_data

# you can find the result in the data_result
data_result = root_to_leaves(data)

