Python 2.7: Working with an HTML parse tree

Question


Tags: python, python-2.7, beautifulsoup, parse-tree, etetoolkit


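I am trying to build the parse tree for the HTML document below, but I could not work out how to construct it. I would like to see what the tree structure looks like. Can anyone help me here?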

# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Edit

Microsoft Windows [Version 6.1.7600]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\matt>easy_install ete2
Searching for ete2
Reading http://pypi.python.org/simple/ete2/
Reading http://ete.cgenomics.org
Reading http://ete.cgenomics.org/releases/ete2/
Reading http://ete.cgenomics.org/releases/ete2
Best match: ete2 2.1rev539
Downloading http://ete.cgenomics.org/releases/ete2/ete2-2.1rev539.tar.gz
Processing ete2-2.1rev539.tar.gz
Running ete2-2.1rev539\setup.py -q bdist_egg --dist-dir c:\users\arupra~1\appdat
a\local\temp\easy_install-sypg3x\ete2-2.1rev539\egg-dist-tmp-zemohm

Installing ETE (A python Environment for Tree Exploration).

Checking dependencies...
numpy cannot be found in your python installation.
Numpy is required for the ArrayTable and ClusterTree classes.
MySQLdb cannot be found in your python installation.
MySQLdb is required for the PhylomeDB access API.
PyQt4 cannot be found in your python installation.
PyQt4 is required for tree visualization and image rendering.
lxml cannot be found in your python installation.
lxml is required from Nexml and Phyloxml support.

However, you can still install ETE without such functionality.
Do you want to continue with the installation anyway? [y,n]y
Your installation ID is: d33ba3b425728e95c47cdd98acda202f
warning: no files found matching '*' under directory '.'
warning: no files found matching '*.*' under directory '.'
warning: manifest_maker: MANIFEST.in, line 4: path 'doc/ete_guide/' cannot end w
ith '/'

warning: manifest_maker: MANIFEST.in, line 5: path 'doc/' cannot end with '/'

warning: no previously-included files matching '*.pyc' found under directory '.'

zip_safe flag not set; analyzing archive contents...
Adding ete2 2.1rev539 to easy-install.pth file
Installing ete2 script to C:\Python27\Scripts

Installed c:\python27\lib\site-packages\ete2-2.1rev539-py2.7.egg
Processing dependencies for ete2
Finished processing dependencies for ete2

- Arup Rakshit

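@Oded, in Python, I guess :) - allergic

@Oded I just want to see what its tree structure looks like. Basically, I am using a Python package that processes an HTML document into a parse tree, so I would like to see that tree. I'd appreciate any help! - Arup Rakshit

@Oded I just want to see what it looks like as a tree structure, that's all. It does not have to be a Python-specific tree; Python would generate it in the standard way anyway. It should be a top-down parse tree. - Arup Rakshit

Why not edit the question and add these details to it? - Oded

As I explained earlier, I don't know Python. Others who do may be able to help you. But you really should edit the question and include all the relevant information. - Oded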

2 Answers


Python modules:
1. ETE, but it requires the data in Newick format.
2. GraphViz + pydot. See this SO answer.

JavaScript:
The amazing d3 TreeLayout, which takes JSON input.
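
As a rough illustration of the JSON route, here is a minimal sketch (not part of the original answer) that turns HTML into the nested name/children structure that d3 hierarchy examples conventionally consume. It assumes Python 3 and uses only the standard library's html.parser (on Python 2.7 the import would be `from HTMLParser import HTMLParser`). The TreeBuilder class, the html_to_dict helper, and the shortened sample markup are made up for this example.

```python
import json
from html.parser import HTMLParser  # Python 3 standard library

# Elements that never take a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Collects start/end tags into nested {"name", "children"} dicts."""

    def __init__(self):
        super().__init__()
        # Synthetic root so that fragments with several top-level tags work.
        self.root = {"name": "document", "children": []}
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = {"name": tag, "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open element;
        # stray end tags with no open partner are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["name"] == tag:
                del self.stack[i:]
                break


def html_to_dict(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


# Shortened version of the document from the question.
sample = ('<html><head><title>The Dormouse\'s story</title></head>'
          '<body><p class="title"><b>The Dormouse\'s story</b></p>'
          '<p class="story"><a id="link1">Elsie</a></p></body></html>')
tree = html_to_dict(sample)
print(json.dumps(tree, indent=2))
```

Text nodes are ignored here on purpose; if they matter, a handle_data method could append them as leaf entries.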

If you are using ETE, you need to convert the HTML to Newick format. Here is a small example I put together:

from lxml import html
from urllib import urlopen  # Python 2 only; in Python 3 this is urllib.request.urlopen


def getStringFromNode(node):
    # Customize this according to
    # your requirements.
    node_string = node.tag
    if node.get('id'):
        node_string += '-' + node.get('id')
    if node.get('class'):
        node_string += '-' + node.get('class')
    return node_string


def xmlToNewick(node):
    node_string = getStringFromNode(node)
    nwk_children = []
    for child in node.iterchildren():
        nwk_children.append(xmlToNewick(child))
    if nwk_children:
        return "(%s)%s" % (','.join(nwk_children), node_string)
    else:
        return node_string


def main():
    html_page = html.fromstring(urlopen('http://www.google.co.in').read())
    newick_page = xmlToNewick(html_page)
    return newick_page

main()

Output (in Newick format) for http://www.google.co.in:

'((meta,title,script,style,style,script)head,(script,textarea-csi,(((b-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,a-gb1,(u)a-gb1)nobr)div-gbar,((span-gbn-gbi,span-gbf-gbf,span-gbe,a-gb4,a-gb4,a-gb_70-gb4)nobr)div-guser,div-gbh,div-gbh)div-mngb,(br-lgpd,(((div)div-hplogo)div,br)div-lga,(((td,(input,input,input,(input-lst)div-ds,br,((input-lsb)span-lsbb)span-ds,((input-lsb)span-lsbb)span-ds)td,(a,a)td-fl sblc)tr)table,input-gbv)form,div-gac_scont,(br,((a,a,a,a,a,a,a,a,a)font-addlang,br,br)div-als)div,(((a,a,a,a,a-fehl)div-fll)div,(a)p)span-footer)center,div-xjsd,(script)div-xjsi,script)body)html'

After that, you can use ETE as shown in its examples.
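
Before handing the string to ETE, two details are worth checking: Newick reserves characters such as spaces, commas, colons, and parentheses, so a label like `td-fl sblc` in the output above may not parse, and a Newick string is conventionally terminated with a semicolon. Below is a minimal sketch of the same conversion with both handled. It assumes Python 3 and well-formed markup, so that the standard library's xml.etree.ElementTree can stand in for lxml; node_label, to_newick, and the sample markup are illustrative names and data, not part of the original answer.

```python
import re
import xml.etree.ElementTree as ET


def node_label(node):
    """Build 'tag-id-class' and replace characters that Newick reserves."""
    parts = [node.tag]
    for key in ("id", "class"):
        if node.get(key):
            parts.append(node.get(key))
    # Whitespace, parentheses, commas, colons, semicolons, brackets and
    # quotes all have a meaning in Newick, so collapse them to "_".
    return re.sub(r"[\s(),:;\[\]']+", "_", "-".join(parts))


def to_newick(node):
    """Recursively emit '(child,child)label' for an element."""
    children = [to_newick(child) for child in node]
    label = node_label(node)
    if children:
        return "(%s)%s" % (",".join(children), label)
    return label


# Shortened, well-formed variant of the question's document.
sample = """<html><head><title>The Dormouse's story</title></head>
<body><p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a></p>
<p class="fl sblc">...</p></body></html>"""

# The trailing semicolon marks the end of the tree.
newick = to_newick(ET.fromstring(sample)) + ";"
print(newick)
# ((title)head,((b)p-title,(a-link1-sister,a-link2-sister)p-story,p-fl_sblc)body)html;
```

For real-world pages that are not well-formed, the lxml-based parsing from the code above is still the safer choice; only the label cleaning and the semicolon need to be carried over.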

Hope this helps.

- vivek

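@PythonLikeYOU Your main challenge will be converting the HTML into Newick format. - vivek

@vivek Will doing this produce a tree from the HTML document? - Arup Rakshit

@PythonLikeYOU Perhaps this link will be useful? http://www.philipbjorge.com/archived_wp_blog/www.philipbjorge.com/2011/12/13/taxonomic-tree-visualization-and-nested-list-parsing/index.html - Alex L

@PythonLikeYOU I have updated the answer with code that converts HTML to Newick format. - vivek

@vivek, I have installed it, but I got the messages shown above. Are those packages needed for it to work correctly? NumPy and PyQt4 are not installed; are they required to run your code? - Arup Rakshit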


- tzelleke · Accepted Answer

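This answer is a bit late, but I still want to share it: I used networkx and lxml (which allows more elegant traversal of the DOM tree). However, the tree layout depends on graphviz and pygraphviz being installed; networkx on its own would just scatter the nodes across the canvas. The code is actually longer than necessary because I draw the labels myself so that they are boxed (networkx provides a function for drawing labels, but it does not pass the bbox keyword on to matplotlib).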

import networkx as nx
from lxml import html
import matplotlib.pyplot as plt
from networkx.drawing.nx_agraph import graphviz_layout

raw = "...your raw html"

def traverse(parent, graph, labels):
    labels[parent] = parent.tag
    # Iterating an lxml element yields its children; getchildren() is deprecated.
    for node in parent:
        graph.add_edge(parent, node)
        traverse(node, graph, labels)

G = nx.DiGraph()
labels = {}     # needed to map from node to tag
html_tag = html.document_fromstring(raw)
traverse(html_tag, G, labels)

pos = graphviz_layout(G, prog='dot')

label_props = {'size': 16,
               'color': 'black',
               'weight': 'bold',
               'horizontalalignment': 'center',
               'verticalalignment': 'center',
               'clip_on': True}
bbox_props = {'boxstyle': "round, pad=0.2",
              'fc': "grey",
              'ec': "b",
              'lw': 1.5}

nx.draw_networkx_edges(G, pos, arrows=True)
ax = plt.gca()

for node, label in labels.items():
    x, y = pos[node]
    ax.text(x, y, label,
            bbox=bbox_props,
            **label_props)

ax.xaxis.set_visible(False)
ax.yaxis.set_visible(False)
plt.show()
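
If graphviz and pygraphviz are not available, the same top-down structure can at least be inspected as an indented text outline. The following is a minimal sketch of that idea, not part of the original answer: it assumes Python 3, uses only the standard library's html.parser, and the OutlinePrinter class and the shortened sample markup are invented for this example.

```python
from html.parser import HTMLParser  # Python 3 standard library

# Elements that never take a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class OutlinePrinter(HTMLParser):
    """Records one indented line per element, two spaces per nesting level."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # names of the elements that are currently open
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * len(self.open_tags) + tag)
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close the most recent matching element and anything nested in it.
            while self.open_tags.pop() != tag:
                pass


raw = ('<html><head><title>The Dormouse\'s story</title></head>'
       '<body><p class="title"><b>The Dormouse\'s story</b></p>'
       '<p class="story">Once upon a time <a id="link1">Elsie</a>,<br>'
       '<a id="link2">Lacie</a></p></body></html>')

printer = OutlinePrinter()
printer.feed(raw)
printer.close()
outline = "\n".join(printer.lines)
print(outline)
```

For the sample above this prints html at the left margin, head and body one level in, and the p, b, a, and br elements nested beneath body.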

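If you prefer (or already use) BeautifulSoup, the code needs the following changes:

I'm no expert... I have only just looked at BS4 for the first time... but it works: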

#from lxml import html
from bs4 import BeautifulSoup
from bs4.element import NavigableString

...

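def traverse(parent, graph, labels):
    # id() keeps every element distinct. bs4 tags with identical markup
    # compare equal, so their hash() values coincide and they would
    # collapse into a single graph node.
    labels[id(parent)] = parent.name
    for node in parent.children:
        if isinstance(node, NavigableString):
            continue
        graph.add_edge(id(parent), id(node))
        traverse(node, graph, labels)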

...

#html_tag = html.document_fromstring(raw)
soup = BeautifulSoup(raw, 'lxml')  # name the parser explicitly; requires lxml to be installed
html_tag = next(soup.children)

...