Getting an HTML subtree from HTMLParser

Question

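I'm using Python's HTMLParser and trying to get the HTML subtree contained in a specific node. My generic parser already does its job well; once it finds the tag of interest, I want to hand the data inside that node to another, more specific HTMLParser.

Here is an example of what I'm trying to do: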

from html.parser import HTMLParser

class genericParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.divFound = False

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "good") in attrs:
            self.divFound = True

    def handle_data(self, data):
        if self.divFound:
            print(data)    ## prints nothing but whitespace
            parser = specificParser()
            parser.feed(data)
            self.divFound = False

I feed genericParser with something like this:

<html>
<head></head>
<body>
   <div class='good'>
      <ul>
         <li>test1</li>
         <li>test2</li>
      </ul>
   </div>
</body>
</html>

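However, the Python documentation for HTMLParser.handle_data says:

This method is called to process arbitrary data (e.g. text nodes and the content of <script>...</script> and <style>...</style>).

In my genericParser, the data received in handle_data is effectively empty (just the whitespace before <ul>), because the content of my <div class='good'> is not a text node.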
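To make that behaviour concrete, here is a minimal Python 3 sketch that records every string `handle_data` receives for the div from the sample above. The class name `DataLogger` and the `SAMPLE` constant are made up for this illustration and are not part of the original code.

```python
from html.parser import HTMLParser

SAMPLE = """<div class='good'>
   <ul>
      <li>test1</li>
      <li>test2</li>
   </ul>
</div>"""


class DataLogger(HTMLParser):
    """Hypothetical helper: collects every chunk passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


logger = DataLogger()
logger.feed(SAMPLE)
logger.close()

# Tags are never delivered as data; only the text between them is.
non_blank = [chunk for chunk in logger.chunks if chunk.strip()]
print(non_blank)  # ['test1', 'test2']
```

The first chunk is the newline and indentation between `<div class='good'>` and `<ul>`, which is why printing it appears to show nothing.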

How can I retrieve the raw HTML inside my div using HTMLParser?

Thanks in advance for your help.

- Marcassin

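It would be easier to extract the subtree with a DOM parser. Do you have to use HTMLParser? - Birei

I'm trying to stick with HTMLParser because most of the project is already built on it, but I ran into this problem when parsing a subtree. In the end I started recording the HTML tree in a buffer, and I use it when handle_endtag() closes the block of interest. It's not the solution I had in mind, but I'm no longer stuck. Thanks for the suggestion. - Marcassin

So you've already solved it? - Birei

Yes. I'll post the solution, but I'll wait a few hours to see whether there is a better approach than buffering the HTML. - Marcassin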

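I asked because I was thinking of extracting the subtree with BeautifulSoup and then calling your specificParser. If you're stuck with HTMLParser, my idea was also to record every node until the closing </div>, but I see you're already doing that. - Birei

@Birei: Unfortunately I'm stuck with HTMLParser because my GenericParser is too big, and rewriting it with BeautifulSoup would cost a lot of time. I should have done more research before starting development. Thanks a lot for your advice. - Marcassin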

1 Answer

- Marcassin · Accepted Answer

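I solved the problem by buffering everything encountered inside the HTML node of interest.

It works, but it isn't very "clean", because the GenericParser has to parse the whole block of interest before it can feed it to the SpecificParser. Here is a "lightweight" solution (no error handling at all):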

from html.parser import HTMLParser

class genericParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.divFound = False
        self.buff = ""
        self.level = 0

    def computeRecord(self, tag, attrs):
        # Rebuild the start tag from the parsed tag name and attributes.
        mystr = "<" + tag
        for att, val in attrs:
            if val is None:
                # Valueless attributes (e.g. "disabled") arrive as None.
                mystr += " " + att
            else:
                mystr += " " + att + "='" + val + "'"
        mystr += ">"
        return mystr

    def handle_starttag(self, tag, attrs):
        if not self.divFound:
            if tag == "div" and ("class", "good") in attrs:
                self.divFound = True
                self.level = 1   # the target div itself counts as depth 1
        else:
            self.level += 1
            self.buff += self.computeRecord(tag, attrs)

    def handle_data(self, data):
        if self.divFound:
            self.buff += data

    def handle_endtag(self, tag):
        if self.divFound:
            self.level -= 1
            if self.level == 0:
                # Closing tag of the target div: the subtree is complete.
                self.divFound = False
                print(self.buff)
            else:
                self.buff += "</" + tag + ">"

The output is as expected:

<ul>
     <li>test1</li>
     <li>test2</li>
</ul>
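Since `HTMLParser.feed()` accepts a document in fragments, the buffering step may not be strictly necessary: each piece can be forwarded to the second parser as soon as it is seen. The following is a sketch of that idea under some assumptions, not a drop-in replacement. It targets Python 3; `ForwardingParser`, `VOID_TAGS`, and the stand-in `SpecificParser` (which merely collects `<li>` text) are hypothetical names invented for this example. It uses `get_starttag_text()` to forward each start tag exactly as written, and re-escapes text with `html.escape` because `handle_data` delivers it already unescaped.

```python
import html
from html.parser import HTMLParser

# HTML void elements: they never have an end tag, so they must not add depth.
VOID_TAGS = frozenset({"area", "base", "br", "col", "embed", "hr", "img",
                       "input", "link", "meta", "source", "track", "wbr"})


class SpecificParser(HTMLParser):
    """Stand-in for the question's specificParser: collects <li> text."""

    def __init__(self):
        super().__init__()
        self.in_li = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items.append(data)


class ForwardingParser(HTMLParser):
    """Forwards everything inside <div class="good"> to another parser."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.depth = 0  # 0 = outside the div, 1 = directly inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag not in VOID_TAGS:
                self.depth += 1
            # get_starttag_text() returns the start tag as it was written.
            self.target.feed(self.get_starttag_text())
        elif tag == "div" and ("class", "good") in attrs:
            self.depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> leave the depth unchanged.
        if self.depth:
            self.target.feed(self.get_starttag_text())

    def handle_data(self, data):
        if self.depth:
            # Text arrives unescaped, so escape it again before forwarding.
            self.target.feed(html.escape(data, quote=False))

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth:
                self.target.feed("</" + tag + ">")
            else:
                self.target.close()  # end of the div: flush the target


html_doc = """<html>
<head></head>
<body>
   <div class='good'>
      <ul>
         <li>test1</li>
         <li>test2</li>
      </ul>
   </div>
</body>
</html>"""

target = SpecificParser()
ForwardingParser(target).feed(html_doc)
print(target.items)  # ['test1', 'test2']
```

Known gaps in this sketch: comments and doctype declarations inside the div are not forwarded, the contents of `<script>` and `<style>` would be escaped even though they should pass through untouched, and a second matching div would call `close()` on the same target again.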

As Birei said in the comments, it would be much easier to extract the subtree with BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
div = soup("div", {"class": "good"})
children = div[0].findChildren()
print(children[0])   #### desired output