Extracting and cleaning HTML fragments with HTML Parser (org.htmlparser)

Question


java software-design html-parsing


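I'm looking for an efficient way to extract an HTML fragment from a web page and perform certain operations on that fragment.

The required operations are:

1. Remove all tags with "class = hidden"
2. Remove all script tags
3. Remove all style tags
4. Remove all event attributes (on*="*")
5. Remove all style attributes

I've been using HTML Parser (org.htmlparser) for this task and have been able to meet all of the requirements, but my solution doesn't feel elegant. Currently, I use a CssSelectorNodeFilter (to get the fragment) and then re-parse that fragment with a NodeVisitor to perform the cleanup.

Could anyone suggest how to approach this? I'd like to parse the document only once and perform all of the operations during that single parse.

Thanks in advance!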

- Kieran Hall

1 Answer


- maerics · Accepted Answer

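Take a look at jsoup; it can handle all of the tasks you need in an elegant way.

[Edit]

Here's a complete working example that covers the operations you listed: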

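import java.io.File;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Load and parse the document (Jsoup.parse(File, ...) throws IOException).
File f = new File("myfile.html"); // See also Jsoup#parseBodyFragment(s)
Document doc = Jsoup.parse(f, "UTF-8", "http://example.com");

// Remove all script and style elements and those of class "hidden".
doc.select("script, style, .hidden").remove();

// Remove all style and event-handler attributes from all elements.
for (Element el : doc.select("*")) {
  // Collect the matching keys first: removing attributes while iterating over
  // el.attributes() can throw ConcurrentModificationException, depending on the jsoup version.
  List<String> toRemove = new ArrayList<>();
  for (Attribute attr : el.attributes()) {
    String attrKey = attr.getKey();
    if (attrKey.equals("style") || attrKey.startsWith("on")) {
      toRemove.add(attrKey);
    }
  }
  for (String attrKey : toRemove) {
    el.removeAttr(attrKey);
  }
}
// See also - doc.select("*").removeAttr("style");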

You'll want to make sure the attribute-name matching is case-insensitive, but this should be most of what you need.
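As a rough illustration of that case-insensitive check, here is a minimal sketch using only the Java standard library. The class name `AttributeFilter` and the method `isUnwanted` are hypothetical helpers made up for this example; they are not part of jsoup or org.htmlparser. Like the loop above, it treats any attribute whose name begins with `on` as an event handler, which is an assumption that may be too broad for some documents.

```java
import java.util.Locale;

// Hypothetical helper (not part of jsoup or org.htmlparser): decides whether
// an attribute should be stripped, ignoring the case of its name.
final class AttributeFilter {
  private AttributeFilter() {}

  // Returns true for "style" and for any name starting with "on"
  // (onclick, ONLOAD, ...), compared case-insensitively.
  static boolean isUnwanted(String attrName) {
    if (attrName == null) {
      return false;
    }
    // Locale.ROOT keeps lowercasing independent of the default locale.
    String key = attrName.toLowerCase(Locale.ROOT);
    return key.equals("style") || key.startsWith("on");
  }

  public static void main(String[] args) {
    // Print the decision for a few sample attribute names.
    for (String name : new String[] {"onClick", "STYLE", "href"}) {
      System.out.println(name + " -> " + isUnwanted(name));
    }
  }
}
```

With a helper like this, the `if` condition in the loop above could call `AttributeFilter.isUnwanted(attr.getKey())` instead of comparing the raw key directly.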