Extracting text outside HTML tags

Question

3

I have the following HTML code:

<div class=example>Text #1</div> "Another Text 1"
<div class=example>Text #2</div> "Another Text 2"

I want to extract the text outside the tags, "Another Text 1" and "Another Text 2". I'm using JSoup to do this. Any ideas? Thanks!

- johnny243

2 Answers

2

You can select the next Node (not Element!) of each div tag. In your example, they are all TextNodes.

final String html = "<div class=example>Text #1</div> \"Another Text 1\"\n"
                  + "<div class=example>Text #2</div> \"Another Text 2\" ";

Document doc = Jsoup.parse(html);

for( Element element : doc.select("div.example") ) // Select all the div tags
{
    TextNode next = (TextNode) element.nextSibling(); // Get the next node of each div as a TextNode

    System.out.println(next.text()); // Print the text of the TextNode
}

Output:

 "Another Text 1" 
 "Another Text 2"
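
The cast to TextNode above works for this input, but nextSibling() can also return null or another Element on other pages. A minimal sketch of a more defensive variant follows; the class name SiblingText and the helper textAfterDivs are made up for illustration and are not part of Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class SiblingText {

    // Hypothetical helper: collects the trimmed text that directly follows
    // each div.example, skipping divs whose next sibling is not a text node.
    static List<String> textAfterDivs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();

        for (Element div : doc.select("div.example")) {
            Node next = div.nextSibling(); // may be null or an Element

            if (next instanceof TextNode) {
                String text = ((TextNode) next).text().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class=example>Text #1</div> \"Another Text 1\"\n"
                    + "<div class=example>Text #2</div> \"Another Text 2\" ";

        for (String text : textAfterDivs(html)) {
            System.out.println(text);
        }
    }
}
```

Trimming each value also removes the leading and trailing spaces visible in the output above.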

- ollo

- ashatte · Accepted Answer

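One solution is to use the ownText() method (see the Jsoup docs). This method returns only the text owned by the specified element and ignores any text owned by its direct child elements.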

Using only the HTML you provided, you can extract the ownText of the <body>:

String html = "<div class='example'>Text #1</div> 'Another Text 1'<div class='example'>Text #2</div> 'Another Text 2'";

Document doc = Jsoup.parse(html);
System.out.println(doc.body().ownText());

This will output:

'Another Text 1' 'Another Text 2'

Note that the ownText() method can be used on any Element. There is another example in the docs.
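
As a rough sketch of using ownText() on an element other than <body>, the snippet below applies it to a <p> that has a child tag. Because ownText() joins all of the element's own text into one string, it also shows Element.textNodes() as one way to get each piece separately. The class OwnTextDemo and both helper methods are hypothetical names chosen for this example, and selectFirst assumes a reasonably recent Jsoup version.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

import java.util.ArrayList;
import java.util.List;

public class OwnTextDemo {

    // Hypothetical helper: own text of the first element matching cssQuery,
    // or an empty string when nothing matches.
    static String ownTextOf(String html, String cssQuery) {
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        return el == null ? "" : el.ownText();
    }

    // Hypothetical helper: each non-blank direct text node of the matched
    // element as its own trimmed string, instead of one joined string.
    static List<String> ownTextPieces(String html, String cssQuery) {
        List<String> pieces = new ArrayList<>();
        Element el = Jsoup.parse(html).selectFirst(cssQuery);
        if (el == null) {
            return pieces;
        }
        for (TextNode node : el.textNodes()) {
            if (!node.isBlank()) {
                pieces.add(node.text().trim());
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        // Text inside <b> belongs to the child, so it is not part of p's own text.
        System.out.println(ownTextOf("<p>Hello <b>there</b> now!</p>", "p"));

        String html = "<div class='example'>Text #1</div> 'Another Text 1'"
                    + "<div class='example'>Text #2</div> 'Another Text 2'";
        for (String piece : ownTextPieces(html, "body")) {
            System.out.println(piece);
        }
    }
}
```

The second helper may be the more convenient choice when each fragment needs to be processed on its own, since splitting the joined ownText() result afterwards is fragile.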