Getting the full HTML string from an HTMLDocument

Question

8

I'm writing some Java code...

Does anyone know how to get the contents of a javax.swing.text.html.HTMLDocument as a String? This is what I have so far...

// Fetch the page and parse it into an HTMLDocument.
URL url = new URL("http://www.test.com");

HTMLEditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
// Ignore any charset directive found in the HTML while parsing.
doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
Reader htmlReader = new InputStreamReader(url.openConnection().getInputStream());
kit.read(htmlReader, doc, 0);

I need the contents of the HTMLDocument as a String.

For example:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">    <html><head><meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1">

...and so on.

Any help would be greatly appreciated. I need to use the HTMLDocument class so that the HTML is handled properly :)

Thanks, Daniel

- Zelleriation

2 Answers

1

You don't need the editor kit or the reader at all; just read the input stream directly. For example, use IOUtils.toString(inputStream) from commons-io.
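If pulling in commons-io is not an option, reading the stream directly can be sketched with the standard library alone. This is a minimal illustration, assuming Java 9 or later (for InputStream.readAllBytes) and assuming the page is UTF-8 encoded; the class name RawHtmlReader and the method readAll are hypothetical names, not part of any library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: returns the raw bytes of a stream as a String.
class RawHtmlReader {

    // Reads the entire stream and decodes it as UTF-8 (an assumed charset).
    static String readAll(InputStream in) {
        try {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for url.openConnection().getInputStream().
        byte[] bytes = "<html><body>Hi</body></html>".getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            System.out.println(readAll(in));
        }
    }
}
```

This returns the markup exactly as the server sent it, with no parsing involved, so it does not help if the HTML needs to be processed through HTMLDocument first.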

Or you can use:

Content content = document.getContent();
String str = content.getString(0, content.length() - 1);

- Bozho

This doesn't work, because the inherited getContent method is protected. - Parker

- Joop Eggen · Accepted Answer

StringWriter writer = new StringWriter();
kit.write(writer, doc, 0, doc.getLength());
String s = writer.toString();
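
For context, the snippet above can be combined with the parsing code from the question into one self-contained sketch. The class name HtmlDocumentDump and the methods parse, toHtml, and roundTrip are illustrative names only, and an in-memory string is assumed in place of the URL so that the example runs without network access.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class wrapping the read and write steps.
class HtmlDocumentDump {

    // Parses HTML from a reader into a new HTMLDocument, as in the question.
    static HTMLDocument parse(Reader reader) throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);
        kit.read(reader, doc, 0);
        return doc;
    }

    // Serializes the whole document back to HTML markup through the editor kit.
    static String toHtml(HTMLDocument doc) throws IOException, BadLocationException {
        StringWriter writer = new StringWriter();
        new HTMLEditorKit().write(writer, doc, 0, doc.getLength());
        return writer.toString();
    }

    // Convenience wrapper: parse a string and write it back out.
    static String roundTrip(String html) {
        try {
            return toHtml(parse(new StringReader(html)));
        } catch (IOException | BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("<html><body><p>Hello</p></body></html>"));
    }
}
```

Note that the string is regenerated from the parsed document model, so it may differ from the original source in whitespace and layout. If the untouched original markup is needed, reading the stream directly is the better fit.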