How do I pretty-print HTML with Nokogiri?

Question

html ruby nokogiri pretty-print

31

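I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the pages. I need to print the page out, and while experimenting in IRB I noticed a pretty_print method, but it takes a parameter and I can't figure out what it wants.

My crawler caches the HTML of the web pages and writes it to files on my local machine. I would like to "pretty print" that HTML so that it looks nice and is properly formatted when I print it.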

- Jarsen

1

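What do you want to print? The HTML content (including tags) or selected items? There are different approaches for each case, so clarifying would help with answering. - user214028

8 Answers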

19

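I'm guessing that by "pretty printing" you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is only useful for debugging.

There are a few projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (HTML Tidy being the best known), but after some Googling I found this post titled "Pretty printing XHTML with Nokogiri and XSLT".

It boils down to this: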

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

Of course, this requires you to download the linked XSL file to your filesystem. I tried it very quickly on my machine and it works like a charm.

- mislav

The stylesheet mentioned here can introduce whitespace into the rendered HTML (for example, <p><span>pre</span>fix</p> becomes "pre fix"). - Kyle McClellan

9

This worked for me:

 pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)

I tried the REXML version above, but it corrupted some of my files. And I don't want to bring XSLT into a new project. Both feel outdated. :)

- bronson

This is nice, but it adds the <body> and <html> tags if they are missing. In my case I don't need them at all. - kravc

4

You can try REXML:

require "rexml/document"

doc = REXML::Document.new(xml)
doc.write($stdout, 2)
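A self-contained variant of the snippet above, as a sketch: the sample xml string is invented, and the output is collected in a String instead of going to $stdout, which is handy if you want to write it to a file afterwards. Keep in mind that REXML is an XML parser, so the input has to be well-formed.

```ruby
require "rexml/document"

xml = "<root><item>one</item><item>two</item></root>"
doc = REXML::Document.new(xml)

# Document#write appends to any object that responds to <<, so a String
# works as well as $stdout. The second argument is the indent width.
out = +""
doc.write(out, 2)
puts out
```

Note that this formatter also moves text onto its own indented line, which changes the whitespace inside elements.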

- Julien

2

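My solution was to add a print method to the actual Nokogiri objects. After running the snippet below, you can just write node.print and it will pretty-print the contents. No XSLT required :-)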

Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")

    puts (" " * indent) + open_tag
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts (" " * indent) + close_tag if close_tag
  end
end

Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Replace all consecutive whitespace characters by a single space; precede the
  # output by a certain amount of indentation; print this text.
  define_method :print do |indent=0|
    puts (" " * indent) + to_s.strip.gsub(/\s+/, ' ')
  end
end

- pariser

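Do you have an example of this being used? I tried it but got a "TypeError: no implicit conversion of nil into String" error; I may be calling it on the wrong object. - ian

After a few attempts I finally got it working: doc = Nokogiri::HTML(html_source); doc.elements.each {|elem| elem.print }. Thanks. - ian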

1

Even simpler, and it works well:

puts Nokogiri::HTML(File.read('terms.fr.html')).to_xhtml

- Dorian

0

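I know I'm very late in answering this question, but I'll leave an answer anyway. I tried all of the steps above, and they do work to some extent.

Nokogiri does format the HTML, but it doesn't care about closing or opening tags, so proper pretty formatting is out of the question.

I found a gem called htmlbeautifier that works really well. I hope other people who are still looking for an answer can benefit from it too.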

- Abeid Ahmed

-6

Why don't you try the pp method?

require 'pp'
pp some_var

- khelll

4

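Although Nokogiri implements methods that help with "pretty printing", their output is meant for developers only. I think Jarsen wants to display pretty-printed HTML source. - mislav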

- Phrogz · Accepted Answer

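@mislav's answer is a bit wrong. Nokogiri does support pretty-printing if you:

Parse the document as XML

Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing

Use to_xhtml or to_xml to specify the pretty-printing parameters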

Like so:

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
doc = Nokogiri::XML(html,&:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml( indent:3, indent_text:"." )
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>