Cheerio: get text and replace HTML tags with spaces

Question

3

Today we are using Cheerio, and in particular its .text() method, to extract the text from an HTML input.

But when the HTML is

<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>

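On the rendered page, the div that follows the word "By" guarantees a space or a line break. But when we apply Cheerio's text(), we get ByJohn Smith, which is wrong because we need a space between "By" and "John".

Generally speaking, is it possible to get the text in a special way so that every HTML tag is replaced by a space? (I can collapse the multiple spaces afterwards...)

The output we want is By John Smith.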

- Mathieu

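Perhaps unrelated to the question, but your HTML example is invalid, because both div tags surrounding John Smith are closing tags. - cYrixmorten

Not relevant to the real question. Thanks, typo corrected. - Mathieu

It looks like you just aren't applying the right selector. Use the selector you already have, plus h2, to get the heading's content on its own. - html_programmer

@Mathieu Do you have to use cheerio? - Maik Lowrey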

7 Answers

2

You can do this with plain JavaScript in the browser, where innerText reflects the rendered line breaks:

const parent = document.querySelector('div');
console.log(parent.innerText.replace(/(\r\n|\n|\r)/gm, " "))

<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>

- Maik Lowrey

1

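"Generally speaking, is it possible to get the text in a special way so that every HTML tag is replaced by a space? (I can collapse the multiple spaces afterwards if needed...)"

Just add ' ' before and after every tag: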

$("*").each(function (index) {
    $(this).prepend(' ');
    $(this).append(' ');
});

Then collapse the multiple spaces:

$.text().replace(/\s{2,}/g, ' ').trim();
//=> "By John Smith"

Because cheerio is essentially a jQuery implementation for Node.js, answers written for jQuery may be useful here too.

Working example:

const cheerio = require('cheerio');
const $ = cheerio.load(`
    <div>
        By<div><h2 class="authorh2">John Smith</h2></div>
    </div>
`);

$("*").each(function (index) {
    $(this).prepend(' ');
    $(this).append(' ');
});

let raw = $.text();
//=> "        By  John Smith" (duplicate spaces)

let trimmed = raw.replace(/\s{2,}/g, ' ').trim();
//=> "By John Smith"

- Maksim I. Kuzmin

0

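You can use htmlparser2 instead of cheerio. It lets you define callbacks for every opening tag, text node, and closing tag it encounters while parsing the HTML.

This code produces the output string you want: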

const htmlparser = require('htmlparser2');

let markup = `<div>
By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;

var parts = [];
var parser = new htmlparser.Parser({
    onopentag: function(name, attributes){
        parts.push(' ');
    },
    ontext: function(text){
        parts.push(text);
    },
    onclosetag: function(tagName){
    // no-op
    }
}, {decodeEntities: true});

parser.write(markup);
parser.end();

// Join the parts, collapse every run of whitespace (spaces and
// newlines) into a single space, and trim the ends.
const result = parts.join('').replace(/\s+/g, ' ').trim();

console.log(result); // By John Smith

Here is another example of how to use it: https://runkit.com/jfahrenkrug/htmlparser2-demo/1.0.0

- Johannes Fahrenkrug

0

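Cheerio's text() method is mainly meant for getting clean text out of scraped pages. As you have found, that is not the same as converting an HTML page to plain text. If you only need the text for indexing, adding spaces with a regular-expression replace will work. For other scenarios, such as conversion to audio, it will not always work, because you need to distinguish spaces from line breaks.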
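To make the space-versus-line-break distinction concrete, here is a minimal sketch that turns block-level tags into line breaks and every other tag into a space. The helper name `htmlToLines` and the `BLOCK_TAGS` list are assumptions made for illustration, not part of Cheerio or of this answer. A regular expression is not an HTML parser, so comments, script contents, and a `>` inside an attribute value are not handled.

```javascript
// Hypothetical, deliberately short list of block-level tags (an assumption).
const BLOCK_TAGS = new Set([
  "address", "article", "aside", "blockquote", "br", "div", "footer",
  "h1", "h2", "h3", "h4", "h5", "h6", "header", "hr", "li", "main",
  "ol", "p", "pre", "section", "table", "tr", "ul"
]);

// Hypothetical helper: block-level tags become "\n", all other tags become " ".
function htmlToLines(html) {
  return html
    .replace(/<\/?([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>/g, (match, name) =>
      BLOCK_TAGS.has(name.toLowerCase()) ? "\n" : " "
    )
    .split("\n")
    // Collapse whitespace inside each line and drop empty lines.
    .map((line) => line.replace(/\s+/g, " ").trim())
    .filter((line) => line.length > 0)
    .join("\n");
}

const html = `<div>
  By<div><h2 class="authorh2">John <b>Smith</b></h2></div>
</div>`;

console.log(htmlToLines(html));
// By
// John Smith
```

For indexing you could join the lines with a space instead; for audio or display, the line breaks are kept.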

My suggestion is to use a library that converts HTML to markdown. One option is turndown.

var TurndownService = require('turndown')

var turndownService = new TurndownService()
var markdown = turndownService.turndown('<div>\nBy<div><h2>John Smith</h2></div></div>')

This produces:

'By\n\nJohn Smith\n----------'

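The last line is there because of the H2 heading. Markdown is easier to clean up; you will probably only need to strip URLs and images. The text is also easier for humans to read.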

- kgiannakakis

0

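If you want a clean text representation, I suggest lynx (used by Project Gutenberg) or pandoc. Both can be installed and invoked from Node via spawn. They give a cleaner text representation than running puppeteer and reading textContent or innerText.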
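As a rough sketch of the spawn approach, the snippet below pipes an HTML string into pandoc and reads plain text back. It assumes pandoc is installed and on the PATH. The helper name `htmlToPlainText` and its `command` and `args` parameters are hypothetical; it uses Node's synchronous `spawnSync` for brevity rather than the asynchronous `spawn`.

```javascript
const { spawnSync } = require("child_process");

// Hypothetical helper: runs a converter command, feeds it the HTML on stdin,
// and returns whatever the command writes to stdout.
// Default: `pandoc -f html -t plain` (assumes pandoc is installed).
function htmlToPlainText(html, command = "pandoc", args = ["-f", "html", "-t", "plain"]) {
  const result = spawnSync(command, args, { input: html, encoding: "utf8" });
  if (result.error) {
    throw result.error; // e.g. ENOENT when the tool is not installed
  }
  if (result.status !== 0) {
    throw new Error(`${command} exited with code ${result.status}: ${result.stderr}`);
  }
  return result.stdout;
}

try {
  console.log(htmlToPlainText("<div>By<div><h2>John Smith</h2></div></div>"));
} catch (err) {
  console.error("Conversion failed:", err.message);
}
```

Because the command and its arguments are parameters, a different converter such as lynx could be swapped in without changing the helper.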

You can also try walking the DOM and adding line breaks based on the node type.

import * as cheerio from "cheerio";

const NODE_TYPES = {
  TEXT: "text",
  ELEMENT: "tag"
};

const INLINE_ELEMENTS = [
  "a",
  "abbr",
  "acronym",
  "audio",
  "b",
  "bdi",
  "bdo",
  "big",
  "br",
  "button",
  "canvas",
  "cite",
  "code",
  "data",
  "datalist",
  "del",
  "dfn",
  "em",
  "embed",
  "i",
  "iframe",
  "img",
  "input",
  "ins",
  "kbd",
  "label",
  "map",
  "mark",
  "meter",
  "noscript",
  "object",
  "output",
  "picture",
  "progress",
  "q",
  "ruby",
  "s",
  "samp",
  "script",
  "select",
  "slot",
  "small",
  "span",
  "strong",
  "sub",
  "sup",
  "svg",
  "template",
  "textarea",
  "time",
  "u",
  "tt",
  "var",
  "video",
  "wbr"
];

const content = `
<div>
  By March
  <div>
    <h2 class="authorh2">John Smith</h2>
    <div>line1</div>line2
         line3
    <ul>
      <li>test</li>
      <li>test2</li>
      <li>test3</li>
    </ul>
  </div>
</div>
`;

const isInline = (element) => INLINE_ELEMENTS.includes(element.name);
const isBlock = (element) => isInline(element) === false;
// Depth-first traversal: calls callback(node, index, level) for every node.
const walkTree = (node, callback, index = 0, level = 0) => {
  callback(node, index, level);
  const children = node.children || [];
  for (let i = 0; i < children.length; i++) {
    walkTree(children[i], callback, i, level + 1);
  }
};

const docFragText = [];
const cheerioFn = cheerio.load(content);
const docFrag = cheerioFn("body")[0];

walkTree(docFrag, (element) => {
  if (element.name === "body") {
    return;
  }

  if (element.type === NODE_TYPES.TEXT) {
    const parentElement = element.parent || {};
    const previousElement = element.prev || {};

    let textContent = element.data
      .split("\n")
      .map((nodeText, index) => (/\w/.test(nodeText) ? nodeText + "\n" : ""))
      .join("");

    if (textContent) {
      if (isInline(parentElement) || isBlock(previousElement)) {
        textContent = `${textContent}`;
      } else {
        textContent = `\n${textContent}`;
      }
      docFragText.push(textContent);
    }
  }
});

console.log(docFragText.join(""));

- first last

0

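The existing answers use regular expressions or other libraries, but neither is necessary. The trick to working with text nodes in Cheerio is to use .contents():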

const cheerio = require("cheerio"); // 1.0.0-rc.12

const html = `
<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;

const $ = cheerio.load(html);
console.log($("div").contents().first().text().trim()); // => By

If you cannot be sure the text node is always the first child, you can pick the first text node among all the children like this:

const text = $(
  [...$("div").contents()].find(e => e.type === "text")
)
  .text()
  .trim();
console.log(text); // => By

Hopefully it goes without saying, but the "John Smith" part is standard Cheerio:

const name = $("div").find("h2").text().trim();
console.log(name); // => John Smith

- ggorlen

- Reza Saadati · Accepted Answer

You can use the following regular expression to replace every HTML tag with a space:

/<\/?[a-zA-Z0-9=" ]*>/g

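Replacing the HTML with this regular expression can leave several spaces in a row. In that case, use replace(/\s\s+/g, ' ') to collapse each run of whitespace into a single space.

See the result: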

console.log(document.querySelector('div').innerHTML.replaceAll(/<\/?[a-zA-Z0-9=" ]*>/g, ' ').replace(/\s\s+/g, ' ').trim())

<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>
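
The snippet above reads innerHTML from a browser DOM. In Node, where Cheerio runs, the same idea can be applied directly to an HTML string. The sketch below is an illustration only: `tagsToSpaces` is a hypothetical helper, and it swaps in the broader pattern `<[^>]+>` because the character class in the pattern above does not match tags whose attributes contain characters such as `/`, `-`, or `.`. As with any regular-expression approach, comments, script contents, and a `>` inside an attribute value are not handled.

```javascript
// Hypothetical helper: replace every tag with a space, then collapse whitespace.
function tagsToSpaces(html) {
  return html
    .replace(/<[^>]+>/g, " ") // broader pattern, swapped in for this sketch
    .replace(/\s+/g, " ")
    .trim();
}

const html = `<div>
  By<div><h2 class="authorh2">John Smith</h2></div>
</div>`;
console.log(tagsToSpaces(html)); // => By John Smith

// A tag with a path-like attribute value: the narrower pattern leaves it in place.
const link = '<p>See<a href="/about-us.html">about</a></p>';
const narrow = link.replace(/<\/?[a-zA-Z0-9=" ]*>/g, " ");
console.log(narrow.includes("<a href")); // => true
console.log(tagsToSpaces(link)); // => See about
```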