Extracting text content from a web page

Question

3

I need to extract all the text content from a web page. I tried document.body.textContent, but it also returns JavaScript content. How can I make sure I only get the readable text?

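function myFunction() {
  // textContent returns the text of every node under <body>,
  // including the contents of <script> elements.
  var str = document.body.textContent;
  alert(str);
}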

<html>
<title>Test Page for Text extraction</title>

<head>I hope this works</head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>

<body>
  <p>Test on this content to change the 5th word to a link
    <p>
      <button onclick="myFunction()">Try it</button>
</body>
</html>

- vjravi

2 Answers

0

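Try using document.body.innerText.

This MDN article describes the differences between textContent and innerText:

Don't get confused by the differences between Node.textContent and HTMLElement.innerText. Although the names seem similar, there are important differences:

textContent gets the content of all elements, including <script> and <style> elements. In contrast, innerText only shows "human-readable" elements.

textContent returns every element in the node. In contrast, innerText is aware of styling and won't return the text of "hidden" elements.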
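
As a rough illustration of the distinction quoted above, the sketch below reads both properties from the same element so they can be compared side by side. The helper name compareText is hypothetical and does not come from the MDN article; it only reads the two standard DOM properties.

```javascript
// Hypothetical helper (not part of the original answer): returns both
// text views of an element so they can be compared.
function compareText(el) {
  return {
    // Text of every descendant node, including <script>/<style> contents
    // and elements hidden with CSS.
    textContent: el.textContent,
    // Text roughly as rendered: skips <script>/<style> and hidden elements.
    innerText: el.innerText
  };
}

// Usage in a browser; guarded so the snippet also loads outside one.
if (typeof document !== 'undefined' && document.body) {
  var result = compareText(document.body);
  console.log(result.textContent.length, result.innerText.length);
}
```

Because innerText depends on the rendered layout, reading it forces a reflow, so it may be noticeably slower than textContent on large pages.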

- GJW

Content provided by Stack Overflow.

- Patrick Evans · Accepted Answer

Just remove the tags you don't want to read before reading body.textContent.

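function myFunction() {
  // Find every <script> element inside <body>.
  var bodyScripts = document.querySelectorAll("body script");
  // Remove them from the live document so their text is not collected.
  for (var i = 0; i < bodyScripts.length; i++) {
    bodyScripts[i].remove();
  }
  // With the scripts gone, textContent holds only the remaining text.
  var str = document.body.textContent;
  // Replace the page with the extracted text (this is destructive).
  document.body.innerHTML = '<pre>' + str + '</pre>';
}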

<html>
<title>Test Page for Text extraction</title>

<head>I hope this works</head>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.3/jquery.min.js"></script>

<body>
  <p>Test on this content to change the 5th word to a link
    <p>
      <button onclick="myFunction()">Try it</button>
</body>
</html>
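
The accepted answer removes the script elements from the live page before reading the text. If the page must stay intact, one possible alternative is to walk the DOM tree and skip unwanted tags while collecting text. The following is only a sketch under stated assumptions: the helper name extractReadableText and the SKIPPED_TAGS list are hypothetical, and, unlike innerText, this approach does not skip elements hidden with CSS.

```javascript
// Tag names whose text is not meant for readers
// (an assumed list; extend as needed).
var SKIPPED_TAGS = ['SCRIPT', 'STYLE', 'NOSCRIPT'];

// Hypothetical helper: recursively collects text from a DOM subtree,
// skipping the tags above, without modifying the page.
function extractReadableText(node) {
  // nodeType 3 is a text node: return its text as-is.
  if (node.nodeType === 3) {
    return node.nodeValue;
  }
  // nodeType 1 is an element: ignore it entirely if its tag is skipped.
  if (node.nodeType === 1 &&
      SKIPPED_TAGS.indexOf(node.nodeName.toUpperCase()) !== -1) {
    return '';
  }
  // Otherwise concatenate the text of all children (comments have none).
  var text = '';
  var children = node.childNodes || [];
  for (var i = 0; i < children.length; i++) {
    text += extractReadableText(children[i]);
  }
  return text;
}

// Usage in a browser; guarded so the snippet also loads outside one.
if (typeof document !== 'undefined' && document.body) {
  console.log(extractReadableText(document.body));
}
```

Since nothing is removed, the button and the rest of the page keep working after the text has been extracted.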