How do I put the text from all tags into an array?

Question

4

I need to build an array containing all the text on a page, without using jQuery. This is my HTML:

<html>
<head>
    <title>Hello world!</title>
</head>
<body>
    <h1>Hello!</h1>
    <p>
        <div>What are you doing?</div>
        <div>Fine, and you?</div>
    </p>
    <a href="http://google.com">Thank you!</a>
</body>
</html>

This is what I want to get:

text[1] = "Hello world!";
text[2] = "Hello!";
text[3] = "What are you doing?";
text[4] = "Fine, and you?";
text[5] = "Thank you!";

This is the code I tried, but it doesn't seem to work properly in my browser:

var elements = document.getElementsByTagName('*');
console.log(elements);

PS. I need to use document.getElementsByTagName('*'); and exclude "script" and "style".

- smotru

var full_text = []; $(':not(#anything)').each(function(index, element) { var text = $(element).text(); full_text.push(text); }); - Nick

1

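It seems to return all elements in both Chrome and FF: http://jsfiddle.net/7JJdn/ - Chad

Thank you, but I can't use jQuery.. - smotru

Which browser are you using? - Steve Allison

It works fine on jsFiddle, but not in my project.. I'm trying to set it up locally. - smotru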

http://my.jetscreenshot.com/demo/20130718-iw1f-53kb.jpg - smotru

5 Answers

2

If you want the content of the whole page, you should be able to use

var allText = document.body.textContent;

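Internet Explorer before IE9 has innerText instead, which is similar but not identical. See the MDN page on textContent for more details.

One issue here is that textContent also picks up the contents of any <style> or <script> tags, which may not be what you want. If you don't want that content, you can use something like this: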

function getText(startingPoint) {
  var text = "";
  // Recursively walk the subtree, appending the value of every text node
  function gt(start) {
    if (start.nodeType === 3)        // text node
      text += start.nodeValue;
    else if (start.nodeType === 1)   // element node
      // Skip the contents of <script> and <style> elements
      if (start.tagName != "SCRIPT" && start.tagName != "STYLE")
        for (var i = 0; i < start.childNodes.length; ++i)
          gt(start.childNodes[i]);
  }
  gt(startingPoint);
  return text;
}

Then:

var allText = getText(document.body);

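Note: this (or document.body.innerText) will get you all the text, but in depth-first order. Getting all the text on a page in the order a human actually sees it once the page is rendered is a much harder problem, because the code would need to understand the visual effects (and visual semantics!) of the layout as dictated by CSS (etc.).

Edit - if you want the text "stored into an array", node by node I suppose, just substitute array pushes for the string concatenation: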

function getTextArray(startingPoint) {
  var text = [];
  function gt(start) {
    if (start.nodeType === 3)
      text.push(start.nodeValue);
    else if (start.nodeType === 1)
      if (start.tagName != "SCRIPT" && start.tagName != "STYLE")
        for (var i = 0; i < start.childNodes.length; ++i)
          gt(start.childNodes[i]);
  }
  gt(startingPoint);
  return text;
}
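
As a possible refinement (not part of the answer itself), the same walk can skip whitespace-only text nodes and trim the rest, which gets closer to the array in the question. This is only a sketch: collectTrimmedText, el and txt are hypothetical names, and the tree is a hand-built stand-in so the snippet also runs outside a browser.

```javascript
// Same recursive walk as getTextArray, but whitespace-only text nodes are
// skipped and the remaining strings are trimmed.
// collectTrimmedText is a hypothetical name, not part of the original answer.
function collectTrimmedText(startingPoint) {
  var text = [];
  function walk(node) {
    if (node.nodeType === 3) {            // 3 = text node
      var value = node.nodeValue.trim();
      if (value !== "") text.push(value);
    } else if (node.nodeType === 1) {     // 1 = element node
      if (node.tagName !== "SCRIPT" && node.tagName !== "STYLE") {
        for (var i = 0; i < node.childNodes.length; ++i) walk(node.childNodes[i]);
      }
    }
  }
  walk(startingPoint);
  return text;
}

// Hand-built stand-in for a small DOM subtree (an assumption made so the
// sketch runs without a browser).
function el(tagName, children) { return { nodeType: 1, tagName: tagName, childNodes: children }; }
function txt(value) { return { nodeType: 3, nodeValue: value, childNodes: [] }; }

var sampleTree = el("BODY", [
  txt("\n    "),
  el("H1", [txt("Hello!")]),
  txt("\n    "),
  el("SCRIPT", [txt("var ignored = true;")]),
  el("DIV", [txt("Fine, and you?")])
]);

console.log(collectTrimmedText(sampleTree)); // [ 'Hello!', 'Fine, and you?' ]
```

In a real page you would pass a DOM node instead of the stand-in. Passing document.documentElement rather than document.body should also pick up the <title> text, since the walk works from any starting node.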

- Pointy

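Yes, it gets all the text (except the title), but it doesn't store it into an array. - smotru

If you don't know how to store something in an array, you've come to the wrong page. I'm sure you can see how it's done by looking at the other answers. - iConnor

@Connor I know how to store things in an array, but I don't quite understand Pointy's script... some comments would help.. - smotru

@smotru, the script has been updated - it's just a recursive traversal of the DOM. - Pointy

I see, thanks for your help, but it only works on document.body. - smotru

Your answer helped me get my work done, but Connor's script is cleaner, and it works for me. - smotru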

1

There seems to be a one-line solution (fiddle):

document.body.innerHTML.replace(/^\s*<[^>]*>\s*|\s*<[^>]*>\s*$|>\s*</g,'').split(/<[^>]*>/g)

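It may fail if there are complex scripts in the body, though. I know that parsing HTML with regular expressions is not a very clever idea, but for simple cases or demo purposes it can still be suitable, can't it? :)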
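
To see what the one-liner does and where it breaks, here is a small sketch that applies the same two regular expressions to literal strings instead of document.body.innerHTML. The function name splitTextByTags and the sample strings are made up for illustration.

```javascript
// Applies the regular expressions from the one-liner to an HTML string.
// Step 1 strips the leading tag, the trailing tag, and every "> <" junction
// between adjacent tags, so a pair like "</h1>\n<div>" collapses into one
// pseudo-tag "</h1div>". Step 2 splits on whatever tags remain.
function splitTextByTags(html) {
  return html
    .replace(/^\s*<[^>]*>\s*|\s*<[^>]*>\s*$|>\s*</g, '')
    .split(/<[^>]*>/g);
}

var simple = '<h1>Hello!</h1>\n<div>What are you doing?</div><div>Fine, and you?</div>';
console.log(splitTextByTags(simple));
// [ 'Hello!', 'What are you doing?', 'Fine, and you?' ]

// Script content is not excluded: it ends up in the array like any other text.
var withScript = '<p>Hi</p><script>var a = 1;</script>';
console.log(splitTextByTags(withScript));
// [ 'Hi', 'var a = 1;' ]
```

The second call shows the limitation mentioned above: the regular expressions know nothing about <script> or <style>, so their contents are treated as ordinary text.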

- Ilya Streltsyn

0

    <html>
    <head>
            <title>Hello world!</title>
    </head>
    <body>
            <h1>Hello!</h1>
            <p>
                    <div>What are you doing?</div>
                    <div>Fine, 
                        <span> and you? </span>
                    </div>
            </p>
            <a href="http://google.com">Thank you!</a>
            <script type="text/javascript">
                function getLeafNodesOfHTMLTree(root) {
                    if (root.nodeType == 3) {
                        return [root];
                    } else {
                        var all = [];
                        for (var i = 0; i < root.childNodes.length; i++) {
                            var ret2 = getLeafNodesOfHTMLTree(root.childNodes[i]);
                            all = all.concat(ret2);
                        }
                        return all;
                    }
                }
                var allnodes = getLeafNodesOfHTMLTree(document.getElementsByTagName("html")[0]);
                console.log(allnodes);
                // in modern browsers that support Array filter and map
                allnodes = allnodes.filter(function (node) {
                    return node && node.nodeValue && node.nodeValue.replace(/\s/g, '').length;
                });
                allnodes = allnodes.map(function (node) {
                    return node.nodeValue
                })
                 console.log(allnodes);
            </script>
    </body>
    </html>

- jinwei

An element's "childNodes" are only its direct children. - Pointy

1

By the way, a P tag cannot contain DIV tags; see https://dev59.com/iWoy5IYBdhLWcg3wq_3O. - jinwei

I know, it's just an example... my script should be able to work on pages that don't pass W3 validation. - smotru

0

Walk the DOM tree, collect all the text nodes, and take each text node's nodeValue.

var result = [];
var itr = document.createTreeWalker(
    document.getElementsByTagName("html")[0],
    NodeFilter.SHOW_TEXT,
    null, // no filter
    false);
while(itr.nextNode()) {
    if(itr.currentNode.nodeValue != "")
        result.push(itr.currentNode.nodeValue);
}
alert(result);

Another approach: split the textContent of the HTML tag.

var result = document.getElementsByTagName("html")[0].textContent.split("\n");
// Iterate backwards so that splice() does not skip the entry after a removed one
for(var i=result.length-1; i>=0; i--)
    if(result[i] == "")
        result.splice(i, 1);
alert(result);
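
A variant of the split approach that builds a new array instead of removing entries in place might look like the sketch below. It also trims each line, so lines containing only indentation are dropped. splitLines is a hypothetical helper name and the sample string merely imitates what textContent could return for an indented page.

```javascript
// Splits a textContent string on newlines, trims each line, and keeps only
// the non-empty ones. splitLines is a made-up name for this sketch.
function splitLines(textContent) {
  return textContent
    .split("\n")
    .map(function (line) { return line.trim(); })
    .filter(function (line) { return line !== ""; });
}

// Sample input imitating the textContent of an indented page (an assumption).
var sample = "\n    Hello world!\n\n\n    Hello!\n    \n        What are you doing?\n";
console.log(splitLines(sample));
// [ 'Hello world!', 'Hello!', 'What are you doing?' ]
```

In a page this would be called as splitLines(document.documentElement.textContent). Like the code above, it assumes each piece of text sits on its own line in the source, and it does not exclude script or style content.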

- Louis Ricci

@Connor - createTreeWalker support is flaky in older versions of IE. - Louis Ricci

- iConnor · Accepted Answer

You can do something like this:

Demo

    var array = [];

    var elements = document.body.getElementsByTagName("*");

    for (var i = 0; i < elements.length; i++) {
        var current = elements[i];
        // Check that the element has no child elements && that its text is not just spaces/newlines
        if (current.children.length === 0 && current.textContent.replace(/ |\n/g, '') !== '') {
            array.push(current.textContent);
        }
    }

Result = ["What are you doing?", "Fine, and you?"]
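
One thing to be aware of with the leaf-element check in the loop above: text that sits directly inside an element which also has child elements is never collected. The sketch below wraps the same loop in a function (leafTexts is a hypothetical name) and runs it on plain stand-in objects modelled on the "Fine, <span> and you? </span>" markup from another answer, so it can be tried without a browser.

```javascript
// Same check as the loop above, with the element list passed in as a parameter.
function leafTexts(elements) {
  var array = [];
  for (var i = 0; i < elements.length; i++) {
    var current = elements[i];
    if (current.children.length === 0 && current.textContent.replace(/ |\n/g, '') !== '') {
      array.push(current.textContent);
    }
  }
  return array;
}

// Stand-in objects exposing only the two properties the loop reads (an assumption).
var span = { children: [], textContent: " and you? " };
var div = { children: [span], textContent: "Fine, \n and you? " };
var h1 = { children: [], textContent: "Hello!" };

console.log(leafTexts([h1, div, span]));
// [ 'Hello!', ' and you? ' ]   ("Fine, " is lost because the div has a child element)
```

In a page the call would be leafTexts(document.body.getElementsByTagName("*")). If mixed content like this matters, walking text nodes as in the other answers avoids the problem.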

Or you can use document.documentElement.getElementsByTagName('*');

Also make sure your code is inside this handler

document.addEventListener('DOMContentLoaded', function(){

   /// Code...
});
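
If the script might also be loaded after the page has already been parsed, the listener above would never fire. A common pattern, sketched here under that assumption, is to check document.readyState first. whenReady is a made-up helper name, and it takes the document as a parameter so the sketch can be exercised with stand-in objects.

```javascript
// Runs callback once the DOM is parsed: immediately if parsing has already
// finished, otherwise when DOMContentLoaded fires.
function whenReady(doc, callback) {
  if (doc.readyState === "loading") {
    doc.addEventListener("DOMContentLoaded", callback);
  } else {
    callback();
  }
}

// Stand-in documents for trying the helper outside a browser (assumptions).
var calls = [];
var loadedDoc = { readyState: "complete", addEventListener: function () {} };
whenReady(loadedDoc, function () { calls.push("ran immediately"); });

var pending = null;
var loadingDoc = {
  readyState: "loading",
  addEventListener: function (type, listener) { pending = listener; }
};
whenReady(loadingDoc, function () { calls.push("ran after DOMContentLoaded"); });
pending(); // simulate the browser firing the event

console.log(calls); // [ 'ran immediately', 'ran after DOMContentLoaded' ]
```

In a page you would call whenReady(document, function () { /* collect the text here */ });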

If you only need the title, you can simply do this

array.push(document.title);

This saves looping through scripts and stylesheets.