Traversing the DOM tree

Question


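As most (all?) PHP libraries that do HTML sanitization, such as HTML Purifier, depend heavily on regular expressions, I thought that writing an HTML sanitizer built on DOMDocument and its related classes would be a worthwhile experiment. I'm at a very early stage with this, but the project already shows some promise.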

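My idea revolves around a class that uses DOMDocument to traverse all nodes in the supplied markup, compares them against a whitelist, and removes anything that is not on it. (The first implementation is very basic and only removes nodes based on their type, but in the future I hope to make it more sophisticated and have it analyse each node's attributes, whether links point to a different domain, and so on.)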

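My question is: how do I traverse the DOM tree? As I understand it, DOM* objects have a childNodes property, so would I need to recurse over the whole tree? Also, early experiments with DOMNodeLists have shown that you need to be very careful about the order in which you remove things, or you may leave items behind or trigger exceptions.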

If anyone has experience with manipulating a DOM tree in PHP, I'd appreciate any feedback you have on the topic.

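EDIT: I've built the following method for my HTML cleaning class. It recursively walks the DOM tree and checks whether each element it finds is on the whitelist. If it isn't, the element is removed.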

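The problem I was hitting was that if you remove a node, the indexes of all subsequent nodes in the DOMNodeList change. Simply working from the bottom up avoids this problem. It's still a very basic approach, but I think it shows promise. It certainly runs a lot faster than HTMLPurifier, though admittedly Purifier does a lot more.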

/**
 * Recursively remove elements from the DOM that aren't whitelisted
 * @param DOMNode $elem
 * @return array List of elements removed from the DOM
 * @throws Exception If removal of a node fails
 */
private function cleanNodes (DOMNode $elem)
{
    $removed    = array ();
    if (in_array ($elem -> nodeName, $this -> whiteList))
    {
        if ($elem -> hasChildNodes ())
        {
            /*
             * Iterate over the element's children. The reason we go backwards is because
             * going forwards will cause indexes to change when elements get removed
             */
            $children   = $elem -> childNodes;
            $index      = $children -> length;
            while (--$index >= 0)
            {
                $removed = array_merge ($removed, $this -> cleanNodes ($children -> item ($index)));
            }
        }
    }
    else
    {
        // The element is not on the whitelist, so remove it
        if ($elem -> parentNode -> removeChild ($elem))
        {
            $removed [] = $elem;
        }
        else
        {
            throw new Exception ('Failed to remove node from DOM');
        }
    }
    return ($removed);
}

- GordonM


Don't do that. Don't reinvent the wheel. Reuse software that already exists. - dynamic


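Existing software such as HTMLPurifier is very slow and regex-based. I'm doing this partly to see whether there is a better way, and partly to use the exercise as a learning experience. - GordonM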

1 Answer

Content provided by Stack Overflow.

- Gordon · Accepted Answer

To start with, you can have a look at this custom RecursiveDOMIterator:

https://github.com/salathe/spl-examples/wiki/RecursiveDOMIterator

Code:

class RecursiveDOMIterator implements RecursiveIterator
{
    /**
     * Current Position in DOMNodeList
     * @var Integer
     */
    protected $_position;

    /**
     * The DOMNodeList with all children to iterate over
     * @var DOMNodeList
     */
    protected $_nodeList;

    /**
     * @param DOMNode $domNode
     * @return void
     */
    public function __construct(DOMNode $domNode)
    {
        $this->_position = 0;
        $this->_nodeList = $domNode->childNodes;
    }

    /**
     * Returns the current DOMNode
     * @return DOMNode
     */
    public function current()
    {
        return $this->_nodeList->item($this->_position);
    }

    /**
     * Returns an iterator for the current iterator entry
     * @return RecursiveDOMIterator
     */
    public function getChildren()
    {
        return new self($this->current());
    }

    /**
     * Returns if an iterator can be created for the current entry.
     * @return Boolean
     */
    public function hasChildren()
    {
        return $this->current()->hasChildNodes();
    }

    /**
     * Returns the current position
     * @return Integer
     */
    public function key()
    {
        return $this->_position;
    }

    /**
     * Moves the current position to the next element.
     * @return void
     */
    public function next()
    {
        $this->_position++;
    }

    /**
     * Rewind the Iterator to the first element
     * @return void
     */
    public function rewind()
    {
        $this->_position = 0;
    }

    /**
     * Checks if current position is valid
     * @return Boolean
     */
    public function valid()
    {
        return $this->_position < $this->_nodeList->length;
    }
}

You can use it in combination with a RecursiveIteratorIterator. There are usage examples on that page.
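As a rough illustration of that combination, here is a minimal sketch. It assumes PHP 7.4 or later and redeclares a condensed copy of the iterator (with return types added) so the snippet runs on its own; the sample XML and the `$elementNames` variable are made up for the example. Note the `SELF_FIRST` mode: with the default `LEAVES_ONLY`, only nodes without children would be visited.

```php
<?php
// Condensed copy of the RecursiveDOMIterator shown above, so this runs standalone.
final class RecursiveDOMIterator implements RecursiveIterator
{
    private int $position = 0;
    private DOMNodeList $nodeList;

    public function __construct(DOMNode $domNode)
    {
        $this->nodeList = $domNode->childNodes;
    }

    public function current(): ?DOMNode { return $this->nodeList->item($this->position); }
    public function getChildren(): self { return new self($this->current()); }
    public function hasChildren(): bool { return $this->current()->hasChildNodes(); }
    public function key(): int { return $this->position; }
    public function next(): void { $this->position++; }
    public function rewind(): void { $this->position = 0; }
    public function valid(): bool { return $this->position < $this->nodeList->length; }
}

// Hypothetical sample document.
$dom = new DOMDocument();
$dom->loadXML('<root><a><b>text</b></a><c/></root>');

// SELF_FIRST yields each parent before descending into its children.
$iterator = new RecursiveIteratorIterator(
    new RecursiveDOMIterator($dom),
    RecursiveIteratorIterator::SELF_FIRST
);

// Collect element names in document order, skipping text nodes.
$elementNames = [];
foreach ($iterator as $node) {
    if ($node->nodeType === XML_ELEMENT_NODE) {
        $elementNames[] = $node->nodeName;
    }
}

echo implode(', ', $elementNames), PHP_EOL; // prints "root, a, b, c"
```

This only reads the tree. If you remove nodes inside the loop, the positions the iterator relies on shift underneath it, so collect the nodes first and remove them afterwards.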

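In general, it is easier to search for blacklisted nodes with XPath than to traverse the DOM tree. Also, keep in mind that DOM is already pretty good at preventing XSS, because it automatically escapes XML entities in node values.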
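A minimal sketch of the XPath approach might look like the following. The blacklist (`script` and `iframe`), the sample markup and the variable names are assumptions for illustration only, not a complete sanitizer. The matches are copied into a plain array before anything is removed, and the last part shows a text node being escaped on output.

```php
<?php
// Hypothetical input; a real sanitizer would receive untrusted markup here.
$html = '<div><p>Keep me</p><script>alert(1)</script><iframe src="x"></iframe></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of emitting them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Assumed blacklist: every <script> and <iframe> element in the document.
// Copy the matches into an array so removals cannot disturb the loop.
$blacklisted = iterator_to_array($xpath->query('//script | //iframe'));

foreach ($blacklisted as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

// Text added through the DOM API is treated as text, not markup,
// so the angle brackets are escaped when the document is serialized.
$paragraph = $xpath->query('//p')->item(0);
$paragraph->appendChild($dom->createTextNode(' <b>not markup</b>'));

$clean = $dom->saveHTML();
echo $clean;
```

A blacklist is easier to express in XPath, but it only catches what you thought to list, so it is not a drop-in replacement for the whitelist in the question.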

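The other thing to be aware of is that any change to the DOMDocument immediately affects any DOMNodeList you may be holding from XPath queries, which can lead to skipped nodes while you manipulate them. See "DOMNode replacement with PHP's DOM classes" for an example.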
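One defensive pattern, sketched here under assumptions, is to take a snapshot of the list with `iterator_to_array()` before changing anything. The example uses `getElementsByTagName()`, whose result is a live list, and replaces every `<b>` with a `<strong>`; the markup and variable names are invented for the illustration.

```php
<?php
// Hypothetical sample document.
$dom = new DOMDocument();
$dom->loadXML('<p><b>one</b> <b>two</b> <b>three</b></p>');

// Live list: it shrinks as <b> elements leave the tree.
$liveList = $dom->getElementsByTagName('b');

// Snapshot into a plain PHP array, which later DOM changes cannot alter.
$snapshot = iterator_to_array($liveList);

foreach ($snapshot as $old) {
    $new = $dom->createElement('strong');

    // Move the children across; appendChild() detaches them from $old.
    while ($old->firstChild !== null) {
        $new->appendChild($old->firstChild);
    }

    $old->parentNode->replaceChild($new, $old);
}

$result = $dom->saveXML($dom->documentElement);
echo $result, PHP_EOL;
// prints: <p><strong>one</strong> <strong>two</strong> <strong>three</strong></p>
```

Once the loop finishes, `$liveList->length` is 0 because no `<b>` elements remain, while `$snapshot` still holds the three detached nodes. Iterating backwards by index, as in the question, is the other common way to avoid the problem.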