What is the jsoup equivalent of extractContents() from DOM Range operations?

3

I am trying to use the jsoup DOM model to extract and replace the equivalent of JavaScript DocumentFragments.

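Is there any ready-made code that emulates DOM Range selection and manipulation? I would like to select a range of text that may span several inline nodes (such as <a>, <span>, etc.) and whose start or end point may also fall in the middle of such an inline node. In JavaScript this is easy with Range operations: you extract a DocumentFragment from the range, wrap it, and so on. I guess the JavaScript Range splits the inner nodes as needed to handle such extractions and insertions correctly. How can I do the same with jsoup in Java?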

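Edit: Just thinking about how to do this. I would probably need to find the "peak" element within my range, then go to the start and end points of the range and "lift" each of them to the "peak level": jump to the parent if my start node is its first child, otherwise split the parent's child list just before the range's start element... If such ready-made code exists I would rather reuse it; otherwise I will have to write it from scratch.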
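To make the "lift and split" idea above concrete, here is a minimal sketch on a toy tree. The `N` class and the `liftStart` and `render` helpers are hypothetical names invented for this illustration and are not part of jsoup; a real implementation would work on jsoup `Node` objects and would need the mirrored logic for the end point as well.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    /** Hypothetical minimal tree node, standing in for a DOM node. */
    static class N {
        final String name;
        N parent;
        final List<N> kids = new ArrayList<>();
        N(String name) { this.name = name; }
        N add(N child) { child.parent = this; kids.add(child); return this; }
        int depth() { int d = 0; for (N p = parent; p != null; p = p.parent) d++; return d; }
    }

    /**
     * Lifts the range start up to targetDepth. If the node is not the first
     * child, the parent is split: the node and its following siblings move
     * into a same-named copy of the parent, inserted right after the parent.
     */
    static N liftStart(N start, int targetDepth) {
        N node = start;
        while (node.depth() > targetDepth) {
            N parent = node.parent;
            int idx = parent.kids.indexOf(node);
            if (idx == 0) {
                node = parent;               // first child: just step up
                continue;
            }
            N grand = parent.parent;
            if (grand == null) throw new IllegalStateException("cannot split the root");
            List<N> tail = new ArrayList<>(parent.kids.subList(idx, parent.kids.size()));
            parent.kids.subList(idx, parent.kids.size()).clear();
            N copy = new N(parent.name);
            for (N k : tail) copy.add(k);    // re-parent the tail children
            copy.parent = grand;
            grand.kids.add(grand.kids.indexOf(parent) + 1, copy);
            node = copy;
        }
        return node;
    }

    /** Renders leaves as their name and inner nodes as <name>...</name>. */
    static String render(N n) {
        if (n.kids.isEmpty()) return n.name;
        StringBuilder sb = new StringBuilder("<" + n.name + ">");
        for (N k : n.kids) sb.append(render(k));
        return sb.append("</").append(n.name).append(">").toString();
    }

    public static void main(String[] args) {
        N a = new N("a"), b = new N("b"), c = new N("c");
        N p = new N("p").add(new N("span").add(a).add(b).add(c));
        liftStart(b, 1);                     // range starts at leaf "b"
        System.out.println(render(p));       // <p><span>a</span><span>bc</span></p>
    }
}
```

After the call, the start of the range is the second `span`, which sits at the same depth as the original one, so the range can be wrapped by collecting siblings at that level.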

Update, Dec 18, 2015: I have posted my answer with the working code I developed; see below.

2 Answers

1

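Here is the code I promised. It wraps an arbitrary range of the DOM body in arbitrary HTML markup, which makes it easy to extract, move, replace, copy/paste, and so on.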

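Update, Dec 19, 2015: Added TextNode splitting through a wrapRange() variant that takes optional offsets specifying where within the text node the range should start or end. Arbitrary copy/paste/move within the jsoup DOM model is now possible.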
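As a rough illustration of what the offsets mean, the following sketch models a text node as a plain String. The `splitAt` helper is a hypothetical stand-in, not a jsoup method; it only mirrors the head/tail idea of splitting a text node at an offset, under the assumption that an offset outside the text means no split.

```java
class TextSplitSketch {
    /**
     * Splits text at offset into {head, tail}. Returns the text unsplit
     * (a one-element array) when the offset does not fall inside the text.
     */
    static String[] splitAt(String text, int offset) {
        if (offset <= 0 || offset >= text.length()) {
            return new String[] { text };
        }
        return new String[] { text.substring(0, offset), text.substring(offset) };
    }

    public static void main(String[] args) {
        String[] start = splitAt("Hello there", 6);   // range starts inside this text
        String[] end = splitAt("now and then", 3);    // range ends inside this text
        String keptFromStart = start[start.length - 1]; // the start node keeps its tail
        String keptFromEnd = end[0];                    // the end node keeps its head
        System.out.println(keptFromStart + " ... " + keptFromEnd); // there ... now
    }
}
```

In other words, the start offset cuts away everything before it, and the end offset cuts away everything from it onward.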

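To do (for myself or some other kind soul):

  • Write a sample project demonstrating this, add some test cases, and publish it on GitHub. I have no time for this right now, but the code seems to work well in my application (which processes HTML from web pages and e-books so that it can be read aloud with TTS; see the @Voice Aloud Reader app on Google Play).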

RangeWrapper.java module:

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.parser.Tag;

import java.util.ArrayList;

/**
 * Created by greg on 12/18/2015.
 */
public class RangeWrapper {
    /**
     * Wrap the supplied HTML around the "range" from startEl to endEl.
     * @param startEl the first element to be included into the range
     * @param endEl the last element to be included into the range
     * @param html HTML to wrap around this element, e.g. 
     * {@code <span class="head"></span>}. Can be arbitrarily deep.
     * @return the wrapping element
     */
    public static Element wrapRange(Node startEl, Node endEl, String html) {
        if (startEl == endEl) { // special case
            return (Element) startEl.wrap(html).parentNode();
        }
        int startDepth = NodeWalker.getNodeDepth(startEl);
        int endDepth = NodeWalker.getNodeDepth(endEl);
        int minDepth = getRangeMinDepth(startEl, endEl);
        int n;
        while (startDepth > minDepth) {
            Element parent = (Element)startEl.parentNode();
            if ((n = startEl.siblingIndex()) > 0) {
                // splitting the parent
                ArrayList<Node> children = new ArrayList<Node>(parent.childNodes());
                // Clone the attributes so the two halves do not share one Attributes object
                Element parent2 = new Element(Tag.valueOf(parent.tagName()), parent.baseUri(), parent.attributes().clone());
                parent.after(parent2);
                for (int i = n; i < children.size(); i++)
                    parent2.appendChild(children.get(i));
                startEl = parent2;
            } else {
                startEl = parent;
            }
            startDepth--;
        }
        while (endDepth > minDepth) {
            Element parent = (Element)endEl.parentNode();
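            // siblingIndex() counts all child nodes (text included), so compare with childNodeSize()
            if ((n = endEl.siblingIndex()) < parent.childNodeSize()-1) {
                // splitting the parent
                ArrayList<Node> children = new ArrayList<Node>(parent.childNodes());
                // Clone the attributes so the two halves do not share one Attributes object
                Element parent2 = new Element(Tag.valueOf(parent.tagName()), parent.baseUri(), parent.attributes().clone());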
                parent.before(parent2);
                for (int i = 0; i <= n; i++)
                    parent2.appendChild(children.get(i));
                endEl = parent2;
            } else {
                endEl = parent;
            }
            endDepth--;
        }
        // Now startEl and endEl are on the same depth == minDepth. 
        // Wrap the range with our html string
        Element range = (Element) startEl.wrap(html).parentNode();
        Node nextToAppend;
        do {
            nextToAppend = range.nextSibling();
            // If nextToAppend is null, something is really wrong...
            // Commented out to let it crash and investigate,
            // so far it did not happen.
            //if (nextToAppend == null)
            //    break;
            range.appendChild(nextToAppend);
        } while (nextToAppend != endEl);

        return range;
    }

    /**
     * Wrap the supplied HTML around the "range" from startEl to endEl.
     * @param startEl the first element to be included into the range
     * @param stOffset if startEl is TextNode, split at this offset
     *                   and include only the tail. Otherwise ignored.
     * @param endEl the last element to be included into the range
     * @param endOffset if endEl is a Text node, split at this offset
     *                    and include only the head. Otherwise ignored.
     * @param html HTML to wrap around this element, e.g. {@code <span class="head"></span>}. Can be arbitrarily deep.
     * @return the wrapping element
     */
    public static Element wrapRange(Node startEl, int stOffset, Node endEl, int endOffset, String html) {
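        boolean sameNode = (startEl == endEl);
        if (stOffset > 0 && startEl instanceof TextNode) {
            TextNode tn = (TextNode) startEl;
            // Split only when the offset falls inside the text (splitText needs offset < length)
            if (stOffset < tn.getWholeText().length()) {
                startEl = tn.splitText(stOffset); // Splits tn and adds tail to DOM, returns tail
                if (sameNode) {
                    // The range ends in the same text node: continue in the tail, offset made relative to it
                    endEl = startEl;
                    endOffset -= stOffset;
                }
            }
        }
        if (endOffset > 0 && endEl instanceof TextNode) {
            TextNode tn = (TextNode) endEl;
            if (endOffset < tn.getWholeText().length()) {
                tn.splitText(endOffset); // Splits tn and adds tail to DOM, we take head == original endEl
            }
        }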

        return wrapRange(startEl, endEl, html);
    }


    /**
     * Calculate the depth of the range between the two given nodes, relative to body.
     * The body has depth 0.
     * @param startNode the first element to be included into the range
     * @param endNode the last element to be included into the range
     * @return minimum depth found in the range
     */
    public static int getRangeMinDepth(final Node startNode, final Node endNode) {
        class DepthVisitor implements NodeWalker.NodeWalkVisitor {
            private int _minDepth = Integer.MAX_VALUE;
            public boolean head(Node node, int depth) {
                if (depth < _minDepth)
                    _minDepth = depth;
                return true;
            }
            public boolean tail(Node node, int depth) {return true;}
            int getMinDepth() { return _minDepth; }
        };
        DepthVisitor visitor = new DepthVisitor();
        NodeWalker nw = new NodeWalker(visitor);
        nw.walk(startNode, endNode);
        return visitor.getMinDepth();
    }
}

...and NodeWalker.java, used by the code above, which is adapted from the NodeTraversor and NodeVisitor classes in the jsoup package:

import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.select.NodeVisitor;

/**
 * Depth-first node traversor. Use to iterate through all nodes under and including the specified root node.
 * <p>
 * This implementation does not use recursion, so a deep DOM does not risk blowing the stack.
 * </p>
 */
public class NodeWalker {
    private NodeWalkVisitor visitor;

    /**
     * Create a new traversor.
     * @param visitor a class implementing the {@link NodeWalkVisitor} interface, to be called when visiting each node.
     */
    public NodeWalker(NodeWalkVisitor visitor) {
        this.visitor = visitor;
    }

    /**
     * Start a depth-first traverse of the whole body and all of its descendants.
     * @param startNode the arbitrary start point node point within body to traverse from.
     * @param endNode the arbitrary end point node point within body where we stop traverse.
     *                Can be null, in which case we walk until the end of the body.
     */
    public void walk(Node startNode, Node endNode) {
        Node node = startNode;
        int depth = getNodeDepth(startNode); // let's calculate depth relative to body, body is depth 0

        while (node != null) {
            if (!visitor.head(node, depth))
                break;
            if (node.childNodeSize() > 0) {
                node = node.childNode(0);
                depth++;
            } else {
                while (node.nextSibling() == null && depth > 0) {
                    if (!visitor.tail(node, depth) || node == endNode)
                        return;
                    node = node.parentNode();
                    depth--;
                }
                if (!visitor.tail(node, depth) || node == endNode)
                    break;
                node = node.nextSibling();
            }
        }
    }

// The walkBack() was not needed, but leaving it here, may be useful for something...
//    /**
//     * Start a depth-first backward traverse of the whole body and all of its descendants.
//     * @param startNode the arbitrary start point node point within body to traverse from.
//     * @param endNode the arbitrary end point node point within body where we stop traverse.
//     *                Can be null, in which case we walk until the end of the body.
//     */
//    public void walkBack(Node startNode, Node endNode) {
//        Node node = startNode;
//        int depth = getNodeDepth(startNode); // let's calculate depth relative to body, body is depth 0
//
//        while (node != null) {
//            if (!visitor.tail(node, depth))
//                break;
//            if (node.childNodeSize() > 0) {
//                node = node.childNode(node.childNodeSize() - 1);
//                depth++;
//            } else {
//                while (node.previousSibling() == null && depth > 0) {
//                    if (!visitor.head(node, depth) || node == endNode)
//                        return;
//                    node = node.parentNode();
//                    depth--;
//                }
//                if (!visitor.head(node, depth) || node == endNode)
//                    break;
//                node = node.previousSibling();
//            }
//        }
//    }

    /**
     * Calculate the depth of the given node relative to body. The body has depth 0.
     * @param givenNode the node within the body to calculate depth for.
     * @return the depth of the givenNode
     */
    public static int getNodeDepth(Node givenNode) {
        Node node = givenNode;
        int depth = 0; // let's calculate depth relative to body, body is depth 0
        if (!(node instanceof Element) || !"body".equals(((Element) node).tagName())) {
            do {
                depth++;
                node = (Element)node.parentNode();
            } while (node != null && !"body".equals(((Element) node).tagName()));
        }
        return depth;
    }

    public interface NodeWalkVisitor {
        /**
         * Callback for when a node is first visited.
         *
         * @param node the node being visited.
         * @param depth the depth of the node, relative to the root node. E.g., the root node has depth 0, and a child node
         * of that will have depth 1.
         * @return true to continue walk, false to abort
         */
        boolean head(Node node, int depth);

        /**
         * Callback for when a node is last visited, after all of its descendants have been visited.
         *
         * @param node the node being visited.
         * @param depth the depth of the node, relative to the root node. E.g., the root node has depth 0, and a child node
         * of that will have depth 1.
         * @return true to continue walk, false to abort
         */
        boolean tail(Node node, int depth);
    }
}

Greg


1

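Two points:

  • JSoup provides several methods for working with text nodes that return String objects.
  • Java and its ecosystem provide powerful APIs for manipulating String objects.

Before writing DOM Range operations from scratch, you could try the two options above.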
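As a minimal sketch of the String-based route, assuming the text has already been obtained as a plain String (for example from Element#text()), a "range" is just a pair of character offsets. The class and method names here are hypothetical. Note that working on the flattened text discards any inline markup inside the range, so this only fits cases where the tags do not need to be preserved.

```java
class StringRangeSketch {
    /** Returns the characters in [start, end), the String analogue of extracting a range. */
    static String extract(String text, int start, int end) {
        return text.substring(start, end);
    }

    /** Replaces the characters in [start, end) with the given replacement. */
    static String replace(String text, int start, int end, String replacement) {
        return new StringBuilder(text).replace(start, end, replacement).toString();
    }

    /** Surrounds the characters in [start, end) with the open and close markers. */
    static String wrap(String text, int start, int end, String open, String close) {
        return text.substring(0, start) + open + text.substring(start, end)
                + close + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "Hello there now!";           // e.g. the result of p.text()
        int start = text.indexOf("there");
        int end = start + "there".length();
        System.out.println(extract(text, start, end));            // there
        System.out.println(replace(text, start, end, "world"));   // Hello world now!
        System.out.println(wrap(text, start, end, "[", "]"));     // Hello [there] now!
    }
}
```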

Here are some of the methods in the JSoup API:


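  • Element#text() returns the combined, unencoded text of this element as a String.
    API excerpt:

    Given HTML <p>Hello <b>there</b> now! </p>, p.text() returns "Hello there now!"

  • Element#ownText() returns the unencoded text owned by this element only, without the text of any of its child elements.
    API excerpt:

    For example, given HTML <p>Hello <b>there</b> now!</p>, p.ownText() returns "Hello now!", whereas p.text() returns "Hello there now!". Note that the text within the b element is not returned, as it is not a direct child of the p element.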


You may also find the following two code snippets useful:


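Thanks for your help! Yes, I already use most of these functions in other parts of my code, but I am still trying to figure out how to do range operations: extracting and replacing the Java equivalent of JavaScript DocumentFragments. - gregko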
1
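Nice answer! Well structured and informative. +1 :) - luksch
@gregko, could you add to your post what you have tried for "extracting and replacing the equivalent of JavaScript DocumentFragments"? - Stephan
@Stephan - Thanks, I edited the post and added your wording at the beginning, and also added "documentfragment" as another keyword. - gregko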
