Java字符串模式识别

6
我有一个大约1000个字符长的字符串,由L、T和A组成。我相信其中存在一个简单模式,想知道是否有快速且简单的方法找到它。这个字符串会改变,所以不仅仅是一次性的。
我要寻找的模式例如,如果该字符串是:
LLLLLLLLAATAALLLLALLLLLLAATAALLLATLLLLLAATAALLAALLLLLAATAALL

这个字符串中,子串LLLLLAATAALL重复出现了4次。我想查找类似于这样的子串,但是我不知道它们在主字符串中的起始位置、结束位置、数量和长度。

如果有Java工具可以寻找这种类型的内容,任何建议都将不胜感激。

nt


1
当你不知道模式但想要找到它时,这对我来说是一种人工智能问题。我认为对于一个有限的字符串,你可以找到很多模式,唯一的问题是在下一个字符串中有多少个是正确的。 - vodkhang
这基本上是一个标准的计算机科学问题。请参阅http://en.wikipedia.org/wiki/Longest_repeated_substring_problem和http://www.cs.duke.edu/csed/algoprobs/substring.html。 - joschi
这个问题虽然简单,但真的很让我感兴趣!你拦截到了什么样的外星人传输?(-; 你只是在寻找重复的子字符串吗?我还可以想象其他更复杂的模式。无论如何,我现在正在研究解决方案,如果我找到任何有用的东西,我会发布答案。 - Adriaan Koster
@Adrian 哈哈,不幸地没有外星人。我正在考虑几种在 2D 图形文件中搜索重复模式的不同方法,这是我的第一次尝试。 - nite
2个回答

4

好的,我从这里获取了代码,并对其进行了修改,以便跟踪和打印最长重复子字符串。只需使用JUnit运行它即可。

/* Copyright (c) 2010 the authors listed at the following URL, and/or
 the authors of referenced articles or incorporated external code:
 http://en.literateprograms.org/Suffix_tree_(Java)?action=history&offset=20100123220137

 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the
 "Software"), to deal in the Software without restriction, including
 without limitation the rights to use, copy, modify, merge, publish,
 distribute, sublicense, and/or sell copies of the Software, and to
 permit persons to whom the Software is furnished to do so, subject to
 the following conditions:

 The above copyright notice and this permission notice shall be
 included in all copies or substantial portions of the Software.

 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
 EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
 MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
 IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
 CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
 TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
 SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

 Retrieved from: http://en.literateprograms.org/Suffix_tree_(Java)?oldid=16641
 */

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.junit.Test;

public class SuffixTree {
    @Test
    public void sampleUsage() {

        AbstractSuffixTree tree = new SimpleSuffixTree(
                "LLLLLLLLAATAALLLLALLLLLLAATAALLLATLLLLLAATAALLAALLLLLAATAALL");
        System.out.println("Longest repeating substring "
                + tree.best.printResult() + " repetitions=" + tree.best.visits
                + " length=" + tree.best.stringDepth);
    }
}

abstract class AbstractSuffixTree {

    SuffixTreeNode best;
    String text = null;
    SuffixTreeNode root = null;
    int inputAlphabetSize = -1;

    AbstractSuffixTree(String text) {
        if (text.length() > 0 && text.charAt(text.length() - 1) == '$') {
            this.text = text;
        } else {
            this.text = text + "$";
        }
    }

}

class SimpleSuffixTree extends AbstractSuffixTree {

    public SimpleSuffixTree(String text) {
        super(text);
        constructTree();
    }

    private void constructTree() {
        super.root = new SuffixTreeNode(this);
        best = root;
        char[] s = super.text.toCharArray();
        for (int i = 0; i < s.length; i++) {
            List<String> suffixList = new ArrayList<String>();
            for (int k = i; k < s.length; k++) {
                suffixList.add(s[k] + "");
            }
            super.root.addSuffix(suffixList, i + 1);
        }
    }

}

class CompactSuffixTree extends AbstractSuffixTree {

    public CompactSuffixTree(SimpleSuffixTree simpleSuffixTree) {
        super(simpleSuffixTree.text);
        super.root = compactNodes(simpleSuffixTree.root, 0);
        super.best = simpleSuffixTree.best;
    }

    private SuffixTreeNode compactNodes(SuffixTreeNode node, int nodeDepth) {
        node.nodeDepth = nodeDepth;
        for (SuffixTreeNode child : node.children) {
            while (child.children.size() == 1) {
                SuffixTreeNode grandchild = child.children.iterator().next();
                child.incomingEdge.label += ", "
                        + grandchild.incomingEdge.label;
                child.stringDepth += grandchild.incomingEdge.label.length();
                child.children = grandchild.children;
                // for (SuffixTreeNode grandchild : child.children)
                grandchild.parent = node;
            }
            child = compactNodes(child, nodeDepth + 1);
        }
        return node;
    }

}

class SuffixTreeNode {

    AbstractSuffixTree tree;
    SuffixTreeEdge incomingEdge = null;
    int nodeDepth = -1;
    int label = -1;
    Collection<SuffixTreeNode> children = null;
    SuffixTreeNode parent = null;
    int stringDepth;
    int id = 0;
    public static int c;
    public int visits = 1;

    public SuffixTreeNode(AbstractSuffixTree tree, SuffixTreeNode parent,
            String incomingLabel, int depth, int label, int id) {
        children = new ArrayList<SuffixTreeNode>();
        incomingEdge = new SuffixTreeEdge(incomingLabel, label);
        nodeDepth = depth;
        this.label = label;
        this.parent = parent;
        stringDepth = parent.stringDepth + incomingLabel.length();
        this.id = id;
        this.tree = tree;
    }

    public SuffixTreeNode(AbstractSuffixTree tree) {
        children = new ArrayList<SuffixTreeNode>();
        nodeDepth = 0;
        label = 0;
        this.tree = tree;
    }

    public void addSuffix(List<String> suffix, int pathIndex) {
        SuffixTreeNode insertAt = this;
        insertAt = search(this, suffix);
        insert(insertAt, suffix, pathIndex);
    }

    private SuffixTreeNode search(SuffixTreeNode startNode, List<String> suffix) {
        if (suffix.isEmpty()) {
            throw new IllegalArgumentException(
                    "Empty suffix. Probably no valid simple suffix tree exists for the input.");
        }
        Collection<SuffixTreeNode> children = startNode.children;
        for (SuffixTreeNode child : children) {
            if (child.incomingEdge.label.equals(suffix.get(0))) {
                suffix.remove(0);
                child.visits++;
                if (child.visits > 1
                        && child.stringDepth > tree.best.stringDepth) {
                    tree.best = child;
                }
                if (suffix.isEmpty()) {
                    return child;
                }
                return search(child, suffix);
            }
        }
        return startNode;
    }

    private void insert(SuffixTreeNode insertAt, List<String> suffix,
            int pathIndex) {
        for (String x : suffix) {
            SuffixTreeNode child = new SuffixTreeNode(tree, insertAt, x,
                    insertAt.nodeDepth + 1, pathIndex, id);
            insertAt.children.add(child);
            insertAt = child;
        }
    }

    public String toString() {
        StringBuilder result = new StringBuilder();
        String incomingLabel = this.isRoot() ? "" : this.incomingEdge.label;
        for (int i = 1; i <= this.nodeDepth; i++)
            result.append("\t");
        if (this.isRoot()) {
            c = 1;
            this.id = 1;
        } else {
            this.id = c;
            result.append(this.parent.id + " -> ");
            result.append(this.id + "[label=\"" + incomingLabel + "\"]" + "("
                    + visits + "," + (stringDepth) + ")" + ";\n");
        }
        for (SuffixTreeNode child : children) {
            c++;
            child.id = c;
            result.append(child.toString());
        }
        return result.toString();
    }

    public String printResult() {
        if (parent == null) {
            return "";
        } else {
            return this.parent.printResult() + this.incomingEdge.label;
        }
    }

    public boolean isRoot() {
        return this.parent == null;
    }

    public boolean isLeaf() {
        return children.size() == 0;
    }

}

class SuffixTreeEdge {

    String label = null;
    @SuppressWarnings("unused")
    private int branchIndex = -1;

    public SuffixTreeEdge(String label, int branchIndex) {
        this.label = label;
        this.branchIndex = branchIndex;
    }

}

输出:

最长重复子串 LLLLLLAATAALLL 重复次数=2 长度=14

编辑:回答你的评论。

目前我只是跟踪最长重复的SuffixTreeNode(它是AbstractSuffixTree中的一个字段)。您可以修改它,使其跟踪按其stringDepth排序的节点SortedQueue。


非常感谢。这几乎做到了我想要的。唯一的问题是,我不是在寻找最长重复子字符串。当我在一个实际的测试字符串上使用它时,该函数返回了类似于“LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL”的结果,而该字符串有1000个字符。 - nite
1
可能我正在寻找的字符串是最长的,也可能不是。我还没有深入研究您提供的代码,但是否有一种方法可以查找第二长、第三长等? - nite

-1
你可以从以下代码中了解如何计算字符串“hi”的数量: public int countHi(String str) {
int count = 0, i = str.length() - 1; String goal = "hi";
for (int j = 0; j < i; j++) { if (str.substring(j, j + 2).equals(goal)) count++; } return count; }

2
嗨Jinjavajin,我知道如何在字符串中搜索子字符串。问题是我不知道我要搜索什么子字符串。因此,像指定“目标”这样的东西只有在我知道模式是什么之后才能起作用。 - nite
2
你的代码片段展示了可怕的暴力破解,当你有成千上万个字符和未知目标时,这并不是一个好主意。 - Gaim

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接