Java Lucene NGramTokenizer

Question

13

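I am trying to tokenize a string into ngrams. Strangely, in the documentation for NGramTokenizer I do not see a method that returns the individual ngrams that were tokenized. In fact, I only see two methods in the NGramTokenizer class that return String objects.

Here is my code: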

Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);

Where can I find the ngrams that were tokenized?
How can I get the output as strings/words?

I want my output to look like this: This, is, a, test, string, This is, is a, a test, test string, This is a, is a test, a test string.

- CodeKingPlusPlus

4 Answers

1

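For a recent version of Lucene (4.2.1), this is clean code that works. Before running it, you need to add two JAR files to the classpath:

lucene-core-4.2.1.jar
lucene-analyzers-common-4.2.1.jar

You can find these files at http://www.apache.org/dyn/closer.cgi/lucene/java/4.2.1.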

//LUCENE 4.2.1
Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);

CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
// The TokenStream contract expects reset() before the first incrementToken().
gramTokenizer.reset();

while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    System.out.println(token);
}
gramTokenizer.end();
gramTokenizer.close();

- Amir

0

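Without having written a test program, I would guess that incrementToken() advances to the next token, which will be one of the ngrams.

For example, with ngram lengths of 1-3 and the string 'a b c d', NGramTokenizer could return: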

a
a b
a b c
b
b c
b c d
c
c d
d

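where 'a', 'a b', etc. are the generated ngrams.

[Edit]

You may also want to look at "Querying lucene tokens without indexing", since it discusses how to inspect a token stream.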
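
To make the listing above concrete, here is a minimal plain-Java sketch that produces the same sequence for 'a b c d'. It does not use Lucene at all: the class name WordNGramSketch and the method wordNGrams are hypothetical names chosen for illustration, and the code assumes whitespace-separated words. As the other answers in this thread point out, NGramTokenizer itself works on characters, so this only illustrates the word-level idea.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: word-level n-grams in plain Java.
class WordNGramSketch {

    // Returns every run of minN..maxN consecutive words, grouped by start position.
    static List<String> wordNGrams(String text, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return grams; // no words, no n-grams
        }
        String[] words = trimmed.split("\\s+");
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            // Grow the gram one word at a time, stopping at maxN or the end of the input.
            for (int n = 1; n <= maxN && start + n <= words.length; n++) {
                if (n > 1) {
                    sb.append(' ');
                }
                sb.append(words[start + n - 1]);
                if (n >= minN) {
                    grams.add(sb.toString());
                }
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints one gram per line: a, a b, a b c, b, b c, b c d, c, c d, d
        for (String gram : wordNGrams("a b c d", 1, 3)) {
            System.out.println(gram);
        }
    }
}
```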

- Mark Leighton Fisher

0

package ngramalgoimpl;
import java.util.*;

public class ngr {

    public static List<String> n_grams(int n, String str) {
        List<String> n_grams = new ArrayList<String>();
        String[] words = str.split(" ");
        for (int i = 0; i < words.length - n + 1; i++)
            n_grams.add(concatenation(words, i, i+n));
        return n_grams;
    }
    /* StringBuilder is a mutable character sequence, used here to join the words. */
    public static String concatenation(String[] words, int start, int end) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < end; i++)
            sb.append((i > start ? " " : "") + words[i]);
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            for (String ngram : n_grams(n, "This is my car."))
                System.out.println(ngram);
            System.out.println();
        }
    }
}

- Shanmukh Borole

Please provide context: what does this code do, and how does it answer the question? - Kevin Kloet

@KevinKloet Please see the question and the answers given. - Shanmukh Borole

- femtoRgon · Accepted Answer

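I don't think you'll find what you're looking for by searching for methods that return String. You need to work with Attributes.

It should work something like this: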

Reader reader = new StringReader("This is a test string");
NGramTokenizer gramTokenizer = new NGramTokenizer(reader, 1, 3);
CharTermAttribute charTermAttribute = gramTokenizer.addAttribute(CharTermAttribute.class);
gramTokenizer.reset();

while (gramTokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    //Do something
}
gramTokenizer.end();
gramTokenizer.close();

If you need to reuse the Tokenizer afterwards, make sure to reset() it.

Per the comments, to tokenize groups of words rather than characters:
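
Note that the tokens produced this way are character n-grams, not word groups. As a rough illustration of what that means, here is a plain-Java sketch that does not use Lucene; CharNGramSketch and charNGrams are hypothetical names, and the ordering shown (grouped by start position) is an assumption that may differ from the order a given Lucene version emits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Lucene code: character-level n-grams in plain Java.
class CharNGramSketch {

    // Returns every substring of length minN..maxN, grouped by start position.
    static List<String> charNGrams(String text, int minN, int maxN) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            for (int n = minN; n <= maxN && start + n <= text.length(); n++) {
                grams.add(text.substring(start, start + n));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // Prints [t, te, tes, e, es, est, s, st, t]
        System.out.println(charNGrams("test", 1, 3));
    }
}
```

Spaces count as ordinary characters here, so running the same method over a whole sentence yields fragments that cross word boundaries rather than whole words.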

Reader reader = new StringReader("This is a test string");
TokenStream tokenizer = new StandardTokenizer(Version.LUCENE_36, reader);
// ShingleFilter requires a minimum shingle size of at least 2; single words (unigrams) are still emitted by default.
tokenizer = new ShingleFilter(tokenizer, 2, 3);
CharTermAttribute charTermAttribute = tokenizer.addAttribute(CharTermAttribute.class);
tokenizer.reset();

while (tokenizer.incrementToken()) {
    String token = charTermAttribute.toString();
    //Do something
}
tokenizer.end();
tokenizer.close();