Lucene术语计数的最快方法是什么？

Question

Lucene术语计数的最快方法是什么？

6

我希望能够统计Lucene中某个字段上术语的文档数量。我知道3种方法做到这一点，我想知道最好和最快的实践方法是什么：

我将在一个长类型单值字段("field")中搜索该术语(所以不是文本，而是数字化数据！)

以下任何示例代码都将首先使用：

Directory dirIndex = FSDirectory.open('/path/to/index/');
IndexReader indexReader = DirectoryReader.open(dirIndex);
final BytesRefBuilder bytes = new BytesRefBuilder(); 
NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(),0,bytes);

1) 使用 Index 中的 docFreq()

TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
termEnum.seekExact(bytes.toBytesRef());
int count = termEnum.docFreq();

2) 搜索它

IndexSearcher searcher = new IndexSearcher(indexReader);
TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
TotalHitCountCollector collector = new TotalHitCountCollector();
searcher.search(query,collector);
int count = collector.getTotalHits();

3）从索引中读取精确匹配的内容，并逐个计算文档数

TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);
termEnum.seekExact(bytes.toBytesRef());
Bits liveDocs = MultiFields.getLiveDocs(indexReader);
DocsEnum docsEnum = termEnum.docs(liveDocs, null);
int count = 0;
if (docsEnum != null) {
    int docx;
    while ((docx = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        count++;
    }
 }

最佳方法

选项1）代码最短，但是如果您更新和删除索引中的文档，它基本上是无用的。它将已删除的文档计算为仍在那里。没有在很多地方记录（除了官方文档外，在这里的答案中没有）。需要注意的事项。也许有一种方法可以避免这种情况，否则对此方法的热情有点过热。选项2）和3）产生正确的结果，但哪个更好？或者更好的是 - 有没有更快的方法来做到这一点？

- Jeroen

1

你比较过2)和3)的性能吗？个人而言，我可能会选择2)，我认为那是“官方”的做法。 - Marko Topolnik

我担心Lucene的缓存机制会影响性能测量结果，这可能会很棘手。 - Jeroen

2

重点是要按照你计划使用的方式来精确测量它。 - Marko Topolnik

1

测量的难点在于：如果我首先使用选项2和选项3获取DocId 1的这些值，则系统可能已经将其缓存为选项2，因此选项3会使用缓存并显得更快。好的，所以我使用System.nanoTime()来同时测量选项2和3的开始和结束时间。平均来说，在100个案例中，选项3快了8倍！即使我对同一文档运行几次，第二次运行也会从缓存中受益。这是有道理的，读取数字精确值不需要任何标准化，因此搜索程序可能会创建不必要的开销。 - Jeroen

1

现在我们都学到了一些东西 :) 我建议用这个有价值的发现写下你自己的答案。如果可能的话，请包括你的测试代码，以便人们可以证实这种说法。 - Marko Topolnik

完成了 - 唯一的疑问是选项2）可能是对结果进行排名，这对于文档计数并不需要。 - Jeroen

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Jeroen · Accepted Answer

通过测试发现，使用索引获取文档而不是搜索它们（即选项3而非选项2）似乎更快（平均值：选项3在100个文档示例中比选项2快8倍）。我还反转了测试，以确保先运行一个测试再运行另一个测试不会影响结果：实际上并没有影响。

因此，似乎搜索器在执行简单的文档计数时创建了相当多的开销，如果查找单个术语条目的文档计数，则在索引中进行查找是最快的。

用于测试的代码（使用 SOLR 索引中的前 100 条记录）：

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.DocsEnum;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.util.Bits;
import org.apache.lucene.index.MultiFields;

public class ReadLongTermReferenceCount {

    public static void main(String[] args) throws IOException {

        Directory dirIndex = FSDirectory.open('/path/to/index/');
        IndexReader indexReader = DirectoryReader.open(dirIndex);
        final BytesRefBuilder bytes = new BytesRefBuilder(); 


        TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);

        IndexSearcher searcher = new IndexSearcher(indexReader);
        TotalHitCountCollector collector = new TotalHitCountCollector();

        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        final BytesRefBuilder bytes = new BytesRefBuilder(); // for reuse!
        int maxDoc = indexReader.maxDoc();
        int docsPassed = 0;
        for (int i=0; i<maxDoc; i++) {
            if (docsPassed==100) {
                break;
            }
            if (liveDocs != null && !liveDocs.get(i))
                continue;
            Document doc = indexReader.document(i);

            //get longTerm from this doc and convert to BytesRefBuilder
            String longTerm = doc.get("longTerm");
            NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(),0,bytes);

            //time before the first test
            long time_start = System.nanoTime();

            //look in the "field" index for longTerm and count the number of documents
            int count = 0;
            termEnum.seekExact(bytes.toBytesRef());
            DocsEnum docsEnum = termEnum.docs(liveDocs, null);
            if (docsEnum != null) {
                int docx;
                while ((docx = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    count++;
                }
            }

            //mid point: test 1 done, start of test 2
            long time_mid = System.nanoTime();

            //do a search for longTerm in "field"
            TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
            searcher.search(query,collector);
            int count = collector.getTotalHits();

            //end point: test 2 done.
            long time_end = System.nanoTime();

            //write to stdout
            System.out.println(longTerm+"\t"+(time_mid-time_start)+"\t"+(time_end-time_mid));

            docsPassed++;
        }
        indexReader.close();
        dirIndex.close();
    }
}

稍作修改即可与Lucene 5一起使用：

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.BytesRefBuilder;
import org.apache.lucene.util.NumericUtils;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TotalHitCountCollector;
import org.apache.lucene.util.Bits;
import org.apache.lucene.index.MultiFields;

public class ReadLongTermReferenceCount {

    public static void main(String[] args) throws IOException {

        Directory dirIndex = FSDirectory.open(Paths.get('/path/to/index/'));
        IndexReader indexReader = DirectoryReader.open(dirIndex);
        final BytesRefBuilder bytes = new BytesRefBuilder(); 


        TermsEnum termEnum = MultiFields.getTerms(indexReader, "field").iterator(null);

        IndexSearcher searcher = new IndexSearcher(indexReader);
        TotalHitCountCollector collector = new TotalHitCountCollector();

        Bits liveDocs = MultiFields.getLiveDocs(indexReader);
        final BytesRefBuilder bytes = new BytesRefBuilder(); // for reuse!
        int maxDoc = indexReader.maxDoc();
        int docsPassed = 0;
        for (int i=0; i<maxDoc; i++) {
            if (docsPassed==100) {
                break;
            }
            if (liveDocs != null && !liveDocs.get(i))
                continue;
            Document doc = indexReader.document(i);

            //get longTerm from this doc and convert to BytesRefBuilder
            String longTerm = doc.get("longTerm");
            NumericUtils.longToPrefixCoded(Long.valueOf(longTerm).longValue(),0,bytes);

            //time before the first test
            long time_start = System.nanoTime();

            //look in the "field" index for longTerm and count the number of documents
            int count = 0;
            termEnum.seekExact(bytes.toBytesRef());
            PostingsEnum docsEnum = termEnum.postings(liveDocs, null);
            if (docsEnum != null) {
                int docx;
                while ((docx = docsEnum.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    count++;
                }
            }

            //mid point: test 1 done, start of test 2
            long time_mid = System.nanoTime();

            //do a search for longTerm in "field"
            TermQuery query = new TermQuery(new Term("field", bytes.toBytesRef()));
            searcher.search(query,collector);
            int count = collector.getTotalHits();

            //end point: test 2 done.
            long time_end = System.nanoTime();

            //write to stdout
            System.out.println(longTerm+"\t"+(time_mid-time_start)+"\t"+(time_end-time_mid));

            docsPassed++;
        }
        indexReader.close();
        dirIndex.close();
    }
}