Is it possible to iterate through documents stored in a Lucene index?

Question

lucene lucene.net

26

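I have some documents stored in a Lucene index, each with a docId field. I want to get all the docIds stored in the index. The catch is that there are about 300,000 documents, so I would like to fetch the docIds in chunks of 500. Is that possible?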

- Eugeniu Torica

5 Answers

19

Lucene 4

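// liveDocs is null when the index has no deletions
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i = 0; i < reader.maxDoc(); i++) {
    // skip documents that are marked as deleted
    if (liveDocs != null && !liveDocs.get(i))
        continue;

    Document doc = reader.document(i);
}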

For details, see LUCENE-2600 on this page: https://lucene.apache.org/core/4_0_0/MIGRATE.html

- bcoughlan

This was rolled back by another user, but the original edit was correct: liveDocs can be null. - bcoughlan

1

Nowadays (Lucene 8.x) this would be MultiBits.getLiveDocs(reader). - Vlad

8

There is a query class called MatchAllDocsQuery, which I think can be used in this case:

Query query = new MatchAllDocsQuery();
TopDocs topDocs = getIndexSearcher().search(query, RESULT_LIMIT);

- Chunliang Lyu

2

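Document numbers (or IDs) are consecutive numbers from 0 to IndexReader.maxDoc()-1. These numbers are not persistent and are valid only within the opened IndexReader. You can check whether a document has been deleted with the IndexReader.isDeleted(int documentNumber) method.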
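
Because document numbers run contiguously from 0 to maxDoc()-1, the chunks of 500 asked about in the question can be expressed as plain index ranges. Here is a minimal sketch in plain Java with no Lucene dependency. DocIdChunks and ranges are hypothetical names used only for illustration, and each range would still need the deleted-document check when the documents are actually loaded.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: splits [0, maxDoc) into fixed-size ranges.
final class DocIdChunks {
    // Returns {start, endExclusive} pairs covering 0..maxDoc-1 in ascending order.
    static List<int[]> ranges(int maxDoc, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<int[]> out = new ArrayList<>();
        // long arithmetic avoids int overflow when maxDoc is close to Integer.MAX_VALUE
        for (long start = 0; start < maxDoc; start += chunkSize) {
            int end = (int) Math.min(start + chunkSize, (long) maxDoc);
            out.add(new int[] {(int) start, end});
        }
        return out;
    }

    public static void main(String[] args) {
        // maxDoc would come from reader.maxDoc(); 1200 is just a stand-in value.
        for (int[] r : ranges(1200, 500)) {
            System.out.println(r[0] + ".." + (r[1] - 1));
        }
    }
}
```

Since the numbers are only valid for the currently opened reader, all chunks would have to be processed against the same reader instance.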

- Yaroslav

0

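If you use .document(i) and skip deleted documents as in the examples above, be careful when using this approach to page through results. For example: you have a list with 10 documents per page and need the documents for page 6. Your input might look like this: offset=60, count=10 (documents 60 to 70).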

IndexReader reader = // create IndexReader
for (int i=offset; i<offset + 10; i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");
}

You will run into problems, because you should not start at offset=60, but at offset=60 plus the number of deleted documents that occur before 60.
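
One way to sidestep that miscount is to count only live documents while scanning, instead of adding the offset to the raw document number. Below is a minimal sketch in plain Java. It assumes the deletion state is modeled with a java.util.BitSet standing in for whatever the reader reports; LiveDocPager and page are hypothetical names, not Lucene APIs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical helper, not part of Lucene: pages over live (non-deleted) document numbers.
final class LiveDocPager {
    // live.get(i) == true means document i is NOT deleted (an assumption of this sketch).
    // Skips the first `offset` live documents, then returns up to `count` live document numbers.
    static List<Integer> page(BitSet live, int maxDoc, int offset, int count) {
        List<Integer> result = new ArrayList<>();
        int seenLive = 0;
        for (int i = 0; i < maxDoc && result.size() < count; i++) {
            if (!live.get(i)) {
                continue; // deleted documents do not count toward the offset
            }
            if (seenLive++ >= offset) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0, 10);  // documents 0..9 exist
        live.clear(2);    // document 2 is deleted
        live.clear(5);    // document 5 is deleted
        System.out.println(page(live, 10, 2, 3)); // prints [3, 4, 6]
    }
}
```

The trade-off is that every page request rescans from document 0, so the cost grows with the offset.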

Another alternative I found is this:

    IndexSearcher is = getIndexSearcher(); // new IndexSearcher(indexReader)
    // get all results without any conditions attached
    Term term = new Term([[any mandatory field name]], "*");
    Query query = new WildcardQuery(term);

    TopScoreDocCollector topCollector = TopScoreDocCollector.create([[int max hits to get]], true);
    is.search(query, topCollector);

    // slice out the requested page; deleted documents never appear in search results
    TopDocs topDocs = topCollector.topDocs(offset, count);

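Note: replace the text in [[ ]] with your own values. I ran this on a large index with 1.5 million entries and got 10 random results in less than a second. Admittedly it is slower, but at least you can ignore deleted documents if you need pagination.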

- andreyro

- bajafresh4life · Accepted Answer

55

IndexReader reader = // create IndexReader
for (int i=0; i<reader.maxDoc(); i++) {
    if (reader.isDeleted(i))
        continue;

    Document doc = reader.document(i);
    String docId = doc.get("docId");

    // do something with docId here...
}

- bajafresh4life

Without the isDeleted() check, you would output the IDs of documents that have already been deleted. - bajafresh4life

To complete the comment above: index changes are only committed when the index is reopened, so reader.isDeleted(i) is needed to make sure the document is valid. - Eugeniu Torica

1

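@Jenea, what is the equivalent way in Java to check whether a document has been deleted? I am looking for similar functionality... I don't want to consider documents that have already been deleted. - Shankar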

IndexReader.isDeleted() has been gone since 2010 (Git changeset 6a4bfc796fea6ed3474350adb271e06275d22e6a). It definitely does not exist in Lucene 4.x. - Vlad