How to improve Lucene.net indexing speed

Question

c# performance lucene lucene.net full-text-indexing

3

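I am using Lucene.net to index my PDF files. It takes about 40 minutes to index 15,000 PDF files, and the indexing time grows as the number of PDF files in my folder increases.

How can I improve the indexing speed of Lucene.net?
Is there any other indexing service with fast indexing performance?

I am using the latest version of Lucene.net (3.0.3).

Below is my indexing code.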

public void refreshIndexes() 
        {
            // Create Index Writer
            string strIndexDir = @"E:\LuceneTest\index";
            IndexWriter writer = new IndexWriter(Lucene.Net.Store.FSDirectory.Open(new System.IO.DirectoryInfo(strIndexDir)), new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29), true, IndexWriter.MaxFieldLength.UNLIMITED);

            // Find all files in root folder create index on them
            List<string> lstFiles = searchFiles(@"E:\LuceneTest\PDFs");
            foreach (string strFile in lstFiles)
            {
                Document doc = new Document();
                string FileName = System.IO.Path.GetFileNameWithoutExtension(strFile);
                string Text = ExtractTextFromPdf(strFile);
                string Path = strFile;
                string ModifiedDate = Convert.ToString(File.GetLastWriteTime(strFile));
                string DocumentType = string.Empty;
                string Vault = string.Empty;

                string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150);
                foreach (var docs in ltDocumentTypes)
                {
                    if (headerText.ToUpper().Contains(docs.searchText.ToUpper()))
                    {
                        DocumentType = docs.DocumentType;
                        Vault = docs.VaultName;
                    }
                }

                if (string.IsNullOrEmpty(DocumentType))
                {
                    DocumentType = "Default";
                    Vault = "Default";
                }

                doc.Add(new Field("filename", FileName, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("text", Text, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("path", Path, Field.Store.YES, Field.Index.NOT_ANALYZED));
                doc.Add(new Field("modifieddate", ModifiedDate, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("documenttype", DocumentType, Field.Store.YES, Field.Index.ANALYZED));
                doc.Add(new Field("vault", Vault, Field.Store.YES, Field.Index.ANALYZED));

                writer.AddDocument(doc);
            }
            writer.Optimize();
            writer.Dispose();
        }

- Munavvar

Do you really need to call writer.Optimize()? Wouldn't writer.Commit() be enough? - sisve

Thanks for the reply, @SimonSvensson. Optimize() is not required. I tried Commit(), but performance did not improve. - Munavvar

1

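@Munavvar, before proposing any changes: have you tried adding some benchmarks around the relevant methods? I am particularly interested in searchFiles and ExtractTextFromPdf. I think the problem may lie in the latter, because your code looks fine (apart from the date, which should not be analyzed). Also, how large are your PDF files? You could limit indexing and analysis to a relevant number of characters. - AR1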

1 Answer

- AndyPook · Accepted Answer

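The indexing part looks OK. Note that IndexWriter is thread-safe, so using Parallel.ForEach (with MaxDegreeOfParallelism set to the number of cores; experiment with different values) will probably help on a multi-core machine.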
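
A minimal sketch of such a parallel loop, under stated assumptions: `ProcessFile` is a hypothetical stand-in for the per-file work (in the question's code that would be `ExtractTextFromPdf`, building the `Document`, and calling `writer.AddDocument(doc)` on the one shared `IndexWriter`), and the class and method names are invented for illustration. The .NET option that caps concurrency is `ParallelOptions.MaxDegreeOfParallelism`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ParallelIndexingSketch
{
    // Hypothetical stand-in for the per-file work. In the question's code this
    // would extract the PDF text, build the Document, and call
    // writer.AddDocument(doc) on the single shared IndexWriter.
    private static string ProcessFile(string path)
    {
        return path.ToUpperInvariant();
    }

    public static List<string> ProcessAll(IEnumerable<string> files)
    {
        var options = new ParallelOptions
        {
            // Start with one worker per core, then experiment with other values.
            MaxDegreeOfParallelism = Environment.ProcessorCount
        };

        // Thread-safe collection; results arrive in no particular order.
        var results = new ConcurrentBag<string>();

        Parallel.ForEach(files, options, file =>
        {
            // Everything declared inside the lambda is private to one iteration,
            // so only the shared collection needs to be thread-safe.
            results.Add(ProcessFile(file));
        });

        return new List<string>(results);
    }
}
```

This only helps if the per-file work is itself safe to run concurrently. Whether `ExtractTextFromPdf` is thread-safe depends on the PDF library behind it, which the question does not show. `Optimize()` (if kept) and `Dispose()` would still run once, after the parallel loop has finished.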

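However, you are driving the GC crazy in the document-type detection part. All those ToUpper() calls hurt: each call can allocate a new string, and they run for every document type on every file.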

Outside of the lstFiles loop, create a copy of ltDocumentTypes with searchText already in upper case, keeping the DocumentType and VaultName values that the match needs:
```
var upperDocTypes = ltDocumentTypes
    .Select(x => new { SearchText = x.searchText.ToUpper(), x.DocumentType, x.VaultName })
    .ToList();
```
Inside the lstFiles loop, but outside of the doc types loop, create another string:
```
string headerTextUpper = headerText.ToUpper();
```

When it finds a match, "break". This exits the loop as soon as a match is found and skips all the remaining iterations. Of course, this means the first match wins, whereas your code keeps the last match (if that makes a difference to you).

    string headerText = Text.Substring(0, Text.Length < 150 ? Text.Length : 150);
    string headerTextUpper = headerText.ToUpper();
    foreach (var docs in upperDocTypes)
    {
        if (headerTextUpper.Contains(docs.SearchText))
        {
            DocumentType = docs.DocumentType;
            Vault = docs.VaultName;
            break;
        }
    }
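
A different technique with the same goal, shown only as a sketch and not what the answer itself proposes: `String.IndexOf` with `StringComparison.OrdinalIgnoreCase` compares case-insensitively without building any upper-cased copies, and its `startIndex`/`count` overload also makes the `Substring` call unnecessary. `DocTypeRule` is a hypothetical shape for the entries of `ltDocumentTypes`, whose real type is not shown in the question, and `DocTypeMatcher` is an invented name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical shape of one ltDocumentTypes entry (assumption: the question
// does not show the real type, only these three member names).
public sealed class DocTypeRule
{
    public string searchText;
    public string DocumentType;
    public string VaultName;
}

public static class DocTypeMatcher
{
    // Returns the first rule whose searchText occurs, ignoring case, within the
    // first 150 characters of text, or null when no rule matches.
    public static DocTypeRule FindFirstMatch(string text, IEnumerable<DocTypeRule> rules)
    {
        int headerLength = Math.Min(text.Length, 150);
        foreach (var rule in rules)
        {
            // Searches only text[0 .. headerLength); no temporary strings are created.
            if (text.IndexOf(rule.searchText, 0, headerLength, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return rule;
            }
        }
        return null;
    }
}
```

Note that `OrdinalIgnoreCase` is culture-independent, while `ToUpper()` uses the current culture, so the two can disagree for a few characters in some cultures. A `null` result corresponds to the "Default" document type and vault in the question's code.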

Depending on the size of ltDocumentTypes, this may not give you much of an improvement.

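I would bet that ExtractTextFromPdf is the most expensive part. Running the code through a profiler, or adding some Stopwatch timings, will show you where the cost is.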
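
A rough sketch of the stopwatch approach, assuming the work can be split into two stages per file. `StageTimingSketch`, `Measure`, and the `extract` and `index` delegates are illustrative names, with the delegates standing in for `ExtractTextFromPdf` and for the Document-building plus `writer.AddDocument` code in the question.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

public static class StageTimingSketch
{
    // Runs both stages over every file and reports the total time spent in each.
    public static void Measure(
        IEnumerable<string> files,
        Func<string, string> extract,   // stand-in for ExtractTextFromPdf
        Action<string> index,           // stand-in for building and adding the Document
        out TimeSpan extractTime,
        out TimeSpan indexTime)
    {
        var extractTimer = new Stopwatch();
        var indexTimer = new Stopwatch();

        foreach (var file in files)
        {
            // Start() resumes a stopped Stopwatch, so elapsed time accumulates across files.
            extractTimer.Start();
            string text = extract(file);
            extractTimer.Stop();

            indexTimer.Start();
            index(text);
            indexTimer.Stop();
        }

        extractTime = extractTimer.Elapsed;
        indexTime = indexTimer.Elapsed;
    }
}
```

Comparing the two totals after a run over a few hundred files should be enough to show which stage dominates before changing anything else.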