Unable to search for all generated terms in a Lucene index

Question

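I am using a custom analyzer for indexing and searching code. Given the text "will wi-fi work", the following tokens are generated ('will' is a stop word and is dropped):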

wi-fi {position:2 start:5 end:10}
wifi {position:2 start:5 end:10}
wi {position:2 start:5 end:7}
fi {position:2 start:8 end:10}
work {position:3 start:11 end:15}

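When I search for wi-fi or work, I get results. But when I search for wifi, wi, or fi, nothing is returned, whether as a phrase query or a non-phrase query. Is something wrong with the generated tokens?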
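For reference, the token listing above can be reproduced with a small standalone sketch. The class `HyphenVariants` and its `expand` method are hypothetical names used only for illustration; this is not the asker's `normalizeHyphens` implementation, which is not shown in the question, and it does not use any Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration: expands a hyphenated term into the variants
// listed in the question, with character offsets into the source text.
class HyphenVariants {
    static final class Variant {
        final String term;
        final int start;
        final int end;

        Variant(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return term + " {start:" + start + " end:" + end + "}";
        }
    }

    static List<Variant> expand(String term, int startOffset) {
        List<Variant> out = new ArrayList<>();
        int end = startOffset + term.length();
        out.add(new Variant(term, startOffset, end));      // original term
        if (term.indexOf('-') < 0) {
            return out;                                     // nothing to expand
        }
        // Joined form keeps the offsets of the whole hyphenated term.
        out.add(new Variant(term.replace("-", ""), startOffset, end));
        // Each hyphen-separated part gets its own offsets.
        int partStart = 0;
        for (int i = 0; i <= term.length(); i++) {
            if (i == term.length() || term.charAt(i) == '-') {
                if (i > partStart) {
                    out.add(new Variant(term.substring(partStart, i),
                            startOffset + partStart, startOffset + i));
                }
                partStart = i + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "wi-fi" starts at offset 5 in "will wi-fi work".
        for (Variant v : expand("wi-fi", 5)) {
            System.out.println(v);
        }
    }
}
```

If the index really contains these four variants as exact terms, a search for wifi, wi, or fi should match, which is why the missing results point to a problem in how the terms are written rather than in which terms are generated.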

Parsed search queries:

For wi-fi (works fine):

Lucene's: +matchAllDocs:true +(alltext:wi-fi alltext:wifi alltext:wi alltext:fi)

For wifi (no results returned):

Lucene's: +matchAllDocs:true +alltext:wifi

For "will wi-fi work" (works fine):

Lucene's: +matchAllDocs:true +alltext:"(wi-fi wifi wi fi) work"

For "will wifi work" (no results returned):

Lucene's: +matchAllDocs:true +alltext:"? wifi work"
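The "?" in the last phrase query looks like a placeholder for the position left empty by the removed stop word 'will'. The sketch below shows how per-token position increments would add up to the positions in the token listing. The increment values are assumptions inferred from that listing (2 for wi-fi because 'will' was skipped, 0 for the stacked variants, 1 for work), and `PositionDemo` is a hypothetical name, not part of Lucene.

```java
// Hypothetical illustration of position increments; no Lucene dependency.
class PositionDemo {
    // Turns per-token position increments into absolute positions,
    // counting from 1 as the token listing in the question does.
    static int[] absolutePositions(int[] increments) {
        int[] positions = new int[increments.length];
        int current = 0;
        for (int i = 0; i < increments.length; i++) {
            current += increments[i];
            positions[i] = current;
        }
        return positions;
    }

    public static void main(String[] args) {
        String[] terms = {"wi-fi", "wifi", "wi", "fi", "work"};
        // Assumed increments: the stop word leaves a gap, variants stack at increment 0.
        int[] increments = {2, 0, 0, 0, 1};
        int[] positions = absolutePositions(increments);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " {position:" + positions[i] + "}");
        }
    }
}
```

Under these assumptions wifi sits at position 2 and work at position 3, so the phrase "? wifi work" is positionally consistent with the index and should match once the term text itself is indexed correctly.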

Update

Found the problem:

public boolean incrementToken() throws IOException
{
    /*
     * first return all tokens in the list
     */
    if (tokens.size() > 0)
    {
        Token top = tokens.removeFirst();
        restoreState(current);
        // Corrected line: copy only the first top.length() chars of the buffer.
        termAtt.setEmpty().append(new String(top.buffer(), 0, top.length()));
        offsetAtt.setOffset(top.startOffset(), top.endOffset());
        posIncrAtt.setPositionIncrement(0);
        return true;
    }

    /*
     * if there are no more incoming tokens return false
     */
    if (!input.incrementToken())
        return false;

    Token wrapper = new Token();
    wrapper.copyBuffer(termAtt.buffer(), 0, termAtt.length());
    wrapper.setStartOffset(offsetAtt.startOffset());
    wrapper.setEndOffset(offsetAtt.endOffset());
    wrapper.setPositionIncrement(posIncrAtt.getPositionIncrement());

    normalizeHyphens(wrapper);
    current = captureState();
    return true;
}

On the marked line above (the termAtt.setEmpty().append(...) call), I originally had:

termAtt.setEmpty().append(new String(top.buffer()));

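With that version, searching for wi returned nothing, but wi* did return results. It looks like top.buffer() contains extra garbage beyond top.length(), which causes the strange behavior.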
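The effect can be shown without Lucene. The following is a minimal sketch, assuming the term buffer is an over-allocated char array in which only the first `length` characters are valid; `TermBufferDemo` and its methods are hypothetical names for illustration.

```java
// Hypothetical demo; it only mimics a reusable, over-allocated term buffer.
class TermBufferDemo {
    // Buggy: converts the whole backing array, including chars past the valid length.
    static String wholeBuffer(char[] buffer) {
        return new String(buffer);
    }

    // Fixed: converts only the first `length` chars, as in the corrected filter.
    static String validPrefix(char[] buffer, int length) {
        return new String(buffer, 0, length);
    }

    public static void main(String[] args) {
        char[] buffer = new char[10];      // over-allocated, filled with '\u0000'
        "wi".getChars(0, 2, buffer, 0);    // only the first 2 chars are meaningful
        int length = 2;

        String buggy = wholeBuffer(buffer);
        String fixed = validPrefix(buffer, length);

        System.out.println(buggy.length());         // 10
        System.out.println(fixed.length());         // 2
        System.out.println(buggy.equals("wi"));     // false: exact term lookup misses
        System.out.println(buggy.startsWith("wi")); // true: a prefix query like wi* still hits
        System.out.println(fixed.equals("wi"));     // true
    }
}
```

The indexed term was effectively "wi" followed by trailing junk characters, which is consistent with the exact query failing while the wildcard query succeeded.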

Wasted a whole day :(

- naresh

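Can you see what the parsed query looks like? I think you can use its toString to inspect it. Also, have you double-checked which terms are actually in the index? - Marko Topolnik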

Could you post the debug output? That would let us see how the query is parsed. http://wiki.apache.org/solr/CommonQueryParameters#debugQuery - jpountz

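OK, so your problem occurs at index time. As far as I understand, you have your own analyzer that is supposed to be active at index time, so you may simply have misconfigured the indexing process so that your custom analyzer is never applied. By the way, did you add the matchAllDocs field yourself, or does Solr do that automatically? I ask because MatchAllDocsQuery should already serve that purpose. - Marko Topolnik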

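@MarkoTopolnik, you are right about MatchAllDocsQuery. I have corrected my analyzer (one of the token filters had been commented out) and updated the parsed queries in the question. I can now see the terms in the index, but the problem persists. http://imagebin.org/207419 - naresh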

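This is getting strange; it should return results now. The only thing I can think of is that your search process may be using a stale IndexSearcher that still does not see the changes, whereas Luke sees the correct state of the index. Sorry, I am out of ideas. - Marko Topolnik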

1 Answer

- user349026 · Accepted Answer

Just guessing, since I don't know your analyzer or parser.

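Make sure the words you search for are not stop words. The stop-word list file is probably the place to check.
Faceted search / boosted search: make sure you have not messed these up.
After parsing/analysis, make sure you actually get the tokenized terms you intend to search for.
Make sure your terms are actually being pushed into the index.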