Java: How to find the most probable string in a list of strings?

Question

6

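I have a list of strings in Java containing the first name of a person in different spellings (not entirely different). For example, John may be spelled Jon, Jawn, Jaun, etc. How should I retrieve the most appropriate string from this list? It would be very helpful if anyone could suggest how to use the Soundex method in this case.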

- jigsawmnc

5 Answers

4

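You will need to use an approximate string matching algorithm, and there are several strategies for this. Blur is a Trie-based Java implementation of approximate string matching based on the Levenshtein word distance.

Another implementation strategy is the Boyer-Moore approximate string matching algorithm.

The usual way to solve these problems with this algorithm and the Levenshtein word distance is to compare the input with each possible output and choose the one with the smallest distance to the desired output.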
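The "compare and pick the smallest distance" step described above can be sketched in plain Java. This is a minimal illustration, not the code from Blur: the class name `ClosestName`, the helper names, and the case-insensitive comparison are all assumptions made for the example.

```java
import java.util.List;
import java.util.Locale;

public class ClosestName {
    // Levenshtein distance via dynamic programming, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the candidate closest to the input (first one wins ties),
    // or null when the candidate list is empty.
    static String closest(String input, List<String> candidates) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        String needle = input.toLowerCase(Locale.ROOT);
        for (String candidate : candidates) {
            int d = levenshtein(needle, candidate.toLowerCase(Locale.ROOT));
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> known = List.of("John", "Mary", "Peter");
        System.out.println(closest("Jawn", known));
    }
}
```

A linear scan like this costs one distance computation per candidate, which is usually fine for short name lists; a trie becomes worthwhile when the dictionary is large.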

- Aravind.HU

1

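This article gives a detailed explanation and complete code for a Trie-based implementation of approximate string matching (the excerpts below are in Python): Fast and Easy Levenshtein distance using a Trie.

The search function returns a list of all words whose distance from the target word is at most the given maximum distance.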

def search( word, maxCost ):
    # `trie` is the root node of a trie built beforehand from the word list

    # build first row
    currentRow = range( len(word) + 1 )

    results = []

    # recursively search each branch of the trie
    for letter in trie.children:
        searchRecursive( trie.children[letter], letter, word, currentRow,
            results, maxCost )

    return results

This recursive helper is used by the search function above. It assumes that the previous row has already been filled in.

def searchRecursive( node, letter, word, previousRow, results, maxCost ):

    columns = len( word ) + 1
    currentRow = [ previousRow[0] + 1 ]

    # Build one row for the letter, with a column for each letter in the target
    # word, plus one for the empty string at column 0
    for column in range( 1, columns ):

        insertCost = currentRow[column - 1] + 1
        deleteCost = previousRow[column] + 1

        if word[column - 1] != letter:
            replaceCost = previousRow[ column - 1 ] + 1
        else:
            replaceCost = previousRow[ column - 1 ]

        currentRow.append( min( insertCost, deleteCost, replaceCost ) )

    # if the last entry in the row indicates the optimal cost is at most the
    # maximum cost, and there is a word in this trie node, then add it.
    if currentRow[-1] <= maxCost and node.word is not None:
        results.append( (node.word, currentRow[-1] ) )

    # if any entry in the row is at most the maximum cost, then
    # recursively search each branch of the trie
    if min( currentRow ) <= maxCost:
        for letter in node.children:
            searchRecursive( node.children[letter], letter, word, currentRow,
                results, maxCost )
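Since the question asks about Java, here is a rough Java port of the two functions above, offered as a sketch rather than the article's own code. The class name `TrieFuzzySearch`, the nested `Node` type, and the `insert` method are assumptions added so that the example is self-contained; it returns only the matching words, not the (word, cost) pairs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TrieFuzzySearch {
    // One trie node; `word` is non-null only where an inserted word ends.
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word;
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char ch : word.toCharArray()) {
            node = node.children.computeIfAbsent(ch, k -> new Node());
        }
        node.word = word;
    }

    // Returns every stored word whose Levenshtein distance to target
    // is at most maxCost (result order is unspecified).
    List<String> search(String target, int maxCost) {
        int[] firstRow = new int[target.length() + 1];
        for (int i = 0; i < firstRow.length; i++) {
            firstRow[i] = i;
        }
        List<String> results = new ArrayList<>();
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            searchRecursive(e.getValue(), e.getKey(), target, firstRow, results, maxCost);
        }
        return results;
    }

    private void searchRecursive(Node node, char letter, String target,
                                 int[] previousRow, List<String> results, int maxCost) {
        int columns = target.length() + 1;
        int[] currentRow = new int[columns];
        currentRow[0] = previousRow[0] + 1;
        int rowMin = currentRow[0];

        // Build one row of the distance matrix for this trie letter.
        for (int col = 1; col < columns; col++) {
            int insertCost = currentRow[col - 1] + 1;
            int deleteCost = previousRow[col] + 1;
            int replaceCost = previousRow[col - 1]
                    + (target.charAt(col - 1) == letter ? 0 : 1);
            currentRow[col] = Math.min(insertCost, Math.min(deleteCost, replaceCost));
            rowMin = Math.min(rowMin, currentRow[col]);
        }

        // A word ends here and its distance is within the limit.
        if (currentRow[columns - 1] <= maxCost && node.word != null) {
            results.add(node.word);
        }

        // Descend only while some prefix alignment can still stay within maxCost.
        if (rowMin <= maxCost) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                searchRecursive(e.getValue(), e.getKey(), target, currentRow, results, maxCost);
            }
        }
    }
}
```

Because words sharing a prefix share trie nodes, each matrix row is computed once per prefix instead of once per word, and whole branches are skipped as soon as every entry in a row exceeds `maxCost`.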

- jeffery.yuan

1

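There are many theories and methods for estimating how well 2 strings match.

Giving a blunt true/false result seems strange, since "jon" really does not equal "john"; it is close, but it does not match.

A great academic work that implements quite a few estimation methods is called "SecondString.jar" - site link

Most of the implemented methods give the match a score, and that score depends on the method used.

Example: let's define "edit distance" as the number of characters that must be changed in str1 to get to str2. In this case "jon" --> "john" requires adding 1 character, so naturally, for this method a lower score is better.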

- Jaakov Jalink

1

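Solr can do this if you use the phonetic filter factory while indexing the text.

Solr specializes in search, including searching for similar-sounding words. However, if you only want this feature and none of the others Solr offers, you can use the source code available here.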
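To show what matching "similar-sounding words" means in practice, here is a minimal sketch of the classic American Soundex encoding, which the question asks about. It is not Solr's code; the class name `SimpleSoundex` and the `encode` method are made up for this example. Apache Commons Codec also ships a ready-made `org.apache.commons.codec.language.Soundex` class if a dependency is acceptable.

```java
import java.util.Locale;

public class SimpleSoundex {
    // Soundex digit for each letter A..Z; '0' marks uncoded letters
    // (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not a basic Latin letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                lastCode = code;
            } else if (raw == 'H' || raw == 'W') {
                continue; // H and W do not separate letters with the same code
            } else {
                if (code != '0' && code != lastCode) {
                    out.append(code);
                }
                lastCode = code; // a vowel resets this, allowing a repeat
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"John", "Jon", "Jawn", "Jaun"}) {
            System.out.println(s + " -> " + encode(s));
        }
    }
}
```

All four spellings from the question map to the same code, so grouping the list by Soundex code and then picking, for instance, the most frequent spelling in each group is one possible way to choose a representative.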

- Zasz

Content provided by Stack Overflow.

- Dipen Jogi · Accepted Answer

There is a jar file for matching approximate strings.

Go to the link and download the frej.jar file.

http://sourceforge.net/projects/frej/files/

The jar file contains this method:

Fuzzy.equals("jon","john");

For approximate strings of this kind it returns true.