How do I find which sentences have the most words in common?

Question


3

Suppose I have a paragraph that I split into sentences with sent_tokenize:

variable = ['By the 1870s the scientific community and much of the general public had accepted evolution as a fact.',
    'However, many favoured competing explanations and it was not until the emergence of the modern evolutionary synthesis from the 1930s to the 1950s that a broad consensus developed in which natural selection was the basic mechanism of evolution.',
    'Darwin published his theory of evolution with compelling evidence in his 1859 book On the Origin of Species, overcoming scientific rejection of earlier concepts of transmutation of species.']

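Now I split each sentence into words and append the result to some variable. How can I find the two sentences that share the most words? I am not sure how to approach this. If I have 10 sentences, I would have to do 90 checks (between every pair of sentences). Thanks.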

- user2878953

It is actually 45 checks, not 90. Since order doesn't matter, you can divide by 2. - alexis
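
The count in the comment above is easy to verify. A minimal sketch using only the standard library (the variable names are illustrative, and `math.comb` assumes Python 3.8 or later):

```python
import itertools
import math

n = 10  # number of sentences, as in the question

# Ordered pairs (a, b) with a != b: this is the 90 from the question.
ordered_pairs = len(list(itertools.permutations(range(n), 2)))

# Unordered pairs {a, b}: the overlap of a with b equals the overlap
# of b with a, so each pair only needs to be checked once.
unordered_pairs = len(list(itertools.combinations(range(n), 2)))

print(ordered_pairs, unordered_pairs, math.comb(n, 2))
```

In general, n sentences need n * (n - 1) / 2 comparisons.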

2 Answers

1

import itertools

sentences = ["There is no subtle meaning in this.", "Don't analyze this!", "What is this sentence?"]
# Pair each sentence's index with its set of words (punctuation stripped from the sentence's ends).
decomposedsentences = [(index, set(sentence.strip(".?!,").split(" "))) for index, sentence in enumerate(sentences)]
# Pick the pair of sentences whose word sets overlap the most.
s1, s2 = max(itertools.combinations(decomposedsentences, 2), key=lambda pair: len(pair[0][1] & pair[1][1]))
print("The two sentences with the most words in common:", sentences[s1[0]], sentences[s2[0]])

- Ramchandra Apte


- veiset · Accepted Answer

You can use Python's set intersection. Say you have the following three sentences:

a = "a b c d"
b = "a c x y"
c = "a q v"

You can count the words two sentences have in common like this:

sameWords = set.intersection(set(a.split(" ")), set(c.split(" ")))
numberOfWords = len(sameWords)
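
One caveat when applying this to real sentences such as those in the question: splitting on spaces keeps punctuation and capitalisation, so "this!" and "this" count as different words. A minimal sketch of one possible normalisation follows; the helper name `words` and the regular expression are assumptions for illustration, not part of the original answer:

```python
import re

def words(sentence):
    # Hypothetical helper: lowercase the text and keep runs of letters,
    # digits and apostrophes, dropping surrounding punctuation.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

first = "Don't analyze this!"
second = "What is this sentence?"

# Plain split: "this!" and "this" do not match.
naive = set(first.split(" ")) & set(second.split(" "))

# Normalised: both sentences contain "this".
normalized = words(first) & words(second)

print(naive, normalized)
```

Whether to lowercase or drop common words like "the" depends on what "common words" should mean for your data.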

With this set intersection approach, you can iterate over the list of sentences and find the two that share the most words. That gives us:

sentences = ["a b c d", "a d e f", "c x y", "a b c d x"]

def similar(s1, s2):
    sameWords = set.intersection(set(s1.split(" ")), set(s2.split(" ")))
    return len(sameWords)

currentSimilar = 0
s1 = ""
s2 = ""

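# Compare each unordered pair exactly once; slicing by index avoids
# comparing a sentence with itself and does not rely on object identity.
for i, sentence in enumerate(sentences):
    for sentence2 in sentences[i + 1:]:
        similarity = similar(sentence, sentence2)
        if similarity > currentSimilar:
            s1 = sentence
            s2 = sentence2
            currentSimilar = similarity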

print(s1, s2)

If performance is a concern, there may be dynamic programming solutions to this problem.
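
One alternative worth trying when most sentence pairs share no words is an inverted index, which only touches pairs that have at least one word in common. Note that this is a different technique from dynamic programming, and the function name `most_similar_pair` is made up for this sketch. In the worst case (every sentence shares a word with every other) it is still quadratic.

```python
from collections import Counter, defaultdict
from itertools import combinations

def most_similar_pair(sentences):
    """Return (i, j, count) for the two sentences sharing the most distinct words."""
    # Map each word to the indices of the sentences that contain it.
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in set(sentence.split()):
            index[word].add(i)

    # Each word found in two sentences adds one to that pair's count.
    shared = Counter()
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1

    if not shared:
        return None  # no two sentences have a word in common
    (i, j), count = shared.most_common(1)[0]
    return i, j, count

sentences = ["a b c d", "a d e f", "c x y", "a b c d x"]
print(most_similar_pair(sentences))  # (0, 3, 4)
```

The result gives the positions of the two sentences in the list and the number of words they share, so `sentences[0]` and `sentences[3]` are the best match here.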