How can I use difflib to highlight only word errors?

Question

3

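I'm trying to compare the output of a speech-to-text API against the ground-truth transcription. What I want to do is mark, in the ground truth, the words that the speech-to-text API missed or misinterpreted.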

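For example:

Ground truth: The quick brown fox jumps over the lazy dog.
Speech-to-text output: the quick brown box jumps over the dog
Desired result: The quick brown FOX jumps over the LAZY dog.

My first thought was to strip capitalization and punctuation from the ground truth and then use difflib. That gives me an accurate diff, but I can't map the output back to positions in the original text. Even though I'm only interested in word errors, I'd like to keep the ground truth's capitalization and punctuation when displaying the results.

Is there a way to express difflib's output as word-level changes on the original text?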

- user13969403

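Thanks for the replies, everyone! I think my problem isn't getting the diff, but mapping the diff onto the ground-truth version with its capitalization and punctuation, so that the results can be displayed in a nice format. Apologies if I wasn't clear. - user13969403

Hi, did you see my answer? I think your example is a bit too simple, because it doesn't account for missing words or other problems you might run into. - Pitto

So, unless I'm missing something (which is possible), the output doesn't quite fit my needs, because it doesn't preserve the capitalization and punctuation of the original text. That said, I realize my solution isn't the best: it works in simple, straightforward cases, but I haven't tested it much on stranger ones. - user13969403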

3 Answers

0

Why not split the sentences into words and then use difflib on those words?

import difflib

truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip(
    '.').split()

speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()

for d in difflib.ndiff(truth, speech):
    print(d)
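
As a small follow-up sketch (not part of the original answer): each item yielded by `difflib.ndiff` starts with a two-character code, so the words the speech output missed or got wrong can be collected by filtering on the `'- '` prefix. The variable name `missed` is only illustrative.

```python
import difflib

truth = 'The quick brown fox jumps over the lazy dog.'.lower().strip('.').split()
speech = 'the quick brown box jumps over the dog'.lower().strip('.').split()

# ndiff prefixes every item with a two-character code:
#   '- ' present only in truth, '+ ' present only in speech,
#   '  ' present in both, '? ' hint lines for near-identical items.
# Keeping the '- ' items gives the ground-truth words that were missed or misheard.
missed = [d[2:] for d in difflib.ndiff(truth, speech) if d.startswith('- ')]
print(missed)
```

This still works on the lowercased, punctuation-free words only, so it does not by itself map the result back onto the original text.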

- user5386938

0

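So I think I've solved this. I realized that difflib's context_diff provides the indices of the lines that changed. To get the indices into the ground-truth text, I remove capitalization and punctuation, split the text into individual words, and then do the following: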


import difflib

# transformed_ground_truth and transformed_hypothesis are the lowercased,
# punctuation-free word lists described above.
altered_word_indices = []
diff = difflib.context_diff(transformed_ground_truth, transformed_hypothesis, n=0)
for line in diff:
    # Hunk headers for the ground-truth side look like '*** 4 ****' or '*** 4,6 ****'.
    if line.startswith('*** ') and line.endswith(' ****\n'):
        line = line.strip('* \n')
        if ',' in line:
            # A 1-based, inclusive range 'start,end': record every index in it.
            start, end = (int(part) for part in line.split(','))
            altered_word_indices.extend(range(start - 1, end))
        else:
            # A single 1-based position.
            altered_word_indices.append(int(line) - 1)

After that, I print the text with the altered words in uppercase:

split_ground_truth = ground_truth.split(' ')
for i in range(0, len(split_ground_truth)):
    if i in altered_word_indices:
        print(split_ground_truth[i].upper(), end=' ')
    else:
        print(split_ground_truth[i], end=' ')

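This lets me print "The quick brown FOX jumps over the LAZY dog." (capitalization and punctuation included) instead of "the quick brown FOX jumps over the LAZY dog".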

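It's not a very elegant solution, and it still needs testing, cleanup, error handling, and so on. But it seems like a decent starting point and may be useful to others who run into the same problem. I'll leave this question open for a few days in case someone comes up with a better way to get the same result.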
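
One possible cleanup, sketched here as an assumption rather than as the method used above: `difflib.SequenceMatcher.get_opcodes()` returns the changed index ranges directly, so the context_diff headers do not need to be parsed. The `normalize` helper and the variable names are hypothetical; the helper keeps exactly one entry per whitespace-separated token so that indices line up with `ground_truth.split()`.

```python
import difflib
import string

ground_truth = 'The quick brown fox jumps over the lazy dog.'
hypothesis = 'the quick brown box jumps over the dog'


def normalize(text):
    # Hypothetical helper: lowercase each whitespace-separated token and strip
    # punctuation from its ends, keeping one entry per token.
    return [word.strip(string.punctuation).lower() for word in text.split()]


original_words = ground_truth.split()
matcher = difflib.SequenceMatcher(
    None, normalize(ground_truth), normalize(hypothesis), autojunk=False)

# Collect ground-truth indices that were replaced or deleted in the hypothesis.
# 'insert' opcodes are skipped: they have no ground-truth word to mark.
altered = set()
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag in ('replace', 'delete'):
        altered.update(range(i1, i2))

highlighted = ' '.join(
    word.upper() if i in altered else word
    for i, word in enumerate(original_words))
print(highlighted)
```

Because the diff runs on the normalized words while the display uses the untouched tokens, capitalization and punctuation survive. A token that is pure punctuation (such as a standalone dash) would normalize to an empty string and take part in the diff, which may or may not be what you want.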

- user13969403

- Pitto · Accepted Answer

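I'd also suggest solving this with difflib, but I prefer to use a regex for word detection, since it is more precise and more tolerant of odd characters and other issues.

I've added some odd text to your original strings to show what I mean: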

import re
import difflib

truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'

truth = re.findall(r"[\w']+", truth.lower())
speech = re.findall(r"[\w']+", speech.lower())

for d in difflib.ndiff(truth, speech):
    print(d)

Output

  the
  quick
  brown
- fox
+ box
  jumps
  over
  the
- lazy
  dog

Another possible output:

diff = difflib.unified_diff(truth, speech)
print(''.join(diff))

Output

---
+++
@@ -1,9 +1,8 @@
 the quick brown-fox+box jumps over the-lazy dog
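
To address the follow-up about keeping the original capitalization and punctuation, here is a minimal sketch that builds on the same regex. It assumes that `re.finditer` is used on the ground truth so each word keeps its character span, and that `SequenceMatcher.get_opcodes()` supplies the changed word indices. The names `word_re`, `pieces`, and `highlighted` are illustrative only.

```python
import difflib
import re

truth = 'The quick! brown - fox jumps, over the lazy dog.'
speech = 'the quick... brown box jumps. over the dog'

word_re = re.compile(r"[\w']+")

# Keep the match objects for the ground truth so each word's span is known.
truth_matches = list(word_re.finditer(truth))
truth_words = [m.group().lower() for m in truth_matches]
speech_words = [w.lower() for w in word_re.findall(speech)]

matcher = difflib.SequenceMatcher(None, truth_words, speech_words, autojunk=False)

# Rebuild the original string, uppercasing only the words that were replaced
# or deleted; everything between words is copied through untouched.
pieces = []
cursor = 0
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag not in ('replace', 'delete'):
        continue
    for m in truth_matches[i1:i2]:
        pieces.append(truth[cursor:m.start()])
        pieces.append(m.group().upper())
        cursor = m.end()
pieces.append(truth[cursor:])

highlighted = ''.join(pieces)
print(highlighted)
```

Uppercasing is just one way to mark a word; the same spans could be wrapped in markup or ANSI color codes instead.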