How can I compare characters with combining diacritics (ɔ̃, ɛ̃ and ɑ̃) to unaccented ones in Python (imported from a UTF-8 encoded text file)?

Question


python string utf-8 diacritics combining-marks

3

Summary: I want to compare ɔ̃, ɛ̃ and ɑ̃ against ɔ, ɛ and a and treat them as different, but in my text file ɔ̃, ɛ̃ and ɑ̃ are written as ɔ~, ɛ~ and a~.

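I wrote a script that walks along the characters of two words in parallel, comparing them to find the pair of characters that differs. The words are of equal length (apart from the extra character the diacritic introduces) and are the pronunciations of two French words that differ by a single phoneme.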

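The end goal is to filter a set of Anki cards so that it only includes certain phoneme pairs, because the other pairs are too easy to tell apart. Each word pair represents one Anki note.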

For this I need to distinguish the nasal sounds ɔ̃, ɛ̃ and ɑ̃ from the other sounds, since they are only confused with each other.

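As written, the code treats an accented character as a character plus a ~, that is, as two characters. So if the only difference between two words is a final accented character versus the same character without the accent, the script finds no difference at the last letter, then finds that one word is shorter than the other (the other still has a ~ left) and throws an error when it tries to compare one more character. That is a "problem" in its own right, but if I can get the accented characters read as single units, the two words will have the same length and it will go away.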

I do not want to replace the accented characters with unaccented ones, because they are different sounds.

I have tried "normalizing" the Unicode to a "combined" form, e.g. unicodedata.normalize('NFKC', line), but it didn't change anything.

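Below is some output, including the lines that lead to the error. Each printed line shows the two words being compared and the character from each word being compared; the number is the index of that character within the word. The last two entries are therefore what the script "thinks" the two characters are, and it sees ɛ̃ and ɛ as the same. It also picks the wrong pair of characters when it reports the difference, and picking the right pair matters because I compare it against a master list of allowable pairs.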

0 alyʁ alɔʁ a a # this first word is done well
1 alyʁ alɔʁ l l
2 alyʁ alɔʁ y ɔ # it doesn't continue to compare the ʁ because it found the difference
...
0 ɑ̃bisjø ɑ̃bisjɔ̃ ɑ ɑ
1 ɑ̃bisjø ɑ̃bisjɔ̃ ̃ ̃  # the tildes are compared / treated  separately
2 ɑ̃bisjø ɑ̃bisjɔ̃ b b
3 ɑ̃bisjø ɑ̃bisjɔ̃ i i
4 ɑ̃bisjø ɑ̃bisjɔ̃ s s
5 ɑ̃bisjø ɑ̃bisjɔ̃ j j
6 ɑ̃bisjø ɑ̃bisjɔ̃ ø ɔ # luckily that wasn't where the difference was, this is
...
0 osi ɛ̃si o ɛ # here it should report (o, ɛ̃), not (o, ɛ)
...
0 bɛ̃ bɔ̃ b b
1 bɛ̃ bɔ̃ ɛ ɔ # an error of this type
...
0 bo ba b b
1 bo ba o a # this is working correctly 
...
0 bjɛ bjɛ̃ b b
1 bjɛ bjɛ̃ j j
2 bjɛ bjɛ̃ ɛ ɛ # AND here's the money, it thinks these are the same letter, but it has also run out of characters to compare from the first word, so it throws the error below
Traceback (most recent call last):

  File "C:\Users\tchak\OneDrive\Desktop\French.py", line 42, in <module>
    letter1 = line[0][index]

IndexError: string index out of range

Here is the code:

def lens(word):
    return len(word)

# open file, and new file to write to
input_file = "./phonetics_input.txt"
output_file = "./phonetics_output.txt"
set1 = ["e", "ɛ", "œ", "ø", "ə"]
set2 = ["ø", "o", "œ", "ɔ", "ə"]
set3 = ["ə", "i", "y"]
set4 = ["u", "y", "ə"]
set5 = ["ɑ̃", "ɔ̃", "ɛ̃", "ə"]
set6 = ["a", "ə"]
vowelsets = [set1, set2, set3, set4, set5, set6]
with open(input_file, encoding="utf8") as ipf, open(output_file, encoding="utf8") as opf:
    # for line in file; 
    vowelpairs= []
    acceptedvowelpairs = []
    input_lines = ipf.readlines()
    print(len(input_lines))
    for line in input_lines:
        # find the words' IPA transcripts
        unicodedata.normalize('NFKC', line)
        line = line.split("/")
        line.sort(key = lens)
        line = line[0:2] # the shortest two strings after splitting are the ipa words
        index = 0
        letter1 = line[0][index]
        letter2 = line[1][index]
        print(index, line[0], line[1], letter1, letter2)
            
        linelen = max(len(line[0]), len(line[1]))
        while letter1 == letter2:
            index += 1
            letter1 = line[0][index] # throws the error here, technically, after printing the last characters and incrementing the index one more
            letter2 = line[1][index]
            print(index, line[0], line[1], letter1, letter2)
            
        vowelpairs.append((letter1, letter2))   
        
    for i in vowelpairs:
        for vowelset in vowelsets:
            if set(i).issubset(vowelset):
                acceptedvowelpairs.append(i)
    print(len(vowelpairs))
    print(len(acceptedvowelpairs))

- RukiyaMeria

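I'm not entirely sure I understand what you mean, but I think LingPy might help. If I remember correctly, it can segment IPA strings meaningfully, including diacritics, among other features. - lenz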

You could try Unidecode: https://pypi.org/project/Unidecode/ - Curtis

1

@Curtis Unidecode strips the accents, which is exactly what the original post says is not wanted. - lenz

@lenz - Yes, but you can strip and then compare :) - Curtis

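I read up on unidecode, and it looks like it substitutes the unaccented version. Can I specify a substitution with some letter that isn't used in the text, so that I can still find where the accented characters were? - RukiyaMeria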

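I looked at some LingPy examples and wasn't sure how to find what I want, but I'll keep it in mind. I also edited the question to clarify what I need; let me know if it's still unclear. - RukiyaMeria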

2 Answers

0

I'm working around this by doing a find-and-replace on these characters before processing, then a reverse find-and-replace once I'm done.
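
One way such a find-and-replace round trip might look is sketched below. This is an illustration only, not the script actually used: the private-use placeholder code points and the helper names `encode`, `decode` and `first_difference` are assumptions, and the sketch assumes the placeholders never occur in the input text.

```python
# Hypothetical placeholder mapping: each nasal vowel (base letter + U+0303
# COMBINING TILDE) is swapped for one private-use code point, so that it
# occupies a single position in the string while the words are compared.
NASALS = {
    "\u0251\u0303": "\uE000",  # ɑ + combining tilde
    "\u0254\u0303": "\uE001",  # ɔ + combining tilde
    "\u025B\u0303": "\uE002",  # ɛ + combining tilde
}
RESTORE = {placeholder: seq for seq, placeholder in NASALS.items()}


def encode(word):
    """Replace every nasal vowel sequence with its single-character placeholder."""
    for seq, placeholder in NASALS.items():
        word = word.replace(seq, placeholder)
    return word


def decode(text):
    """Reverse of encode: turn placeholders back into base letter + tilde."""
    for placeholder, seq in RESTORE.items():
        text = text.replace(placeholder, seq)
    return text


def first_difference(word1, word2):
    """Return the first differing pair of units, or None if none is found."""
    for a, b in zip(encode(word1), encode(word2)):
        if a != b:
            return decode(a), decode(b)
    return None


# bjɛ versus bjɛ̃: the pair reported is (ɛ, ɛ̃) rather than an IndexError.
print(first_difference("bj\u025B", "bj\u025B\u0303"))
```

Because the comparison runs on the encoded strings, both words have the same length again, and the reported pair is decoded back before it is checked against the list of allowed pairs.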

- RukiyaMeria

This is not an answer. - user1142217


- JosefZ · Accepted Answer

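Unicode normalization does not help for the particular character combinations described, because extracting Latin letters from the Unicode database file UnicodeData.txt with the simple regex "Latin.*Letter.*with tilde$" gives ÃÑÕãñõĨĩŨũṼṽẼẽỸỹ (no Latin letter Open O, Open E or Alpha). In other words, there is no precomposed code point for these letters with a tilde, so normalization has nothing to compose them into. You therefore need to step through the two compared strings separately, as follows (most of your code above is omitted in favor of a minimal reproducible example):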

import unicodedata

def lens(word):
    return len(word)

input_lines = ['alyʁ/alɔʁ', 'ɑ̃bisjø/ɑ̃bisjɔ̃ ', 'osi/ɛ̃si', 'bɛ̃ /bɔ̃ ', 'bo/ba', 'bjɛ/bjɛ̃ ']
print(len(input_lines))
for line in input_lines:
    print('')
    # find the words' IPA transcripts
    line = unicodedata.normalize('NFKC', line.rstrip('\n'))
    line = line.split("/")
    line.sort(key = lens)
    word1, word2 = line[0:2] # the shortest two strings after splitting are the ipa words
    index = i1 = i2 = 0
    while i1 < len(word1) and i2 < len(word2):
        letter1 = word1[i1]
        i1 += 1
        if i1 < len(word1) and unicodedata.category(word1[i1]) == 'Mn':
            letter1 += word1[i1]
            i1 += 1
        letter2 = word2[i2]
        i2 += 1
        if i2 < len(word2) and unicodedata.category(word2[i2]) == 'Mn':
            letter2 += word2[i2]
            i2 += 1
        same = chr(0xA0) if letter1 == letter2 else '#' 
        print(index, same, word1, word2, letter1, letter2)
        index += 1
        #if same != chr(0xA0):
        #    break

Output: .\SO\67335977.py

6

0   alyʁ alɔʁ a a
1   alyʁ alɔʁ l l
2 # alyʁ alɔʁ y ɔ
3   alyʁ alɔʁ ʁ ʁ

0   ɑ̃bisjø ɑ̃bisjɔ̃  ɑ̃ ɑ̃
1   ɑ̃bisjø ɑ̃bisjɔ̃  b b
2   ɑ̃bisjø ɑ̃bisjɔ̃  i i
3   ɑ̃bisjø ɑ̃bisjɔ̃  s s
4   ɑ̃bisjø ɑ̃bisjɔ̃  j j
5 # ɑ̃bisjø ɑ̃bisjɔ̃  ø ɔ̃

0 # osi ɛ̃si o ɛ̃
1   osi ɛ̃si s s
2   osi ɛ̃si i i

0   bɛ̃  bɔ̃  b b
1 # bɛ̃  bɔ̃  ɛ̃ ɔ̃
2   bɛ̃  bɔ̃

0   bo ba b b
1 # bo ba o a

0   bjɛ bjɛ̃  b b
1   bjɛ bjɛ̃  j j
2 # bjɛ bjɛ̃  ɛ ɛ̃
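
As a side check on the normalization point made at the top of this answer, the following minimal sketch compares a letter that does have a precomposed form (a with tilde) against one that does not (open o with tilde). The variable names are illustrative only.

```python
import unicodedata

a_tilde = "a\u0303"            # a + COMBINING TILDE
open_o_tilde = "\u0254\u0303"  # ɔ + COMBINING TILDE

# normalize() returns a new string; the argument itself is left unchanged.
composed_a = unicodedata.normalize("NFC", a_tilde)
composed_o = unicodedata.normalize("NFC", open_o_tilde)

print(len(a_tilde), len(composed_a))       # 2 1: composed into U+00E3
print(len(open_o_tilde), len(composed_o))  # 2 2: nothing to compose into
```

Since `normalize` returns its result instead of modifying the string in place, the return value has to be assigned for the call to have any effect, although for these three vowels the result would be unchanged either way.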

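Note: the diacritic is tested for as Unicode category Mn; you could test against another condition instead (e.g. one of those in the following list):

Mn Nonspacing_Mark: a nonspacing combining mark (zero advance width)
Mc Spacing_Mark: a spacing combining mark (positive advance width)
Me Enclosing_Mark: an enclosing combining mark
M Mark: Mn | Mc | Me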
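
Testing against the broader M group could be sketched as below. This is a variation on the loop above rather than the code that produced the output shown: the helper names `split_units` and `first_differing_units` are made up for illustration, and the sketch assumes that any run of mark characters belongs to the base character before it.

```python
import unicodedata


def split_units(word):
    """Group each base character with any combining marks that follow it."""
    units = []
    for ch in word:
        # Categories Mn, Mc and Me all start with "M".
        if units and unicodedata.category(ch).startswith("M"):
            units[-1] += ch   # attach the mark to the preceding base character
        else:
            units.append(ch)  # start a new unit
    return units


def first_differing_units(word1, word2):
    """Return the first pair of units that differ, or None if none is found."""
    for u1, u2 in zip(split_units(word1), split_units(word2)):
        if u1 != u2:
            return u1, u2
    return None


# ɑ̃bisjø versus ɑ̃bisjɔ̃: the differing pair is (ø, ɔ̃).
print(first_differing_units("\u0251\u0303bisj\u00F8", "\u0251\u0303bisj\u0254\u0303"))
```

Splitting each word into units first means a base letter carrying more than one mark is still treated as a single unit, and the resulting pairs can be compared directly with entries such as those in set5.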