

TL;DR: I want to compare ɔ̃, ɛ̃ and ɑ̃ with ɔ, ɛ and a, but in my text file ɔ̃, ɛ̃ and ɑ̃ are written as ɔ~, ɛ~ and a~.



To do this, I need to tell the nasal vowels ɔ̃, ɛ̃ and ɑ̃ apart from the other sounds, because they are only confused with each other.



I tried "normalizing" the Unicode to a "composed" form, e.g. unicodedata.normalize('NFKC', line), but it didn't change anything.


0 alyʁ alɔʁ a a # this first word is done well
1 alyʁ alɔʁ l l
2 alyʁ alɔʁ y ɔ # it doesn't continue to compare the ʁ because it found the difference
0 ɑ̃bisjø ɑ̃bisjɔ̃ ɑ ɑ
1 ɑ̃bisjø ɑ̃bisjɔ̃ ̃ ̃  # the tildes are compared / treated  separately
2 ɑ̃bisjø ɑ̃bisjɔ̃ b b
3 ɑ̃bisjø ɑ̃bisjɔ̃ i i
4 ɑ̃bisjø ɑ̃bisjɔ̃ s s
5 ɑ̃bisjø ɑ̃bisjɔ̃ j j
6 ɑ̃bisjø ɑ̃bisjɔ̃ ø ɔ # luckily that wasn't where the difference was, this is
0 osi ɛ̃si o ɛ # here it should report (o, ɛ̃), not (o, ɛ)
0 bɛ̃ bɔ̃ b b
1 bɛ̃ bɔ̃ ɛ ɔ # an error of this type
0 bo ba b b
1 bo ba o a # this is working correctly 
0 bjɛ bjɛ̃ b b
1 bjɛ bjɛ̃ j j
2 bjɛ bjɛ̃ ɛ ɛ # AND here's the money, it thinks these are the same letter, but it has also run out of characters to compare from the first word, so it throws the error below
Traceback (most recent call last):

  File "C:\Users\tchak\OneDrive\Desktop\French.py", line 42, in <module>
    letter1 = line[0][index]

IndexError: string index out of range


def lens(word):
    return len(word)

# open file, and new file to write to
input_file = "./phonetics_input.txt"
output_file = "./phonetics_output.txt"
set1 = ["e", "ɛ", "œ", "ø", "ə"]
set2 = ["ø", "o", "œ", "ɔ", "ə"]
set3 = ["ə", "i", "y"]
set4 = ["u", "y", "ə"]
set5 = ["ɑ̃", "ɔ̃", "ɛ̃", "ə"]
set6 = ["a", "ə"]
vowelsets = [set1, set2, set3, set4, set5, set6]
with open(input_file, encoding="utf8") as ipf, open(output_file, encoding="utf8") as opf:
    # for line in file; 
    vowelpairs= []
    acceptedvowelpairs = []
    input_lines = ipf.readlines()
    for line in input_lines:
        # find the IPA transcriptions of the words
        unicodedata.normalize('NFKC', line)
        line = line.split("/")
        line.sort(key = lens)
        line = line[0:2] # the shortest two strings after splitting are the ipa words
        index = 0
        letter1 = line[0][index]
        letter2 = line[1][index]
        print(index, line[0], line[1], letter1, letter2)
        linelen = max(len(line[0]), len(line[1]))
        while letter1 == letter2:
            index += 1
            letter1 = line[0][index] # throws the error here, technically, after printing the last characters and incrementing the index one more
            letter2 = line[1][index]
            print(index, line[0], line[1], letter1, letter2)
        vowelpairs.append((letter1, letter2))   
    for i in vowelpairs:
        for vowelset in vowelsets:
            if set(i).issubset(vowelset):

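I'm not entirely sure I understand what you mean, but I think LingPy might help. If I remember correctly, it can segment IPA characters meaningfully, including diacritics, among other features. - lenz
You could try Unidecode: https://pypi.org/project/Unidecode/ - Curtis
@Curtis Unidecode strips the accents, which is exactly what the original post says it does not want. - lenz
@lenz - Yes, but you can strip and compare :) - Curtis
I read the Unidecode information, and it looks like it substitutes the unaccented version. Can I specify a substitution with some letters that are not used in the text, so that I can still find where the accented characters were? - RukiyaMeria
I looked at some LingPy examples and I'm not sure how to find what I want there, but I'll keep it in mind. I also edited the question to clarify what I need; let me know if it is still unclear. - RukiyaMeria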


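Unicode normalization does not help for the particular character combinations described, because the Latin letters extracted from the Unicode database UnicodeData.txt with the simple regex "Latin.*Letter.*with tilde$" are ÃÑÕãñõĨĩŨũṼṽẼẽỸỹ (there is no Latin letter Open O, Open E or Alpha with tilde). So you need to iterate over the two strings separately while comparing them, as follows (your code above is omitted in favour of a minimal reproducible example):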

import unicodedata

def lens(word):
    return len(word)

input_lines = ['alyʁ/alɔʁ', 'ɑ̃bisjø/ɑ̃bisjɔ̃ ', 'osi/ɛ̃si', 'bɛ̃ /bɔ̃ ', 'bo/ba', 'bjɛ/bjɛ̃ ']
for line in input_lines:
    # find the IPA transcriptions of the words
    line = unicodedata.normalize('NFKC', line.rstrip('\n'))
    line = line.split("/")
    line.sort(key = lens)
    word1, word2 = line[0:2] # the shortest two strings after splitting are the ipa words
    index = i1 = i2 = 0
    while i1 < len(word1) and i2 < len(word2):
        letter1 = word1[i1]
        i1 += 1
        if i1 < len(word1) and unicodedata.category(word1[i1]) == 'Mn':
            letter1 += word1[i1]
            i1 += 1
        letter2 = word2[i2]
        i2 += 1
        if i2 < len(word2) and unicodedata.category(word2[i2]) == 'Mn':
            letter2 += word2[i2]
            i2 += 1
        same = chr(0xA0) if letter1 == letter2 else '#' 
        print(index, same, word1, word2, letter1, letter2)
        index += 1
        #if same != chr(0xA0):
        #    break

Output: .\SO\67335977.py


0   alyʁ alɔʁ a a
1   alyʁ alɔʁ l l
2 # alyʁ alɔʁ y ɔ
3   alyʁ alɔʁ ʁ ʁ

0   ɑ̃bisjø ɑ̃bisjɔ̃  ɑ̃ ɑ̃
1   ɑ̃bisjø ɑ̃bisjɔ̃  b b
2   ɑ̃bisjø ɑ̃bisjɔ̃  i i
3   ɑ̃bisjø ɑ̃bisjɔ̃  s s
4   ɑ̃bisjø ɑ̃bisjɔ̃  j j
5 # ɑ̃bisjø ɑ̃bisjɔ̃  ø ɔ̃

0 # osi ɛ̃si o ɛ̃
1   osi ɛ̃si s s
2   osi ɛ̃si i i

0   bɛ̃  bɔ̃  b b
1 # bɛ̃  bɔ̃  ɛ̃ ɔ̃
2   bɛ̃  bɔ̃

0   bo ba b b
1 # bo ba o a

0   bjɛ bjɛ̃  b b
1   bjɛ bjɛ̃  j j
2 # bjɛ bjɛ̃  ɛ ɛ̃
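
To see why normalization alone leaves these vowels untouched, here is a minimal sketch. The helper name `codepoints` is made up for illustration; everything else is the standard `unicodedata` module. It normalizes `a` + combining tilde, which has a precomposed form, and `ɔ` + combining tilde, which does not.

```python
import unicodedata

def codepoints(text):
    """Return the Unicode names of the code points in text (illustrative helper)."""
    return [unicodedata.name(ch) for ch in text]

composable = "a\u0303"           # a + COMBINING TILDE
not_composable = "\u0254\u0303"  # LATIN SMALL LETTER OPEN O + COMBINING TILDE

# normalize() returns a new string; it never changes its argument in place
nfc_a = unicodedata.normalize("NFC", composable)
nfc_o = unicodedata.normalize("NFC", not_composable)

print(len(nfc_a), codepoints(nfc_a))  # one code point after composition
print(len(nfc_o), codepoints(nfc_o))  # still two code points
```

Because `unicodedata.normalize` returns a new string, a bare call such as `unicodedata.normalize('NFKC', line)` has no effect unless the result is assigned back to `line`.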

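Note that the diacritic is detected by testing for Unicode category Mn; you could test for another condition instead (for example, one from the following list):

  • Mn Nonspacing_Mark: a nonspacing combining mark (zero advance width)
  • Mc Spacing_Mark: a spacing combining mark (positive advance width)
  • Me Enclosing_Mark: an enclosing combining mark
  • M Mark: Mn | Mc | Me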
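
As a variation on the loop above, here is a sketch that tests for the whole M group and splits each word into segments first. The function names `segments` and `first_difference` are hypothetical, and unlike the code above this version attaches any number of following marks to a base character, not just one.

```python
import unicodedata
from itertools import zip_longest

def segments(word):
    """Split word into base characters, each with the combining marks that follow it."""
    out = []
    for ch in word:
        # Categories Mn, Mc and Me all start with "M"
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def first_difference(word1, word2):
    """Return the first pair of differing segments, or None if the words match."""
    pairs = zip_longest(segments(word1), segments(word2), fillvalue="")
    for seg1, seg2 in pairs:
        if seg1 != seg2:
            return seg1, seg2
    return None

# \u025b is the open e; \u0303 is the combining tilde
print(segments("bj\u025b\u0303"))
print(first_difference("bj\u025b", "bj\u025b\u0303"))
print(first_difference("osi", "\u025b\u0303si"))
```

Padding the shorter word with empty strings via `zip_longest` means that words of unequal length report a difference instead of raising an IndexError.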



This is not an answer. - user1142217
