Summary: I want to compare ɔ̃, ɛ̃, and ɑ̃ against ɔ, ɛ, and a, but in my text file ɔ̃, ɛ̃, and ɑ̃ are written as ɔ~, ɛ~, and a~.
I wrote a script that walks along the characters of two words simultaneously, comparing them to find the differing character pair. The words are equal in length (apart from the extra characters introduced by combining diacritics) and represent two French pronunciations that differ by exactly one phoneme.
The end goal is to filter a set of Anki cards down to only certain phoneme pairs, because the other pairs are too easy to tell apart. Each word pair corresponds to one Anki note.
For this I need to treat the nasal vowels ɔ̃, ɛ̃, and ɑ̃ as distinct from the other sounds, since they are only confusable with one another.
As written, the code treats an accented character as the base character plus a combining ~, i.e. as two characters. So if two words differ only in that one ends in an accented character and the other in the unaccented one, the script finds no difference up through the last letter, then discovers that one word is "shorter" than the other (the other still has a trailing ~) and throws an error when it tries to compare the extra character. That is a problem in itself, but if I could make the script read an accented character as a single unit, the two words would have the same length and the problem would disappear.
I don't want to replace the accented characters with their unaccented counterparts, because they represent different sounds.
I tried normalizing the Unicode to a "composed" form, e.g. unicodedata.normalize('NFKC', line), but it didn't change anything.
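(As a side note on why normalization has no effect here: Unicode defines no precomposed codepoints for ɔ̃, ɛ̃, or ɑ̃, so NFC/NFKC must leave them as a base character followed by the combining tilde U+0303. A minimal check, not part of the original script:)

```python
import unicodedata

s = "b\u0254\u0303"  # "bɔ̃": b + ɔ (U+0254) + combining tilde (U+0303)
for form in ("NFC", "NFKC"):
    t = unicodedata.normalize(form, s)
    # unchanged and still 3 codepoints: no precomposed "ɔ with tilde" exists
    print(form, t == s, len(t))
```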
Below is some of the output, including the line that throws the error. The print statements show each pair of words being compared and the characters being compared within them; the number is the index of that character within the word. The last two columns are what the script "thinks" the two characters are: it sees ɛ̃ and ɛ as the same. It also picks the wrong character pair when reporting a difference, and picking the right pair matters because I check it against a master list of allowed pairs.
0 alyʁ alɔʁ a a # this first word is done well
1 alyʁ alɔʁ l l
2 alyʁ alɔʁ y ɔ # it doesn't continue to compare the ʁ because it found the difference
...
0 ɑ̃bisjø ɑ̃bisjɔ̃ ɑ ɑ
1 ɑ̃bisjø ɑ̃bisjɔ̃ ̃ ̃ # the tildes are compared / treated separately
2 ɑ̃bisjø ɑ̃bisjɔ̃ b b
3 ɑ̃bisjø ɑ̃bisjɔ̃ i i
4 ɑ̃bisjø ɑ̃bisjɔ̃ s s
5 ɑ̃bisjø ɑ̃bisjɔ̃ j j
6 ɑ̃bisjø ɑ̃bisjɔ̃ ø ɔ # luckily that wasn't where the difference was, this is
...
0 osi ɛ̃si o ɛ # here it should report (o, ɛ̃), not (o, ɛ)
...
0 bɛ̃ bɔ̃ b b
1 bɛ̃ bɔ̃ ɛ ɔ # an error of this type
...
0 bo ba b b
1 bo ba o a # this is working correctly
...
0 bjɛ bjɛ̃ b b
1 bjɛ bjɛ̃ j j
2 bjɛ bjɛ̃ ɛ ɛ # AND here's the money, it thinks these are the same letter, but it has also run out of characters to compare from the first word, so it throws the error below
Traceback (most recent call last):
File "C:\Users\tchak\OneDrive\Desktop\French.py", line 42, in <module>
letter1 = line[0][index]
IndexError: string index out of range
Here is the code:
import unicodedata

def lens(word):
    return len(word)

# open file, and new file to write to
input_file = "./phonetics_input.txt"
output_file = "./phonetics_output.txt"

set1 = ["e", "ɛ", "œ", "ø", "ə"]
set2 = ["ø", "o", "œ", "ɔ", "ə"]
set3 = ["ə", "i", "y"]
set4 = ["u", "y", "ə"]
set5 = ["ɑ̃", "ɔ̃", "ɛ̃", "ə"]
set6 = ["a", "ə"]
vowelsets = [set1, set2, set3, set4, set5, set6]

with open(input_file, encoding="utf8") as ipf, open(output_file, "w", encoding="utf8") as opf:
    vowelpairs = []
    acceptedvowelpairs = []
    input_lines = ipf.readlines()
    print(len(input_lines))
    for line in input_lines:
        # find the word IPA transcripts
        line = unicodedata.normalize('NFKC', line)  # no effect: the nasal vowels have no precomposed forms
        line = line.split("/")
        line.sort(key=lens)
        line = line[0:2]  # the shortest two strings after splitting are the IPA words
        index = 0
        letter1 = line[0][index]
        letter2 = line[1][index]
        print(index, line[0], line[1], letter1, letter2)
        linelen = max(len(line[0]), len(line[1]))
        while letter1 == letter2:
            index += 1
            letter1 = line[0][index]  # throws the error here, technically, after printing the last characters and incrementing the index one more time
            letter2 = line[1][index]
            print(index, line[0], line[1], letter1, letter2)
        vowelpairs.append((letter1, letter2))
    for i in vowelpairs:
        for vowelset in vowelsets:
            if set(i).issubset(vowelset):
                acceptedvowelpairs.append(i)
    print(len(vowelpairs))
    print(len(acceptedvowelpairs))
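One way to make the accented characters read as single units, sketched here as a suggestion rather than the asker's code: group each base character with the combining marks that follow it (using unicodedata.combining(), which returns nonzero for combining characters), then compare grapheme by grapheme. The helper names graphemes and first_diff are my own.

```python
import unicodedata

def graphemes(s):
    """Split s so each base character carries its trailing combining marks."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch):
            out[-1] += ch  # attach U+0303 etc. to the preceding base character
        else:
            out.append(ch)
    return out

def first_diff(w1, w2):
    """Return the first differing grapheme pair, or None if there is none."""
    for a, b in zip(graphemes(w1), graphemes(w2)):
        if a != b:
            return (a, b)
    return None

print(graphemes("bjɛ̃"))          # ['b', 'j', 'ɛ̃'] – the nasal vowel is one unit
print(first_diff("osi", "ɛ̃si"))  # ('o', 'ɛ̃') – the pair the master list expects
```

Because the tilde now rides along with its vowel, bjɛ and bjɛ̃ compare as three graphemes each and the differing pair is (ɛ, ɛ̃); and since zip stops at the shorter sequence, the IndexError cannot occur even for words of unequal length.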
LingPy might help. If I remember correctly, it can segment IPA characters meaningfully, including handling diacritics, among other features. - lenz