How can I split Tamil characters in a string?
When I use preg_match_all('/./u', $str, $results), I get the characters "த", "ம", "ி", "ழ" and "்".
How can I get the combined characters "த", "மி" and "ழ்" instead?
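I think you should be able to use the grapheme_extract function to iterate over the combined characters (technically called "grapheme clusters").
Alternatively, if you prefer the regex approach, I think you can use this:
preg_match_all('/\pL\pM*|./u', $str, $results)
where \pL means a Unicode "letter" and \pM means a Unicode "mark".
(Disclaimer: I have not tested either of these approaches.)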
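For anyone doing this in Python rather than PHP, the same idea as the regex above (a letter followed by any number of marks, otherwise a single character) can be sketched with the standard unicodedata module, since Python's built-in re module does not support \pL or \pM. This is only a rough sketch: the name split_clusters is made up for illustration, and unlike the regex's "." it also keeps newline characters.

```python
import unicodedata


def split_clusters(text):
    """Group each letter with the combining marks that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        # A mark joins the previous cluster only if that cluster starts with a letter
        if (is_mark and clusters
                and unicodedata.category(clusters[-1][0]).startswith("L")):
            clusters[-1] += ch
        else:
            # Letters, digits, spaces and stray marks each start a new cluster
            clusters.append(ch)
    return clusters


print(split_clusters("தமிழ்"))  # ['த', 'மி', 'ழ்']
```

Like the regex, this relies only on Unicode general categories, so it is not a full grapheme cluster implementation.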
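The vowels (உயிர் எழுத்து), the aytham (ஆய்த எழுத்து - ஃ) and all of the combinations (உயிர்-மெய் எழுத்து) in the 'a' column (அ வரி - i.e. க, ச, ட, த, ப, ற, ங, ஞ, ண, ந, ம, ன, ய, ர, ள, வ, ழ, ல) each use a single code point.
Every pure consonant is made up of two code points: the a-combination letter + the pulli. E.g. ப் = ப + ்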
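Every combination other than the a-combinations is also made up of two code points: the a-combination letter + a marking. E.g. பி = ப + ி, தை = த + ை (linguistically these are ப் + இ and த் + ஐ, but they are encoded as the a-combination letter followed by the vowel sign).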
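The code point counts above are easy to check for yourself. Here is a small Python sketch that lists the code points of a few sample letters; the helper name describe is just something I made up for this example.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))
            for ch in text]


# One a-combination, one pure consonant and two other combinations
for sample in ["ப", "ப்", "பி", "தை"]:
    print(sample, describe(sample))
```

The a-combination ப comes back as a single entry, while ப், பி and தை each come back as two.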
So your logic can go like this:
initialize an empty array
for each codepoint in word:
    if the codepoint is a vowel, a-combination or aytham, it is also its grapheme, so add it to the array
    otherwise, the codepoint is a marking such as the pulli (i.e. ்) or one of the combination extensions (e.g. ி or ை), so append it to the end of the last element of the array
Here is that logic in Python (TamilWord and TamilLetter are helper classes that are not shown here):
@staticmethod
def split_letters(word=u''):
    """Return the graphemes (i.e. the Tamil characters) in a given word as a list."""
    # ensure that the word is a valid word
    TamilWord.validate(word)

    # list (which will be returned to the user)
    letters = []

    # all combination endings and all அ combinations
    combination_endings = TamilLetter.get_combination_endings()
    a_combinations = TamilLetter.get_combination_column(u'அ').values()

    # loop through each codepoint in the input string
    for codepoint in word:
        # if codepoint is an அ combination, a vowel, aytham or a space,
        # add it to the list
        if (codepoint in a_combinations
                or TamilLetter.is_whitespace(codepoint)
                or TamilLetter.is_vowel(codepoint)
                or TamilLetter.is_aytham(codepoint)):
            letters.append(codepoint)

        # if codepoint is a combination ending or a pulli ('்'), add it
        # to the end of the previously-added codepoint
        elif (codepoint in combination_endings
                or codepoint == TamilLetter.get_pulli()):
            # ensure that at least one character already exists
            if letters:
                letters[-1] += codepoint
            # otherwise raise an error. However, TamilWord.validate()
            # should catch this
            else:
                raise ValueError("%s cannot be first character of a word" % codepoint)

    return letters
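If you don't have helper classes like these, roughly the same logic can be sketched using only code point ranges from the Unicode Tamil block. This is my own simplified take, not the code above: the names is_tamil_mark and split_tamil_letters are made up, I am assuming the marks are U+0BBE to U+0BCD (vowel signs and pulli) plus U+0BD7 (au length mark), and there is no word validation, so any code point that is not a mark simply becomes its own element.

```python
def is_tamil_mark(ch):
    """True for Tamil vowel signs, the pulli and the au length mark (assumed ranges)."""
    cp = ord(ch)
    return 0x0BBE <= cp <= 0x0BCD or cp == 0x0BD7


def split_tamil_letters(word):
    """Split a string into Tamil letters by attaching each mark to the letter before it."""
    letters = []
    for ch in word:
        if is_tamil_mark(ch):
            # A mark needs a preceding letter to attach to
            if not letters:
                raise ValueError("%s cannot be first character of a word" % ch)
            letters[-1] += ch
        else:
            # Vowels, aytham, a-combinations and anything else stand alone
            letters.append(ch)
    return letters


print(split_tamil_letters("தமிழ்"))  # ['த', 'மி', 'ழ்']
```

Checking for marks instead of listing every vowel and a-combination keeps the sketch short, at the cost of not rejecting non-Tamil input.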