Python - pyparsing Unicode characters

Question

14

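:) I tried using w = Word(printables), but it isn't working. How should I write the spec for this? 'w' is meant to handle Hindi characters (UTF-8).

The code specifies the grammar and parses accordingly.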

671.assess  :: अहसास  ::2
x=number + "." + src + "::" + w + "::" + number + "." + number

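If the input contains only English characters it works, so the code is correct for ASCII input, but it fails on Unicode input.

I mean that the code works when we have something of the form: 671.assess :: ahsaas ::2

That is, it parses words written in English script, but I am not sure how to parse and then print characters in Unicode. I need this for English-Hindi word alignment.

The Python code looks like this: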

# -*- coding: utf-8 -*-
from pyparsing import Literal, Word, Optional, nums, alphas, ZeroOrMore, printables, Group, alphas8bit
# grammar 
src = Word(printables)
trans =  Word(printables)
number = Word(nums)
x=number + "." + src + "::" + trans + "::" + number + "." + number
#parsing for eng-dict
efiledata = open('b1aop_or_not_word.txt').read()
eresults = x.parseString(efiledata)
edict1 = {}
edict2 = {}
counter=0
xx=list()
for result in eresults:
  trans=""#translation string
  ew=""#english word
  xx=result[0]
  ew=xx[2]
  trans=xx[4]   
  edict1 = { ew:trans }
  edict2.update(edict1)
print len(edict2) #no of entries in the english dictionary
print "edict2 has been created"
print "english dictionary" , edict2 

#parsing for hin-dict
hfiledata = open('b1aop_or_not_word.txt').read()
hresults = x.scanString(hfiledata)
hdict1 = {}
hdict2 = {}
counter=0
for result in hresults:
  trans=""#translation string
  hw=""#hin word
  xx=result[0]  
  hw=xx[2]
  trans=xx[4]
  #print trans
  hdict1 = { trans:hw }
  hdict2.update(hdict1)

print len(hdict2) #no of entries in the hindi dictionary
print"hdict2 has been created"
print "hindi dictionary" , hdict2
'''
#######################################################################################################################

def translate(d, ow, hinlist):
   if ow in d.keys():#ow=old word d=dict
    print ow , "exists in the dictionary keys"
        transes = d[ow]
    transes = transes.split()
        print "possible transes for" , ow , " = ", transes
        for word in transes:
            if word in hinlist:
        print "trans for" , ow , " = ", word
                return word
        return None
   else:
        print ow , "absent"
        return None

f = open('bidir','w')
#lines = ["'\
#5# 10 # and better performance in business in turn benefits consumers .  # 0 0 0 0 0 0 0 0 0 0 \
#5# 11 # vHyaapaar mEmn bEhtr kaam upbhOkHtaaomn kE lIe laabhpHrdd hOtaa hAI .  # 0 0 0 0 0 0 0 0 0 0 0 \
#'"]
data=open('bi_full_2','rb').read()
lines = data.split('!@#$%')
loc=0
for line in lines:
    eng, hin = [subline.split(' # ')
                for subline in line.strip('\n').split('\n')]

    for transdict, source, dest in [(edict2, eng, hin),
                                    (hdict2, hin, eng)]:
        sourcethings = source[2].split()
        for word in source[1].split():
            tl = dest[1].split()
            otherword = translate(transdict, word, tl)
            loc = source[1].split().index(word)
            if otherword is not None:
                otherword = otherword.strip()
                print word, ' <-> ', otherword, 'meaning=good'
                if otherword in dest[1].split():
                    print word, ' <-> ', otherword, 'trans=good'
                    sourcethings[loc] = str(
                        dest[1].split().index(otherword) + 1)

        source[2] = ' '.join(sourcethings)

    eng = ' # '.join(eng)
    hin = ' # '.join(hin)
    f.write(eng+'\n'+hin+'\n\n\n')
f.close()
'''

If a sample input sentence from the source file is:

1# 5 # modern markets : confident consumers  # 0 0 0 0 0 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 0 0 0 0 0 0 
!@#$%

the output would look like this:

1# 5 # modern markets : confident consumers  # 1 2 3 4 5 
1# 6 # AddhUnIk baajaar : AshHvsHt upbhOkHtaa .  # 1 2 3 4 5 0 
!@#$%

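Explanation of the output: this achieves bidirectional alignment. It means the first word of the English sentence, "modern", maps to the first word of the Hindi sentence, "AddhUnIk", and vice versa. Here even punctuation characters are treated as words, since they are also part of the bidirectional mapping. So you can see that the Hindi token "." has a null alignment: it maps to nothing in the English sentence, which has no full stop. The third line of the output is a delimiter between sentence pairs, used when we process many sentences for bidirectional mapping.

What modifications do I need to make for this to work if the Hindi sentences are in Unicode (UTF-8) format?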

- boddhisattva

1

Please edit this question and use proper formatting so that it is readable. - Ignacio Vazquez-Abrams

3 Answers

8

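As a general rule, do not process encoded bytestrings: convert them to proper unicode strings (by calling their .decode method) as soon as possible, do all your processing on unicode strings, and then, if you must for I/O purposes, encode them back into whatever bytestring encoding you require.

If you are talking about literals, as it seems you are in your code, "as soon as possible" means "at once": use u'...' to express your literals. In the more general case, where you are forced to do I/O in encoded form, decode immediately after input (just as you encode immediately before output when the output must be in a specific encoding).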
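
A minimal sketch of that decode-early, encode-late pattern, using the sample line from the question. The variable names are illustrative only, and the bytes are built inline here rather than read from a real file:

```python
# -*- coding: utf-8 -*-
# Bytes as they might arrive from a UTF-8 encoded file (sample line from the question).
raw = u"671.assess  :: अहसास  ::2".encode("utf-8")

# 1. Decode once, right after input.
text = raw.decode("utf-8")

# 2. Do all processing on the unicode string.
fields = [part.strip() for part in text.split("::")]
hindi_word = fields[1]

# 3. Encode once, right before output.
encoded_out = hindi_word.encode("utf-8")

print(len(hindi_word))   # number of code points
print(len(encoded_out))  # number of UTF-8 bytes
```

The two lengths differ because each Devanagari code point takes several bytes in UTF-8, which is one reason byte-level processing of such text goes wrong.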

- Alex Martelli

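Hi, thanks for your answer. Everything you say in the second paragraph applies to my case. I tried this in the following line of code: trans = u'Word(printables)' but I did not get the desired output. Please correct me if I changed the wrong line, because after this change the lines that define the grammar fail with an error saying printables was expected at that position. - boddhisattva

@mgj, don't assign a unicode string literal to trans, that makes no sense. Just make sure printables is a unicode object (not a utf8-encoded byte string, nor a byte string in any other encoding) and use trans = Word(printables). If your file is utf-8 encoded, or in any other encoding, decode it with codecs.open from the codecs module rather than the built-in open you are using now, so that each line is a unicode object rather than a byte string (in whatever encoding). - Alex Martelli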
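
A rough sketch of reading a file so that every line is already decoded. Note the swap: it uses io.open, which takes the same encoding argument as codecs.open and behaves the same way on Python 2 and 3. The helper name read_unicode_lines and the throwaway demo file are hypothetical, not part of the question's code:

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_unicode_lines(path, encoding="utf-8"):
    """Return the file's lines as unicode strings, decoded while reading."""
    with io.open(path, "r", encoding=encoding) as handle:
        return [line.rstrip(u"\n") for line in handle]


# Demo: write the sample line from the question as UTF-8 bytes to a temp file.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as out:
    out.write(u"671.assess  :: अहसास  ::2\n".encode("utf-8"))

lines = read_unicode_lines(demo_path)
os.remove(demo_path)

print(len(lines))     # number of lines read
print(len(lines[0]))  # length in code points, not bytes
```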

1

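I was searching for French Unicode characters and came across this question. If you are looking for French or other Latin accented characters, with pyparsing 2.3.0 you can use: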

>>> pp.pyparsing_unicode.Latin1.alphas
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ'
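
As a small illustration (a sketch assuming pyparsing 2.3.0 or later is installed; the French sample text is made up), a Word built from that character set keeps going through the accented letter, while the ASCII-only alphas stops at it:

```python
# -*- coding: utf-8 -*-
import pyparsing as pp

# A word made of ASCII letters plus the Latin-1 accented letters.
french_word = pp.Word(pp.pyparsing_unicode.Latin1.alphas)
# The same thing restricted to pyparsing's ASCII-only alphas, for comparison.
ascii_word = pp.Word(pp.alphas)

sample = u"château fort"
tokens = french_word.parseString(sample).asList()
ascii_tokens = ascii_word.parseString(sample).asList()

print(len(tokens[0]))        # 7 characters matched
print(len(ascii_tokens[0]))  # 2 characters matched, stops before the accent
```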

- snoob dogg

- PaulMcG · Accepted Answer

Pyparsing's printables only covers strings in the ASCII range of characters. If you want printables across the full Unicode range, you can write:

import sys

unicodePrintables = u''.join(unichr(c) for c in xrange(sys.maxunicode)
                                        if not unichr(c).isspace())

Now you can define trans using this more complete set of non-space characters:

trans = Word(unicodePrintables)

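I was unable to test this against your Hindi test string, but I think this will do the trick.

(If you are using Python 3, there is no separate unichr function and no xrange generator, so just use:)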

import sys

unicodePrintables = ''.join(chr(c) for c in range(sys.maxunicode)
                                        if not chr(c).isspace())
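
A quick standard-library check (Python 3) that this character set covers the Hindi word from the question. The ascii_printables line is an assumption about how pyparsing builds its own printables (string.printable minus whitespace), included only for comparison:

```python
# -*- coding: utf-8 -*-
import string
import sys

# Python 3 version of the full-range character set.
unicode_printables = "".join(
    chr(c) for c in range(sys.maxunicode) if not chr(c).isspace()
)

# Assumption: mirrors pyparsing's ASCII-only printables.
ascii_printables = "".join(c for c in string.printable if not c.isspace())

hindi_word = u"अहसास"
covered_by_unicode = all(ch in unicode_printables for ch in hindi_word)
covered_by_ascii = any(ch in ascii_printables for ch in hindi_word)

print(covered_by_unicode)  # every Hindi character is in the full set
print(covered_by_ascii)    # none of them are in the ASCII set
```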

Edit:

With the recent release of pyparsing 2.3.0, new namespace classes have been defined that provide printables, alphas, nums, and alphanums for various Unicode language ranges.

import pyparsing as pp
pp.Word(pp.pyparsing_unicode.printables)
pp.Word(pp.pyparsing_unicode.Devanagari.printables)
pp.Word(pp.pyparsing_unicode.देवनागरी.printables)
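
Putting this together with the grammar from the question, a sketch assuming pyparsing 2.3.0 or later. The grammar here is trimmed to match the sample line, which ends in "::2" rather than the "number.number" of the original expression, and the name entry is illustrative:

```python
# -*- coding: utf-8 -*-
import pyparsing as pp

number = pp.Word(pp.nums)
src = pp.Word(pp.printables)  # English headword (ASCII)
trans = pp.Word(pp.pyparsing_unicode.Devanagari.printables)  # Hindi word

# Trimmed version of the question's grammar (assumption: one trailing number).
entry = number + "." + src + "::" + trans + "::" + number

tokens = entry.parseString(u"671.assess  :: अहसास  ::2").asList()

# Same positions the question's code reads: xx[2] and xx[4].
english, hindi = tokens[2], tokens[4]
print(english)
print(len(hindi))  # length of the Hindi word in code points
```

Devanagari.printables is used for trans rather than Devanagari.alphas because the word contains a vowel sign, which is a combining mark rather than a letter.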