Cython: convert string to Unicode

Question

I am trying to use this library, https://github.com/pytries/datrie, to process Chinese text.

But I ran into a problem: it does not encode and decode Chinese Unicode correctly:

import datrie
text = htmls_2_text(input_dir)
trie = datrie.Trie(''.join(set(text))) # about 2221 unique chars
trie['今天天气真好'] = 111
trie['今天好'] = 222
trie['今天'] = 444

print(trie.items())

[('今义', 444), ('今义义傲兢于', 111), ('今义于', 222)]

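Unique characters: https://pastebin.com/n2i280i8

The result is clearly wrong, so there is evidently an encoding/decoding error somewhere.

I then looked at the source code: https://github.com/pytries/datrie/blob/master/src/datrie.pyx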

cdef cdatrie.AlphaChar* new_alpha_char_from_unicode(unicode txt):
    """
    Converts Python unicode string to libdatrie's AlphaChar* format.
    libdatrie wants null-terminated array of 4-byte LE symbols.
    The caller should free the result of this function.
    """
    cdef int txt_len = len(txt)
    cdef int size = (txt_len + 1) * sizeof(cdatrie.AlphaChar)

    # allocate buffer
    cdef cdatrie.AlphaChar* data = <cdatrie.AlphaChar*> malloc(size)
    if data is NULL:
        raise MemoryError()

    # Copy text contents to buffer.
    # XXX: is it safe? The safe alternative is to decode txt
    # to utf32_le and then use memcpy to copy the content:
    #
    #    py_str = txt.encode('utf_32_le')
    #    cdef char* c_str = py_str
    #    string.memcpy(data, c_str, size-1)
    #
    # but the following is much (say 10x) faster and this
    # function is really in a hot spot.
    cdef int i = 0
    for char in txt:
        data[i] = <cdatrie.AlphaChar> char
        i+=1

    # Buffer must be null-terminated (last 4 bytes must be zero).
    data[txt_len] = 0
    return data


cdef unicode unicode_from_alpha_char(cdatrie.AlphaChar* key, int len=0):
    """
    Converts libdatrie's AlphaChar* to Python unicode.
    """
    cdef int length = len
    if length == 0:
        length = cdatrie.alpha_char_strlen(key)*sizeof(cdatrie.AlphaChar)
    cdef char* c_str = <char*> key
    return c_str[:length].decode('utf_32_le')

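I tried replacing the current, faster trick with the commented-out txt.encode('utf_32_le') block, but that did not help either.

I don't see anything wrong with this code, so where is the problem?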
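
For comparison, here is a minimal pure-Python sketch of what the two conversion strategies in the Cython code compute. It is not datrie code: the function names are made up for illustration, and Python 3 is assumed. Both strategies yield the same 4-byte values, which suggests the conversion itself is not where the keys get mangled.

```python
import struct

def to_alpha_chars(txt):
    # Mimics the fast loop: one integer per code point, then a 0 terminator.
    return [ord(ch) for ch in txt] + [0]

def to_alpha_chars_via_encode(txt):
    # Mimics the commented-out alternative: encode as UTF-32-LE and
    # reinterpret every 4 bytes as a little-endian unsigned integer.
    raw = txt.encode('utf_32_le')
    count = len(raw) // 4
    return list(struct.unpack('<%dI' % count, raw)) + [0]

def from_alpha_chars(values):
    # Mimics unicode_from_alpha_char: stop at the terminator, decode UTF-32-LE.
    end = values.index(0)
    raw = struct.pack('<%dI' % end, *values[:end])
    return raw.decode('utf_32_le')

key = '今天天气真好'
assert to_alpha_chars(key) == to_alpha_chars_via_encode(key)
assert from_alpha_chars(to_alpha_chars(key)) == key
```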

- Mithril

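You don't show how you create the trie. That matters, because you have to pass the Unicode ranges of the keys to the constructor so that it can handle them. - user2390182

@schwobaseggl Thanks for the hint. I found that 萨 was not in the unique-character input, which is why there were only 3 items. Question updated. - Mithril

Interesting observation: if you initialize it with trie = datrie.Trie(''.join(set(u'今天天气真好今天好今天'))), the problem goes away. You could try to find a minimal set of Chinese characters in "text" whose removal makes the problem disappear. - gmoss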

By the way, if you set

bad_chars = ' 伊亏外会两售嘴吸傻勇呈凑嚏坚凡周切》五嘟厢假嘿参准哈K俊何[也呣令剧k共味凶基喃嚅井军吞…伴为便临壁仅V促以噪俯兰号偶击、吉厌土冠吩劲伟人A净升双哎坟a几卧嗨仪壶固刀匆丈再伎堂化嘘“伞到值圾啊L发停啪l七历喊复冒加别催十G免城卑取付卡佩呯修危乳偷侃。分争圈吊唔优务匠侣"

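and use ''.join(set(text)-set(bad_chars)), it works. Again, I am not suggesting that you remove these characters, but it is an interesting data point to help you start investigating the bug. - gmoss

@gmoss I have split the unique characters into several groups for testing. For example: 1. unique_string[:478] gives no error; 2. unique_string[:480] gives an error; 3. unique_string[478:480] gives no error; 4. unique_string[460:500] gives no error. This is very confusing. PS: I have already switched to another library, but there must be a bug somewhere in the C code. I just want to find out what causes the problem. - Mithril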

1 Answer

- gmoss · Accepted Answer

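It looks like the problem is that the datrie package supports at most 255 distinct character values in the key set: https://github.com/pytries/datrie/blob/master/libdatrie/datrie/alpha-map.h#L59 I would suggest using marisa_trie.RecordTrie from here instead: https://pypi.python.org/pypi/marisa-trie Unfortunately it is a static data structure, so it cannot be modified after it is built, but it fully supports Unicode, serialization to disk, and a variety of value types.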

>>> from marisa_trie import RecordTrie
>>> rt = RecordTrie(">I", [(u'今天天气真好', (111,)), (u'今天好', (222,)), (u'今天', (444,))])
>>> for x in rt.items():
...     print x[0], x[1]
...
今天天气真好 (111,)
今天好 (222,)
今天 (444,)

(Note that this example uses Python 2.7, hence the u'' prefixes and the print loop.)

Edit

If you absolutely must use datrie.Trie, you can make it work in a rather silly way:
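
The 255-character limit cited in this answer can be checked before a trie is built. The following is a rough sketch in plain Python 3. The helper name and the constant are hypothetical, and the limit is taken from the answer rather than verified against libdatrie here.

```python
# Limit quoted in the answer (alpha-map.h); treated as an assumption here.
DATRIE_MAX_CHARS = 255

def alphabet_report(alphabet, limit=DATRIE_MAX_CHARS):
    """Return (number of distinct characters, whether that fits the limit)."""
    distinct = set(alphabet)
    return len(distinct), len(distinct) <= limit

# The three keys from the question use only 5 distinct characters ...
size, fits = alphabet_report('今天天气真好今天好今天')
print(size, fits)  # 5 True

# ... whereas an alphabet of about 2221 characters, as in the question, does not fit.
big_alphabet = ''.join(chr(0x4E00 + i) for i in range(2221))
print(alphabet_report(big_alphabet))  # (2221, False)
```

This would be consistent with the observation in the comments that a small alphabet works while the full character set does not.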

def encode(s):
    return ''.join('%08x' % ord(x) for x in s)

def decode(s):
    return ''.join(chr(int(s[n:n+8], 16)) for n in range(0, len(s), 8))

>>> trie = datrie.Trie('0123456789abcdef')
>>> trie[encode('今天天气真好')] = 111
>>> trie[encode('今天好')] = 222
>>> trie[encode('今天')] = 444
>>> [decode(x) for x in trie.keys()]
['今天', '今天天气真好', '今天好']

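I used a width of 8 hex digits because 32 bits is the maximum width of any UTF-8-encoded character; Unicode code points never exceed 0x10FFFF, so 6 digits would in fact be enough. You can save space by computing max(ord(x) for x in text) and sizing the padding to fit it. Alternatively, you could come up with your own encoding scheme that uses at most 255 character values. This is just a very quick and inefficient solution.

Of course, this somewhat defeats the purpose of using a trie...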
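
As a sketch of the space-saving idea, the padding width can be derived from the largest code point in the data. The names hex_width, encode, and decode, and the extra width parameter, are illustrative choices. The text is assumed to be non-empty, and the encoded keys would still be stored in a trie created with the alphabet '0123456789abcdef'.

```python
def hex_width(text):
    """Number of hex digits needed for the largest code point in text."""
    return len('%x' % max(ord(ch) for ch in text))

def encode(s, width):
    # Zero-pad every code point to the same number of hex digits.
    return ''.join('%0*x' % (width, ord(ch)) for ch in s)

def decode(s, width):
    # Split into fixed-width chunks and turn each back into a character.
    return ''.join(chr(int(s[n:n + width], 16)) for n in range(0, len(s), width))

keys = ['今天天气真好', '今天好', '今天']
width = hex_width(''.join(keys))  # these characters all fit in 4 hex digits
encoded = [encode(k, width) for k in keys]
assert [decode(e, width) for e in encoded] == keys
```

With a fixed width the encoding still preserves prefixes, so a key that is a prefix of another remains a prefix after encoding.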
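
One possible custom scheme that stays under 255 character values is to store each key as its UTF-8 bytes, with every byte mapped to one character. This is only a sketch: the helper names are hypothetical, and it has not been tested against datrie, so whether datrie accepts such an alphabet (it includes control characters) is an assumption.

```python
def encode_key(s):
    # Each UTF-8 byte becomes one character in the range U+0000..U+00FF.
    return s.encode('utf-8').decode('latin-1')

def decode_key(s):
    # Reverse the mapping: characters back to bytes, bytes back to text.
    return s.encode('latin-1').decode('utf-8')

# Byte values that can occur in valid UTF-8, excluding NUL:
# 0xC0, 0xC1 and 0xF5..0xFF never appear.
ALPHABET = ''.join(chr(b) for b in range(1, 0xF5) if b not in (0xC0, 0xC1))

key = '今天天气真好'
assert decode_key(encode_key(key)) == key
assert set(encode_key(key)) <= set(ALPHABET)
assert len(ALPHABET) <= 255
```

Compared with the hex scheme, this uses 3 characters per CJK character instead of 8, and it also preserves key prefixes.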