regexp_tokenize and Arabic text

Question

I am using regexp_tokenize() to return the words from an Arabic text without any punctuation marks.

import re,string,sys
from nltk.tokenize import  regexp_tokenize

def PreProcess_text(Input):
  tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
  return tokens

H = raw_input('H:')
Cleand= PreProcess_text(H)
print  '\n'.join(Cleand)

It runs without errors, but the problem appears when I try to print the text.

Output for the text ايمان،سعد:

    ?يم
    ?ن
    ?
    ?
    ?

But if the text is in English, even with Arabic punctuation marks, it prints the correct result.

Output for the text hi،eman:

     hi
     eman

- Eman

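What is the expected output for your Arabic text? - NullUserException

You are using Python 2.x, right? In Python 3.4, when I enter ايمان،سعد, I get ايمان and سعد. - Wiktor Stribiżew

Please use @ + username to notify a user of your reply. I suggest using the u prefix: ur'[\u060C\u061F!.\u061B]\s*', and not passing H as-is. Try unicode(H, "utf-8") or H.decode('utf8'). - Wiktor Stribiżew

@WiktorStribiżew At first I tried unicode(H, "utf-8") and H.decode('utf8'), but I got an error when printing. I think the solution is to switch to Python 3. If you know how to do that on a Mac, that would be very helpful. Thank you. - Eman

@WiktorStribiżew Thank you very much for your help. The H.decode('utf8') approach solved the problem perfectly! Thanks again. - Eman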

1 Answer

- Wiktor Stribiżew · Accepted Answer

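When you use raw_input in Python 2, the characters you type are returned as an encoded byte string, not as Unicode text.

You need to convert it to a Unicode string, like this: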

H.decode('utf8')

You can keep your regular expression as it is:

tokens=regexp_tokenize(Input, r'[،؟!.؛]\s*', gaps=True)
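
To see why the undecoded input produces the broken fragments shown in the question, here is a minimal sketch. It makes some assumptions that are not in the original post: it is written for Python 3, where bytes and text are separate types, and it uses `re.split` as a stand-in for `regexp_tokenize(..., gaps=True)` so that it runs without NLTK. The names `PUNCT`, `split_text`, and `split_bytes` are hypothetical helpers for illustration only.

```python
import re

# The punctuation class from the question: ، ؟ ! . ؛
PUNCT = '[\u060c\u061f!.\u061b]'


def split_text(text):
    """Split decoded (Unicode) text on the punctuation class."""
    return [t for t in re.split(PUNCT + r'\s*', text) if t]


def split_bytes(raw):
    """Split undecoded UTF-8 bytes with the same class, byte by byte.

    This imitates what happens in Python 2 when both the pattern and
    the raw_input() result are plain byte strings.
    """
    byte_pattern = PUNCT.encode('utf-8') + rb'\s*'
    return [t for t in re.split(byte_pattern, raw) if t]


# ايمان،سعد written with escapes so the code points are unambiguous
text = '\u0627\u064a\u0645\u0627\u0646\u060c\u0633\u0639\u062f'

# Decoded text splits into the two expected words.
print(split_text(text))

# Raw bytes split into five fragments, each starting with an orphaned
# continuation byte, which a terminal renders as "?".
for fragment in split_bytes(text.encode('utf-8')):
    print(fragment, fragment.decode('utf-8', errors='replace'))
```

In UTF-8, ، ؟ and ؛ all start with the lead byte 0xD8, so the byte-level character class treats a lone 0xD8 as a delimiter. Letters such as ا, س, ع and د also start with 0xD8, so they are cut in half, while ي, م and ن (lead byte 0xD9) survive. ASCII text such as hi،eman contains no other 0xD8 bytes, which is why it appeared to work.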