How to split words containing Unicode characters?

Question


3

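I am working on an NLP project that involves emojis in tweets.

Here is an example tweet:
"sometimes i wish i wa an octopus so i could slap 8 people at once🐙"

My problem is that once🐙 is treated as a single word, so I would like to split that single token in two, so that my tweet looks like this:
"sometimes i wish i wa an octopus so i could slap 8 people at once 🐙"

Note that I already have a compiled regex that contains every emoji!

Since I have hundreds of thousands of tweets, I am looking for an efficient way to do this, but I cannot figure out where to start.

Thanks.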

- Thomas Reynaud

2 Answers

1

You can use re.sub to insert a space before the trailing non-word characters:

re.sub(r'(\W+)(?= |$)', r' \1', string)

Example:

>>> string
'sometimes i wish i wa an octopus so i could slap 8 people at once\xf0\x9f\x90\x99'
>>> re.sub(r'(\W+)(?= |$)', r' \1', string)
'sometimes i wish i wa an octopus so i could slap 8 people at once \xf0\x9f\x90\x99'

>>> string = 'sometimes i wish i wa an octopus so i could slap 8 people at once\xf0\x9f\x90\x99 foobar'
>>> re.sub(r'(\W+)(?= |$)', r' \1', string)
'sometimes i wish i wa an octopus so i could slap 8 people at once \xf0\x9f\x90\x99 foobar'
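
The session above appears to be Python 2, where the emoji shows up as four UTF-8 bytes. As a minimal sketch, assuming Python 3 and `str` input, the same substitution could be compiled once and reused for every tweet. The names `TRAILING_NON_WORD` and `separate_trailing_symbols` are made up for illustration. Note that `\W+` matches any run of non-word characters, so punctuation glued to the end of a word would be split off too, not only emoji.

```python
import re

# Same pattern as in the answer, compiled once so it can be reused
# across a large number of tweets.
TRAILING_NON_WORD = re.compile(r'(\W+)(?= |$)')


def separate_trailing_symbols(text):
    """Insert a space before a run of non-word characters that is
    followed by a space or by the end of the string."""
    return TRAILING_NON_WORD.sub(r' \1', text)


# \U0001F419 is the octopus emoji from the question.
tweet = 'sometimes i wish i wa an octopus so i could slap 8 people at once\U0001F419'
print(separate_trailing_symbols(tweet))
```

For a list of tweets, a list comprehension such as `[separate_trailing_symbols(t) for t in tweets]` would apply it to each one.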

- heemayl


- L3viathan · Accepted Answer

You could do something like this:

>>> import re
>>> s = "sometimes i wish i wa an octopus so i could slap 8 people at once🐙"
>>> re.findall(r"(\w+|[^\w ]+)", s)
['sometimes', 'i', 'wish', 'i', 'wa', 'an', 'octopus', 'so', 'i', 'could', 'slap', '8', 'people', 'at', 'once', '🐙']

If you need them as a single space-separated string again, just join them:

>>> " ".join(re.findall(r"(\w+|[^\w ]+)", s))
'sometimes i wish i wa an octopus so i could slap 8 people at once 🐙'

Edit: Fixed.
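
Since the question mentions hundreds of thousands of tweets, the findall-and-join idea could be wrapped in a small helper with the pattern compiled up front. This is only a sketch under the assumption of Python 3 `str` input; `TOKEN`, `split_tokens`, and the sample tweets are hypothetical names and data, not part of the answer.

```python
import re

# Either a run of word characters, or a run of characters that are
# neither word characters nor spaces (emoji, punctuation, ...).
TOKEN = re.compile(r"\w+|[^\w ]+")


def split_tokens(tweet):
    """Return the tweet with every token separated by a single space."""
    return " ".join(TOKEN.findall(tweet))


# Made-up sample data; \U0001F419 is an octopus, \U0001F355 a pizza slice.
tweets = [
    "slap 8 people at once\U0001F419",
    "i love pizza\U0001F355",
]

cleaned = [split_tokens(t) for t in tweets]
for line in cleaned:
    print(line)
```

Because the second alternative matches anything that is not a word character or a space, punctuation is separated from words as well. If only emoji should be split off, substituting with the emoji regex the question already has may be the safer choice.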