Problem using a regular expression with Unicode (Japanese) characters in Python

Question

7

I want to remove part of the string below (the first "加護亜依", shown in bold in the original post). The string is stored in oldString.

[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY

I am using the following regular expression in Python.

p=re.compile(ur"( [\W]+) (?=[A-Za-z ]+–)", re.UNICODE)
newString=p.sub("", oldString)

When I print newString, nothing has been removed.

- Paul Thomas
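
A likely reason nothing is removed can be sketched as follows. This assumes Python 3, where str patterns are Unicode-aware by default (comparable to passing re.UNICODE in Python 2). Kanji count as word characters there, so [\W]+ cannot match 加護亜依. The variable names are illustrative only.

```python
import re

old_string = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"

# In Unicode mode, CJK ideographs are letters, so \w matches them ...
kanji_are_word_chars = re.fullmatch(r"\w+", "加護亜依") is not None

# ... and \W (the negation) finds nothing inside the name.
non_word_hit = re.search(r"\W", "加護亜依")

# The pattern from the question therefore has nothing to remove.
p = re.compile(r"( [\W]+) (?=[A-Za-z ]+–)")
new_string = p.sub("", old_string)

print(kanji_are_word_chars)      # True
print(non_word_hit)              # None
print(new_string == old_string)  # True
```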

oldString should be converted to Unicode. Has it already been converted? How do you obtain it? Try oldString = unicode(oldString, "utf-8") before declaring p. - Wiktor Stribiżew

What is your expected output? - Mazdak

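@stribizhev I specified # -*- coding: utf-8 -*- at the top of the file, which, from what I have read, should convert it to Unicode. I am getting the string from an HTML page. @Kasramvd The expected output is "[DMSM-8433] Kago Ai – 加護亜依 vs. FRIDAY". - Paul Thomas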
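
For what it is worth, the coding declaration only tells the interpreter how to decode the source file itself. It does not convert data read at run time, such as the bytes of an HTML page. Here is a minimal sketch of the explicit decode step, assuming Python 3 and assuming the page is UTF-8 encoded. to_text is a hypothetical helper, and raw merely stands in for the downloaded bytes.

```python
def to_text(data, encoding="utf-8"):
    """Return text for either bytes or str input (hypothetical helper)."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Stand-in for bytes fetched from an HTML page (assumed to be UTF-8).
raw = "加護亜依 Kago Ai".encode("utf-8")
text = to_text(raw)

print(len(raw))   # 20 bytes: each kanji takes 3 bytes in UTF-8
print(len(text))  # 12 characters
```

In Python 2 the corresponding step is the unicode(oldString, "utf-8") call suggested in the earlier comment.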

Try this code snippet. - Wiktor Stribiżew

Related: https://dev59.com/l2Up5IYBdhLWcg3wz54H#15034560 - nhahtdh

@stribizhev, that seems to work really well, thank you! - Paul Thomas

2 Answers

1

I think you should use a regular expression like this:

([\p{Hiragana}\p{Katakana}\p{Han}]+)

Please also refer to this documentation.

Edit: I have also tested it here.

- teoreda

1

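Python's re does not support Unicode properties. There is the regex package, of course, but you need to mention it in your answer. (Also, I am not sure whether the syntax above would be accepted by the regex package.) - nhahtdh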

This seems to work in PHP but not in Python. When run through Python, it strips "Kag" and "i" from "Kago Ai". - Paul Thomas
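
The behaviour described in the two comments above can be reproduced in a small sketch. It assumes Python 3.6 or later, where the standard re module rejects \p as an unknown escape. Python 2's re instead treated \p as a literal p, which turned the class into a plain set of ASCII characters. The emulated class below is written out by hand purely for illustration.

```python
import re

pattern = r"[\p{Hiragana}\p{Katakana}\p{Han}]+"

# Python 3.6+: the standard re module raises an error for the \p escape.
try:
    re.compile(pattern)
    compiled_ok = True
except re.error:
    compiled_ok = False

# Emulation of the older behaviour: with the backslashes ignored, the class
# is just the literal characters p, {, }, H, i, r, a, g, n, K, t and k.
literal_chars = "p{Hiragana}p{Katakana}p{Han}"
emulated = re.compile("[" + re.escape(literal_chars) + "]+")

print(emulated.sub("", "Kago Ai"))  # o A
```

The letters K, a, g and i are all members of that accidental class, which matches the stripping reported above. The third-party regex package does understand \p{Han}, \p{Hiragana} and \p{Katakana}; it is left out here so that the sketch needs only the standard library.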

@nhahtdh I am currently using the re package and did not realize there was another one. I will read through the linked material. - Paul Thomas

Please test with the correct settings: https://regex101.com/r/oE0oL5/3. The results will be very different. - Wiktor Stribiżew

I agree with you, @stribizhev - teoreda

- Wiktor Stribiżew · Accepted Answer

You can use the following code snippet to solve this problem:

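#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script
import re

# Unicode literal, so the pattern and the subject are the same string type
text = u'[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY'
# Explicit ranges for Japanese characters, then a space and a lookahead
# for the Latin name followed by the dash
regex = u'[\u3000-\u303f\u3040-\u309f\u30a0-\u30ff\uff00-\uff9f\u4e00-\u9faf\u3400-\u4dbf]+ (?=[A-Za-z ]+–)'
p = re.compile(regex, re.U)
match = p.sub("", text)
# Encode the unicode result to UTF-8 bytes before printing
print match.encode("UTF-8")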

See the IDEONE demo.

In addition to declaring # -*- coding: utf-8 -*-, I added @nhahtdh's character class to detect Japanese symbols.

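Note that match has to be encoded to a UTF-8 byte string by hand before printing, because Python 2 needs to be "reminded" that we are working with Unicode all along.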
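
For readers on Python 3, the same idea might look like the sketch below. It is an adaptation under the assumption of Python 3 rather than the code from the answer: string literals are already Unicode, so neither the u prefix nor the manual encode step is needed. The name JAPANESE_RANGES and the per-range comments are additions for illustration.

```python
import re

# Same ranges as in the answer, split up so that each can be labelled.
JAPANESE_RANGES = (
    "\u3000-\u303f"  # CJK symbols and punctuation
    "\u3040-\u309f"  # hiragana
    "\u30a0-\u30ff"  # katakana
    "\uff00-\uff9f"  # full-width forms and half-width katakana
    "\u4e00-\u9faf"  # CJK unified ideographs
    "\u3400-\u4dbf"  # CJK unified ideographs, extension A
)

# One or more Japanese characters and a space, but only when followed by
# the Latin-letter name and the en dash.
pattern = re.compile("[" + JAPANESE_RANGES + "]+ (?=[A-Za-z ]+–)")

text = "[DMSM-8433] 加護亜依 Kago Ai – 加護亜依 vs. FRIDAY"
result = pattern.sub("", text)

print(result)  # [DMSM-8433] Kago Ai – 加護亜依 vs. FRIDAY
```

The second 加護亜依 is kept because the lookahead requires an en dash after the Latin text that follows it, and "vs. FRIDAY" has none.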