How do I replace Unicode characters in a string with something else in Python?

Question

python unicode

55

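I have a string that I read from an HTML web page. It contains bulleted lists, so it includes symbols like "•". Note that the text is the page's HTML source, fetched with Python 2.7's urllib2.read(webaddress).

I know the bullet character is U+2022 in Unicode, but how do I actually replace that Unicode character with something else?

I tried str.replace("•", "something"), but it does not appear to work... How do I do this?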

- Rolando

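What is the type of the string, and which version of Python are you using? - Fred Foo

I am using Python 2.7, and the string is produced by urllib2.read(). - Rolando

Sorry, I am not going to download a web page with urllib2 right now. What is the type? Is it str or unicode? - Fred Foo

@Damascusi: type(str). Unfortunately, there is no "plain string" type in Python 2.x; there are two string types. - Fred Foo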

1

If your Python source code contains UTF-8 characters, you should put the "magic comment" # coding=utf8 on the first or second line of the file. - Kinjal Dixit
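
The comments above circle around the byte string versus Unicode string distinction. As a rough illustration only, here is a minimal sketch in Python 3 (where the two types are kept strictly apart), assuming the page is UTF-8 encoded. The variable `raw` is a made-up stand-in for whatever the download returns.

```python
# Hypothetical stand-in for the bytes read from a web page (assumed UTF-8).
raw = "Item \u2022 one".encode("utf-8")

# U+2022 occupies three bytes in UTF-8, so the undecoded data
# contains no single element that is "the bullet".
bullet_bytes = "\u2022".encode("utf-8")
print(bullet_bytes)        # b'\xe2\x80\xa2'
print(len(bullet_bytes))   # 3

# After decoding, the bullet is one code point in the text.
text = raw.decode("utf-8")
print("\u2022" in text)    # True
```

In Python 2 the same bytes live in a str object, which is why a replace call has to be made on the decoded unicode object with a unicode pattern.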

7 Answers

16

Use Unicode strings (note the u prefix on each literal).

>>> special = u"\u2022"
>>> abc = u'ABC•def'
>>> abc.replace(special,'X')
u'ABCXdef'

- RParadox

What is "special"? I get a NameError: name 'special' is not defined. - Rolando

@Rolando Notice the 'u' prefix in front of the string, which makes it a Unicode string. - igauravsehrawat

6

Try this one.

You will get the output as a normal string.

str.encode().decode('unicode-escape')

After that, you can perform any replacement.

str.replace('•','something')

- Rahul Kumar Gupta

This proves very useful when the \u sequences actually appear literally in the source string. - asu
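
To illustrate the case asu mentions, here is a small Python 3 sketch in which the text contains a literal backslash followed by u2022 rather than the bullet itself. The helper name `unescape_literal_u` is invented for this example, and the sketch assumes the input is pure ASCII, because the unicode-escape codec treats each input byte as Latin-1 and would mangle other non-ASCII text.

```python
def unescape_literal_u(text):
    """Turn literal \\uXXXX sequences into the characters they name.

    Assumption: text is pure ASCII. encode("ascii") raises
    UnicodeEncodeError otherwise, which is safer than silently
    corrupting non-ASCII characters.
    """
    return text.encode("ascii").decode("unicode-escape")


escaped = "ABC\\u2022def"          # backslash, 'u', '2', '0', '2', '2' as separate characters
unescaped = unescape_literal_u(escaped)

print(unescaped == "ABC\u2022def")        # True
print(unescaped.replace("\u2022", "*"))   # ABC*def
```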

3

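import re
# Match the bullet itself (U+2022), not the literal text "u'2022'"
regex = re.compile(u"\u2022", re.UNICODE)
# something is the replacement, yourstring the Unicode input; flags belong in re.compile
newstring = regex.sub(something, yourstring)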

- David

1

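It is not an asterisk, it is a bullet (a round dot). - Rolando

When I try re.sub(u'2022', varcontainingstring, ''), it makes the string blank, with nothing in it. - Rolando

@Damascusi Fixed. Please try it now. - David

@NullUserException Why is it a bad idea to use a regular expression to replace a fixed string? - Teodor Anton

@AntonTeodor A regular expression is less efficient than a simple string search and replace. It will still work, though. - NullUserException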
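
As a possible illustration of the trade-off discussed in these comments: a regular expression starts to pay off when several different characters should be replaced in one pass, while str.replace is enough for one fixed character. The Python 3 sketch below assumes a hypothetical set of bullet-like characters (U+2022, U+2023, U+25E6, U+2043); that set and the name `normalize_bullets` are chosen for the example and do not come from the thread.

```python
import re

# Assumed set of bullet-like characters: bullet, triangular bullet,
# white bullet, hyphen bullet.
BULLETS = re.compile("[\u2022\u2023\u25e6\u2043]")


def normalize_bullets(text, replacement="*"):
    """Replace every character in the assumed bullet set with replacement."""
    return BULLETS.sub(replacement, text)


sample = "\u2022 first \u25e6 second"

# The character class handles both kinds of bullet in one pass.
print(normalize_bullets(sample))        # * first * second

# A plain replace touches only the one character it is given.
print(sample.replace("\u2022", "*") == "* first \u25e6 second")  # True
```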

-1

str1 = "This is Python\u500cPool"

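Encode the string to ASCII, replacing every non-ASCII character with "?".

str1 = str1.encode("ascii", "replace")

Decode the byte stream back to a string.

str1 = str1.decode(encoding="utf-8", errors="ignore")

Replace the question marks with the desired character.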

str1 = str1.replace("?"," ")

- Ayushman Verma
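
One caveat with the approach above is that question marks already present in the text get replaced as well. The following Python 3 sketch compares it with a per-character filter. The helper `replace_non_ascii` and the sample string are made up for this comparison.

```python
def replace_non_ascii(text, replacement=" "):
    """Replace each character outside the ASCII range, leaving '?' alone."""
    return "".join(ch if ord(ch) < 128 else replacement for ch in text)


sample = "Why Python\u500cPool?"

# Encode/replace route: the genuine trailing '?' is lost too.
via_encode = sample.encode("ascii", "replace").decode("ascii").replace("?", " ")
print(repr(via_encode))                 # 'Why Python Pool '

# Per-character filter: only the non-ASCII character is replaced.
print(repr(replace_non_ascii(sample)))  # 'Why Python Pool?'
```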

-2

Funny that the answer is hidden in among the other answers.

str.replace("•", "something")

would work if you used the right semantics.

str.replace(u"\u2022","something")

works wonders, thanks to RParadox for the hint.

- Mafketel

-2

If you want to remove all \u characters, here is some code for reference.

def replace_unicode_character(self, content: str):
    content = content.encode('utf-8')
    if "\\x80" in str(content):
        count_unicode = 0
        i = 0
        while i < len(content):
            if "\\x" in str(content[i:i + 1]):
                if count_unicode % 3 == 0:
                    content = content[:i] + b'\x80\x80\x80' + content[i + 3:]
                i += 2
                count_unicode += 1
            i += 1

        content = content.replace(b'\x80\x80\x80', b'')
    return content.decode('utf-8')

- Khánh Pluto

- Fred Foo · Accepted Answer

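Decode the string to Unicode, assuming it is UTF-8 encoded:

str.decode("utf-8")

Call the replace method, and be sure to pass it a Unicode string as its first argument:

str.decode("utf-8").replace(u"\u2022", "*")

Encode back to UTF-8, if needed:

str.decode("utf-8").replace(u"\u2022", "*").encode("utf-8")

Fortunately, Python 3 puts a stop to this mess. The third step should only be performed just before I/O. Also note that naming a string str shadows the built-in type str.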
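
The three steps above can be sketched as one function. This is a Python 3 rendering, not the original Python 2 code: the function name `replace_bullets`, its parameters, and the sample data are assumptions made for the example, and UTF-8 is assumed as in the answer.

```python
def replace_bullets(raw, replacement="*", encoding="utf-8"):
    """Decode bytes, replace U+2022, and encode the result again."""
    text = raw.decode(encoding)                  # step 1: bytes -> text
    text = text.replace("\u2022", replacement)   # step 2: replace on text
    return text.encode(encoding)                 # step 3: text -> bytes, for I/O


# Hypothetical page content standing in for downloaded bytes.
page = "\u2022 one\n\u2022 two".encode("utf-8")
print(replace_bullets(page))   # b'* one\n* two'
```

If the result is only used inside the program, the final encode can be skipped and the decoded text kept as is.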