Removing emoji from a string in Python

Question

Removing emoji from a string in Python

python string unicode special-characters emoji

93

I found some Python code for removing emoji, but it doesn't work. Can you help with other code, or fix this one?

I noticed that all of my emoji start with \xf, but when I try str.startswith("\xf") I get an invalid character error.

emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)

Here is the error:

Traceback (most recent call last):
  File "test.py", line 52, in <module>
    re.sub(emoji_pattern,'',word)
  File "/usr/lib/python2.7/re.py", line 151, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Each item in the list can be a word: ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']

Update: I tried another piece of code:

emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
                                 |\
                                 [\U0001F300-\U0001F5FF] # symbols & pictographs\
                                 |\
                                 [\U0001F680-\U0001F6FF] # transport & map symbols\
                                 |\
                                 [\U0001F1E0-\U0001F1FF] # flags (iOS)\
                          " " ", re.VERBOSE)

emoji_pattern.sub('', word)

But this still doesn't remove the emoji; they are still displayed! Any clue why?

- Mona Jalal

3

Emoji characters are not limited to a single range (see this list of characters). - 一二三

1

Your emoji do not start with \xf. You are probably looking at the bytes of the string's UTF-8 representation, and the first byte is 0xf0. - roeland

1

Related: remove Unicode emoji using re in Python - jfs

Please see the following link, because the selected answer has a bug: https://stackoverflow.com/questions/52464119/removing-emoji-from-text-remove-also-japanese-langauge/52464600#52464600 - Sion C
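roeland's point about UTF-8 bytes can be checked directly. The sketch below assumes Python 3 (the question itself uses Python 2.7) and takes the byte string from the question's word list; the variable names are illustrative only.

```python
# The list item from the question, written as a Python 3 bytes literal.
raw = b'\xf0\x9f\x98\x82'

# "\xf" on its own is not a valid escape: \x must be followed by exactly
# two hex digits, so the first byte has to be written as \xf0.
first_byte = raw[0]

# Decoding the four UTF-8 bytes yields a single code point.
decoded = raw.decode('utf-8')

print(hex(first_byte))                     # 0xf0
print(len(raw), len(decoded))              # 4 1
print(hex(ord(decoded)))                   # 0x1f602
print(0x1F600 <= ord(decoded) <= 0x1F64F)  # True: inside the "emoticons" range
```

So a pattern written in terms of code points only has a chance of matching after the bytes have been decoded to text.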

27 Answers

70

Complete version for removing emoji
✍

import re
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through misc. symbols and arrows (not Chinese characters)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad: also spans CJK and many other scripts
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # everything outside the BMP
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', data)

- Karim Omaya

It works well, thanks. But for me it did not remove this icon: ⏪. - Abdel

1

This removes some Arabic letters and so breaks Arabic text. Please advise. - R.A

4

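This works, but: u"\U00002702-\U000027B0" is redundant, since u"\U000024C2-\U0001F251" already includes the ranges u"\U00002500-\U00002BEF" and u"\U00002702-\U000027B0". Also, u"\U00010000-\U0010ffff" already includes everything with 5 or more hex digits, and u"\u2600-\u2B55" already includes u"\u2640-\u2642". So this answer could be shorter and more concise. - lateus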
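The overlaps lateus lists are plain integer comparisons, and the same broad range also explains why non-emoji text can disappear. A minimal sketch, assuming Python 3; the sample string and variable names are illustrative and not part of the original answer.

```python
import re

# Ranges that sit entirely inside another range of the same pattern.
print(0x24C2 <= 0x2500 and 0x2BEF <= 0x1F251)  # True
print(0x24C2 <= 0x2702 and 0x27B0 <= 0x1F251)  # True
print(0x2600 <= 0x2640 and 0x2642 <= 0x2B55)   # True

# Side effect of the broad range on its own: it spans the CJK ideograph
# block, so Japanese or Chinese text is removed along with any emoji.
broad = re.compile("[\U000024C2-\U0001F251]+")
sample = "日本語 text"
cleaned = broad.sub("", sample)
print(repr(cleaned))  # ' text'
```

This is only a check of the ranges as written; which of them should stay depends on the scripts your text contains.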

56

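I have updated my answer because my previous answer did not account for other Unicode scripts such as Latin, Greek, etc. Stack Overflow does not allow me to delete my previous answer, so I am updating it to match the most accepted answer to the question.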

#!/usr/bin/env python
import re

text = u'This is a smiley face \U0001f602'
print(text) # with emoji

def deEmojify(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)

print(deEmojify(text))

This was my previous answer; please do not use it.

def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')

- Abdul-Razak Adam

37

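This strips all non-ASCII characters, and does so very inefficiently (why not just use inputString.encode('ascii', 'ignore').decode('ascii') and be done in a single step?). There is more to the Unicode standard than emoji; you can't strip Latin, Greek, Hangul, Myanmar, Tibetan, Egyptian, or any other Unicode-supported script just to remove emoji. - Martijn Pieters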

2

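@MonaJalal: That string isn't actually Unicode (it is the raw bytes of the UTF-8 encoding of actual Unicode text). Even when decoded, it contains no emoji (those bytes decode to right and left "smart quotes"). If this solves your problem, then your problem is not what your question asked about; this removes all non-ASCII characters (including something as simple as an accented e, é), not just emoji. - ShadowRanger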

This removes characters of other languages in addition to emoji. Is there another way to remove only the emoji? @MartijnPieters - Ishara Malaviarachchi

2

@IsharaMalaviarachchi: I wrote an answer to another question that covers how to remove emoji from multilingual Unicode text: Remove Emoji's from multilingual Unicode text. - Martijn Pieters

Hello, I used this to remove emoji, but since I am working with Turkish content it also removed Turkish letters such as ş, ı, ğ, ç, ö, ü. Is there a way to avoid that? - Berkehan
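The effect described in these comments is easy to reproduce. A minimal sketch, assuming Python 3; strip_non_ascii is a hypothetical name wrapping the same encode/decode call as the earlier answer, and the sample string is made up for illustration.

```python
def strip_non_ascii(text):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode('ascii', 'ignore').decode('ascii')

sample = "café ş \U0001F602"
result = strip_non_ascii(sample)
print(repr(result))  # 'caf  '
```

The emoji is gone, but so are é and ş, which is why this approach is unsuitable for anything other than pure ASCII text.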


26

If you don't want to use a regex, the best solution might be the emoji Python package. Here is a simple function that returns emoji-free text (thanks to this SO answer):

import emoji
def give_emoji_free_text(text):
    allchars = [str for str in text.decode('utf-8')]
    emoji_list = [c for c in allchars if c in emoji.UNICODE_EMOJI]
    clean_text = ' '.join([str for str in text.decode('utf-8').split() if not any(i in str for i in emoji_list)])
    return clean_text

If you are dealing with strings that contain emoji, this is straightforward.

>> s1 = "Hi  How is your  and . Have a nice weekend "
>> print s1
Hi  How is your  and . Have a nice weekend 
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend

If you are dealing with Unicode (as in @jfs's example), just encode it as UTF-8 first.

>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog

Edit

Based on the comments, it should be as easy as:

def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))

- kingmakerking

12

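The project does one better: it includes a regex generator function. Use emoji.get_emoji_regexp().sub(r'', text.decode('utf8')) and be done with it. Don't iterate over the characters one by one; that is very inefficient. - Martijn Pieters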

This does not work for ♕ ♔NAFSET ♕. Maybe these characters are not emoji. - heyxh

11

If text is already decoded, the code in the Edit will throw an error. In that case the return statement should be return emoji.get_emoji_regexp().sub(r'', text), dropping the unnecessary .decode('utf8'). - Pedram

4

The emoji package has a built-in function specifically for replacing emoji: emoji.replace_emoji(str, replace=''). - Ernest

19

Complete version for removing emoji:

import re
def remove_emoji(string):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

- Ali Tavakoli

Could you explain more specifically, by adding comments (as on the other lines), what extra you are providing? - malioboro

1

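This is not a perfect solution, because the Unicode 9.0 emoji are not included in the pattern. Nor are those of Unicode 10.0 or 11.0. You will have to keep updating the pattern. - Martijn Pieters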

@MartijnPieters See my answer below! - KevinTydlacka

@KevinTydlacka: That is not a good approach either. See my recent answer, which relies on a third-party library to provide an up-to-date regex. - Martijn Pieters

18

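If you are using the example from the accepted answer and still get "bad character range" errors, you are probably using a narrow build (see this answer for details). A reformatted version of the regex that seems to work is: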

emoji_pattern = re.compile(
    u"(\ud83d[\ude00-\ude4f])|"  # emoticons
    u"(\ud83c[\udf00-\uffff])|"  # symbols & pictographs (1 of 2)
    u"(\ud83d[\u0000-\uddff])|"  # symbols & pictographs (2 of 2)
    u"(\ud83d[\ude80-\udeff])|"  # transport & map symbols
    u"(\ud83c[\udde0-\uddff])"  # flags (iOS)
    "+", flags=re.UNICODE)

- scwagner
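The surrogate values in the pattern above follow from how UTF-16 encodes code points beyond U+FFFF. A short check, assuming Python 3.3 or later (which no longer has narrow builds); the variable names are illustrative.

```python
ch = "\U0001F602"

# Encode as big-endian UTF-16 and split into two 16-bit code units.
units = ch.encode("utf-16-be")
high = int.from_bytes(units[:2], "big")
low = int.from_bytes(units[2:], "big")

print(hex(high), hex(low))      # 0xd83d 0xde02
# The low surrogate falls in the emoticons sub-range \ude00-\ude4f.
print(0xDE00 <= low <= 0xDE4F)  # True
```

On a narrow Python 2 build the regex engine sees these two code units separately, which is why the pattern is written as a high surrogate followed by a range of low surrogates.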

16

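The accepted answer and others worked for me for a while, but I ultimately decided to strip all characters outside the Basic Multilingual Plane (BMP). This excludes future additions to the other Unicode planes (where emoji and the like live), which means I don't have to update my code every time new Unicode characters are added :).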

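In Python 2.7, convert your text to Unicode if it is not already, then use the negated regex below. It substitutes anything the character class does not list, and the class lists every BMP character except the surrogates, which are used in pairs to encode characters from the supplementary planes.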

NON_BMP_RE = re.compile(u"[^\U00000000-\U0000d7ff\U0000e000-\U0000ffff]", flags=re.UNICODE)
NON_BMP_RE.sub(u'', unicode(text, 'utf-8'))

- KevinTydlacka

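Thanks for sharing. The range above did not filter a character like this one: I don't even know what it is, since I can't see it in my browser; I hope it is nothing offensive :D - Teddy Markov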

This is the most robust answer. For Python 3, the last line becomes cleaned_text = NON_BMP_RE.sub(u"", text). - pir
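For reference, a Python 3 sketch of the same idea, with its main limitation made visible. It assumes the input is already a str; strip_non_bmp is a hypothetical helper name, and the pattern uses the re module's \uXXXX escapes instead of the \U literals in the answer.

```python
import re

# Matches anything that is NOT a BMP character outside the surrogate block,
# i.e. every supplementary-plane character (plus lone surrogates).
NON_BMP_RE = re.compile(r"[^\u0000-\uD7FF\uE000-\uFFFF]")

def strip_non_bmp(text):
    """Remove every character outside the Basic Multilingual Plane."""
    return NON_BMP_RE.sub("", text)

print(repr(strip_non_bmp("This dog \U0001F602")))  # 'This dog '
# U+23F0 (alarm clock) lives inside the BMP, so it is left untouched.
print(strip_non_bmp("\u23F0") == "\u23F0")         # True
```

The trade-off is the one visible in the comments: emoji that sit inside the BMP survive, and non-emoji characters from the supplementary planes are removed as well.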

9

I was able to get rid of the emoji in the following way:

Install emoji: https://pypi.org/project/emoji/

$ pip3 install emoji

import emoji

def remove_emoji(string):
    return emoji.get_emoji_regexp().sub(u'', string)

emojis = '(｀ヘ´) ⭕⭐⏩'
print(remove_emoji(emojis))

## Output result
(｀ヘ´)

- jojo

1

Getting an error: AttributeError: module 'emoji' has no attribute 'get_emoji_regexp' - FoundABetterName

1

@FoundABetterName Use emoji == 1.7.3. 'get_emoji_regexp' has been deprecated in recent versions. - ZooPanda

Please use emoji.replace_emoji(text, replace='') to replace the emoji in a text. - undefined

9

I tried to collect the complete list of Unicode ranges and used it to strip emoji from tweets; it works very well for me.

# Emojis pattern
emoji_pattern = re.compile("["
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                u"\U0001f926-\U0001f937"
                u'\U00010000-\U0010ffff'
                u"\u200d"
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                u"\u3030"
                u"\ufe0f"
    "]+", flags=re.UNICODE)

- Chiheb.K

Doesn't work for text = u'This dog \xe2\x80\x9d \xe2\x80\x9c' - Mona Jalal

print "\xe2\x80\x9d".decode("utf-8") gives ” and print "\xe2\x80\x9c".decode("utf-8") gives “. So are you asking how to remove emoji, or special characters?

- Chiheb.K

Does not remove ⏰ - Computer's Guy

I use this to remove all emoji from a Twitter stream. What is your case? Input, output? - Chiheb.K
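The byte sequences discussed in this exchange can be identified with the standard library. A small sketch, assuming Python 3; it decodes each sequence and looks up its Unicode name.

```python
import unicodedata

samples = (b'\xe2\x80\x9d', b'\xe2\x80\x9c', b'\xf0\x9f\x98\x82')

names = []
for raw in samples:
    ch = raw.decode('utf-8')
    names.append(unicodedata.name(ch))
    print(hex(ord(ch)), unicodedata.name(ch))

# 0x201d RIGHT DOUBLE QUOTATION MARK
# 0x201c LEFT DOUBLE QUOTATION MARK
# 0x1f602 FACE WITH TEARS OF JOY
```

The first two are typographic quotation marks, so an emoji pattern is not expected to touch them.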

8

Use the Demoji package, https://pypi.org/project/demoji/.

import demoji

text=""
emoji_less_text = demoji.replace(text, "")

- user9225268


- jfs · Accepted Answer

On Python 2, you have to use u'' string literals to create Unicode strings. Also, you should pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):

#!/usr/bin/env python
import re

text = u'This dog \U0001f602'
print(text) # with emoji

emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji

Output

This dog 😂
This dog

Note: emoji_pattern matches only some emoji (not all). See Which characters are emoji.
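To make that note concrete, here is the same pattern run under Python 3 (an assumption; the u prefix is then unnecessary) on a sample string chosen for illustration. It contains one character inside the listed ranges and two outside them.

```python
import re

emoji_pattern = re.compile("["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags=re.UNICODE)

# U+1F602 is covered; U+23F0 (alarm clock) and U+1F926 (face palm) are not.
sample = "This dog \U0001F602 \u23F0 \U0001F926"
cleaned = emoji_pattern.sub("", sample)
print(cleaned == "This dog  \u23F0 \U0001F926")  # True
```

Only the first emoji is removed, which matches the reports in the comments about characters such as ⏰ surviving.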