Split and count emojis and words in a given string in Python

Question

python python-3.x unicode counter emoji

3

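For a given string, I want to count how many times each word and each emoji appears. I have already done this here (link) for emojis that consist of a single emoji. The problem is that many current emojis are composed of several emojis.

For example, one displayed emoji can consist of four emojis joined by zero-width joiners (U+200D), and an emoji with a human skin tone consists of a base emoji followed by a skin-tone modifier.

The problem boils down to splitting the string correctly; counting is easy after that.

There are some good answers on this, such as link 1 and link 2, but none of them works as a general solution (either the solution is outdated, or I simply could not figure it out).

For example, if the string is hello [emoji A] emoji hello [emoji B], where [emoji A] and [emoji B] each stand for one multi-part emoji, then I want to get {'hello':2, 'emoji':1, '[emoji B]':1, '[emoji A]':1}. My strings come from WhatsApp and are all UTF-8 encoded.

I have tried many times without good results. Any help would be much appreciated!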

- sheldonzy

3 Answers

2

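Use the third-party regular expression module regex, which supports matching grapheme clusters (sequences of Unicode code points that are rendered as a single character) through \X: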

>>> import regex
>>> s='‍‍‍'
>>> regex.findall(r'\X',s)
['\u200d\u200d\u200d', '']
>>> for c in regex.findall(r'\X',s):
...     print(c)
... 
‍‍‍

Count them:

>>> data = regex.findall(r'\X',s)
>>> from collections import Counter
>>> Counter(data)
Counter({'\u200d\u200d\u200d': 1, '': 1})
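
To make concrete what \X groups together, here is a simplified sketch that uses only the standard library. It is an approximation and not the algorithm the regex module implements: it only attaches zero-width joiners, variation selectors, skin-tone modifiers, and combining marks to the preceding character, and it ignores other segmentation rules (flags built from regional indicators, for example). The name simple_clusters and the sample emoji, written as escapes, are illustrative assumptions.

```python
import unicodedata
from collections import Counter

ZWJ = "\u200d"  # zero-width joiner, glues emojis into one displayed emoji


def is_extender(ch):
    """True for code points that attach to the character before them."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F                        # variation selectors
        or 0x1F3FB <= cp <= 0x1F3FF                   # skin-tone modifiers
        or unicodedata.category(ch) in ("Mn", "Me")   # combining marks
    )


def simple_clusters(text):
    """Split text into approximate grapheme clusters (simplified rules)."""
    clusters = []
    join_next = False  # True when the previous code point was a ZWJ
    for ch in text:
        if clusters and (join_next or ch == ZWJ or is_extender(ch)):
            clusters[-1] += ch   # extend the current cluster
        else:
            clusters.append(ch)  # start a new cluster
        join_next = ch == ZWJ
    return clusters


# Assumed sample data: a four-person ZWJ sequence and a gesture with a skin tone.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"
gesture = "\U0001F645\U0001F3FD"

counts = Counter(simple_clusters(family + gesture))
print(len(counts))  # two clusters, although the string has nine code points
```

For real data the regex module's \X remains the safer choice, since it follows the full Unicode segmentation rules.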

- Mark Tolonen

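Thanks. What should I do when the string also contains text? When there are words in the string, it counts every letter as well. - sheldonzy

@sheldonzy That is harder because, as you can see, emojis are complicated: they are not made up strictly of code points from the Unicode emoji ranges. - Mark Tolonen

OK, thanks. I have added the complete function as another answer. I am not sure it is the best solution, but it works for now. - sheldonzy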

0

emoji.UNICODE_EMOJI is a dictionary with the following structure:

{'en': 
    {'🥇': ':1st_place_medal:',
     '🥈': ':2nd_place_medal:',
     '🥉': ':3rd_place_medal:' 
... }
}

Therefore, you need to use emoji.UNICODE_EMOJI['en'] for the code in the accepted answer to work.
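
The accepted answer tests membership directly in emoji.UNICODE_EMOJI, which only works when the dictionary maps emojis straight to names, while this answer describes a layout keyed by language. A small helper can accept either shape. This is a sketch: emoji_table is a hypothetical name, and the two dictionaries below are hand-written stand-ins, not the real contents of the emoji package.

```python
def emoji_table(unicode_emoji, language="en"):
    """Return the emoji -> name mapping for a flat or a language-keyed dict."""
    nested = unicode_emoji.get(language)
    if isinstance(nested, dict):
        return nested        # language-keyed layout: {'en': {emoji: name}}
    return unicode_emoji     # flat layout: {emoji: name}


# Hypothetical stand-ins for the two layouts (U+1F947 is the first-place medal).
flat = {"\U0001F947": ":1st_place_medal:"}
by_language = {"en": {"\U0001F947": ":1st_place_medal:"}}

print("\U0001F947" in emoji_table(flat))         # True
print("\U0001F947" in emoji_table(by_language))  # True
```

With such a helper, the membership test in split_count could be written against emoji_table(emoji.UNICODE_EMOJI), assuming the installed version of the package still exposes that attribute.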

- William Egesdal

Page content provided by Stack Overflow.

- sheldonzy · Accepted Answer

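Thanks to Mark Tolonen. Now, to count both the words and the emojis in a given string, I use emoji.UNICODE_EMOJI (from the emoji package) to decide what is an emoji and what is not, and then remove the emojis from the string.

The current answer is not perfect, but it works. I will edit it if anything changes.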

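import emoji
import regex
from collections import Counter

def split_count(text):
    total_emoji = []  # every emoji found, in order of appearance
    # \X matches one grapheme cluster, so a multi-code-point emoji stays whole
    data = regex.findall(r'\X', text)
    for word in data:
        # If UNICODE_EMOJI is keyed by language, use emoji.UNICODE_EMOJI['en'] here
        if any(char in emoji.UNICODE_EMOJI for char in word):
            total_emoji.append(word)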

    # Remove from the given text the emojis
    for current in total_emoji:
        text = text.replace(current, '') 

    return Counter(text.split() + total_emoji)


text_string = "here hello world hello‍‍‍"    
final_counter = split_count(text_string)

Output:

final_counter
Counter({'hello': 2,
         'here': 1,
         'world': 1,
         '\u200d\u200d\u200d': 1,
         '': 5,
         '': 1})
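
One possible variation, sketched here under stated assumptions, counts words and emojis in a single pass over the grapheme clusters instead of removing emojis with text.replace. That way an emoji typed directly after a word, as in the test string above, still ends the word. The function takes an already split list of clusters (in practice the result of regex.findall(r'\X', text)) so that the sketch needs only the standard library. The names count_words_and_emojis and looks_like_emoji are hypothetical, and the default predicate is a rough heuristic rather than the lookup in emoji.UNICODE_EMOJI that the answer uses.

```python
import unicodedata
from collections import Counter


def looks_like_emoji(cluster):
    """Rough heuristic (assumption): any code point in category 'So'."""
    return any(unicodedata.category(ch) == "So" for ch in cluster)


def count_words_and_emojis(clusters, is_emoji=looks_like_emoji):
    """Count whitespace-separated words and emoji clusters in one pass."""
    counts = Counter()
    word = []  # clusters of the word currently being built

    def flush():
        if word:
            counts["".join(word)] += 1
            word.clear()

    for cluster in clusters:
        if is_emoji(cluster):
            flush()               # an emoji ends the current word
            counts[cluster] += 1
        elif cluster.isspace():
            flush()               # whitespace ends the current word
        else:
            word.append(cluster)
    flush()
    return counts


# Assumed sample data, written as escapes; plain letters are their own clusters.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F466\u200d\U0001F466"
gesture = "\U0001F645\U0001F3FD"
clusters = list("here hello world hello") + [family, gesture]

result = count_words_and_emojis(clusters)
print(result["hello"])  # 2
```

Passing a different is_emoji callable, for example one backed by the emoji package, changes what counts as an emoji without touching the counting loop.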