Python- 如何在表格中显示常见单词并跳过特定单词

3

目前我正在对一个文本文件进行频率分析,以显示该文件中使用最常见的前100个单词。 目前我正在使用以下代码:

from collections import Counter
import re
words = re.findall(r'\w+', open('tweets.txt').read().lower())
print Counter(words).most_common (100)

上述代码能够正常运行,输出结果如下:
[('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]

然而,我想以表格形式显示它,其中包括标题“Word”和“Count”。我尝试使用prettytable包,并得到以下结果:

from collections import Counter
import re
import prettytable

words = re.findall(r'\w+', open('tweets.txt').read().lower())

for label, data in ('Word', words):
    pt = prettytable(field_names=[label, 'Count'])
    c = Counter(data)
    [pt.add_row(kv) for kv in c.most_common() [:100] ]
    pt.align [label], pt.align['Count'] = '1', 'r'
    print pt

我收到了一个 ValueError: too many values to unpack 的错误。我的问题是,我的代码有什么问题?是否有办法使用 prettytable 显示数据?另外,如何修复我的代码?

额外的问题:在计算频率时是否有办法跳过某些词?例如跳过单词:and、if、of 等等。

谢谢。


错误出现在哪一行?更新问题。 - Peter Wood
('Word', words) 是什么? - Peter Wood
错误在这一行:"for label, data in ('Word', words):" - Vin23
抱歉,我是Python的新手。Word是标题标签,“words”是单词本身(例如:they,make,get等等)。 - Vin23
2个回答

2
你是在尝试做什么吗?
from prettytable import PrettyTable

x = PrettyTable(["Words", "Counts"])

L = [('the', 1998), ('t', 1829), ('https', 1620), ('co', 1604), ('to', 1247), ('and', 1053), ('in', 957), ('a', 899), ('of', 821), ('i', 789), ('is', 784), ('you', 753), ('will', 654), ('for', 601), ('on', 574), ('thank', 470), ('be', 455), ('great', 447), ('hillary', 440), ('we', 390), ('that', 373), ('s', 363), ('it', 346), ('with', 345), ('at', 333), ('me', 327), ('are', 311), ('amp', 290), ('clinton', 288), ('trump', 287), ('have', 286), ('our', 264), ('realdonaldtrump', 256), ('my', 244), ('all', 237), ('crooked', 236), ('so', 233), ('by', 226), ('this', 222), ('was', 217), ('people', 216), ('has', 210), ('not', 210), ('just', 210), ('america', 204), ('she', 190), ('they', 188), ('trump2016', 180), ('very', 180), ('make', 180), ('from', 175), ('rt', 170), ('out', 169), ('he', 168), ('her', 164), ('makeamericagreatagain', 164), ('join', 161), ('as', 158), ('new', 157), ('who', 155), ('again', 154), ('about', 145), ('no', 142), ('get', 138), ('more', 137), ('now', 136), ('today', 136), ('president', 135), ('can', 134), ('time', 123), ('media', 123), ('vote', 117), ('but', 117), ('am', 116), ('bad', 116), ('going', 115), ('maga', 112), ('u', 112), ('many', 110), ('if', 110), ('country', 108), ('big', 108), ('what', 107), ('your', 105), ('cnn', 105), ('never', 104), ('one', 101), ('up', 101), ('back', 99), ('jobs', 98), ('tonight', 97), ('do', 97), ('been', 97), ('would', 94), ('obama', 93), ('tomorrow', 88), ('said', 88), ('like', 88), ('should', 87), ('when', 86)]


for e in L:
    x.add_row([e[0],e[1]])

print x

以下是结果:

+-----------------------+--------+
|         Words         | Counts |
+-----------------------+--------+
|          the          |  1998  |
|           t           |  1829  |
|         https         |  1620  |
|           co          |  1604  |
|           to          |  1247  |
|          and          |  1053  |
|           in          |  957   |
|           a           |  899   |
|           of          |  821   |
|           i           |  789   |
|           is          |  784   |
|          you          |  753   |
|          will         |  654   |
|          for          |  601   |
|           on          |  574   |
|         thank         |  470   |
|           be          |  455   |
|         great         |  447   |
|        hillary        |  440   |
|           we          |  390   |
|          that         |  373   |
|           s           |  363   |
|           it          |  346   |
|          with         |  345   |
|           at          |  333   |
|           me          |  327   |
|          are          |  311   |
|          amp          |  290   |
|        clinton        |  288   |
|         trump         |  287   |
|          have         |  286   |
|          our          |  264   |
|    realdonaldtrump    |  256   |
|           my          |  244   |
|          all          |  237   |
|        crooked        |  236   |
|           so          |  233   |
|           by          |  226   |
|          this         |  222   |
|          was          |  217   |
|         people        |  216   |
|          has          |  210   |
|          not          |  210   |
|          just         |  210   |
|        america        |  204   |
|          she          |  190   |
|          they         |  188   |
|       trump2016       |  180   |
|          very         |  180   |
|          make         |  180   |
|          from         |  175   |
|           rt          |  170   |
|          out          |  169   |
|           he          |  168   |
|          her          |  164   |
| makeamericagreatagain |  164   |
|          join         |  161   |
|           as          |  158   |
|          new          |  157   |
|          who          |  155   |
|         again         |  154   |
|         about         |  145   |
|           no          |  142   |
|          get          |  138   |
|          more         |  137   |
|          now          |  136   |
|         today         |  136   |
|       president       |  135   |
|          can          |  134   |
|          time         |  123   |
|         media         |  123   |
|          vote         |  117   |
|          but          |  117   |
|           am          |  116   |
|          bad          |  116   |
|         going         |  115   |
|          maga         |  112   |
|           u           |  112   |
|          many         |  110   |
|           if          |  110   |
|        country        |  108   |
|          big          |  108   |
|          what         |  107   |
|          your         |  105   |
|          cnn          |  105   |
|         never         |  104   |
|          one          |  101   |
|           up          |  101   |
|          back         |   99   |
|          jobs         |   98   |
|        tonight        |   97   |
|           do          |   97   |
|          been         |   97   |
|         would         |   94   |
|         obama         |   93   |
|        tomorrow       |   88   |
|          said         |   88   |
|          like         |   88   |
|         should        |   87   |
|          when         |   86   |
+-----------------------+--------+

编辑1: 如果你想省略某些内容,可以像这样做:

for e in L:
    if e[0]!="and" or e[0]!="if" or e[0]!="of":
        x.add_row([e[0],e[1]])

编辑2:总结:

涉及IT技术相关内容。
from collections import Counter
import re

words = re.findall(r'\w+', open('tweets.txt').read().lower())
counts = Counter(words).most_common (100)

from prettytable import PrettyTable

x = PrettyTable(["Words", "Counts"])

skip_list = ['and','if','or'] # see joe's comment

for e in counts:
    if e[0] not in skip_list:
        x.add_row([e[0],e[1]])

print x

是的,类似这样。但是有没有可能不要有长长的不同单词列表? - Vin23
你可以定义 skip_list = ['and', 'if', 'or']if e[0] not in skip_list: - Joe Bobson
当然,为什么我没想到这个……如果你想省略特定的单词,Joe的答案更好。 - Loïc Poncin
抱歉我必须承认,我真的不知道如何帮助你不使用列表,这是我第一次使用正则表达式和集合。 - Loïc Poncin
离题:我在想,你为什么这样做?我知道心理学家使用这种方法来研究某些人的行为。或者你只是对编程感兴趣吗? - Loïc Poncin
显示剩余5条评论

2
我不确定你希望你编写的for循环如何工作。你遇到的错误是因为你试图迭代具有两个元素的元组('Word', words)。语句for label, data in ('Word', words)试图将'W'分配给label'o'分配给data,并在第一次迭代结束时剩下'r''d'。也许你想要将项目一起压缩?但是为什么你要为每个单词创建一个新表呢?
以下是重写后的版本:
from collections import Counter
import re, prettytable

words = re.findall(r'\w+', open('tweets.txt').read().lower())
c = Counter(words)
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)
print pt

如果要跳过最常见的元素,可以在调用most_common之前从计数器中简单地丢弃它们。一种简单的方法是定义一个无效单词列表,然后使用字典推导式将它们过滤掉:

bad_words = ['the', 'if', 'of']
c = Counter({k: v for k, v in c.items() if k not in bad_words})

另外,您可以在将其转换为计数器之前对单词列表进行筛选:

words = filter(lambda x: x not in bad_words, words)

我更喜欢在计数器上操作,因为这样需要的工作量比较小,因为数据已经被聚合了。这是参考的组合代码:

from collections import Counter
import re, prettytable

bad_words = ['the', 'if', 'of']
words = re.findall(r'\w+', open('tweets.txt').read().lower())

c = Counter(words)
c = Counter({k: v for k, v in c.items() if k not in bad_words})

pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'
for row in c.most_common(100):
    pt.add_row(row)

print(pt)

我从你的代码中得到了一个错误。文件“test4.py”,第7行,在<module>中 pt.set_field_names(["Words",“Counts"]) 在“C:\ Python27 \ lib \ site-packages \ prettytable.py”中的第217行,__getattr__函数出现了AttributeError: set_field_names。 - Vin23
@Vin23。我已经修复了那个问题。 - Mad Physicist
@Vin23。这个库的文档有点过时,我的第一个版本就是基于那个文档开发的。 - Mad Physicist
这个答案只比loics多一个优点,就是它在跳过单词后制作了一个100个最常见单词的表格,而不是之前。 - Mad Physicist

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接