Counting word frequency and building a dictionary from it

10

I want to take every word from a text file and count the word frequency in a dictionary.

Example: 'this is the textfile, and it is used to take words and count'

d = {'this': 1, 'is': 2, 'the': 1, ...} 

I am not that far off, but I just can't see how to finish it. My code so far:

import sys

argv = sys.argv[1]
data = open(argv)
words = data.read()
data.close()
wordfreq = {}
for i in words:
    #there should be a counter and somehow it must fill the dict.

2
Start here: http://docs.python.org/2/library/collections.html#counter-objects. You will also need to use `split` to get the individual words and strip any punctuation; see http://docs.python.org/2/library/stdtypes.html#string-methods. - jonrsharpe
13 Answers

16

If you don't want to use collections.Counter, you can write the counting logic yourself:

import sys

filename = sys.argv[1]
fp = open(filename)
data = fp.read()
words = data.split()
fp.close()

unwanted_chars = ".,-_ (and so on)"
wordfreq = {}
for raw_word in words:
    word = raw_word.strip(unwanted_chars)
    if word not in wordfreq:
        wordfreq[word] = 0 
    wordfreq[word] += 1

If you need something finer-grained, look at regular expressions.


2
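If the OP wants the word frequency to be case-insensitive, the words should be converted to upper (or lower) case first. - amehta
If the word is not in wordfreq, set its frequency to 1. - Pranav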
1
@Pranav, I set it to 0 because it is incremented on the next line. - Don

13

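Although using Counter from the collections library, as @Michael suggested, is the better approach, I am adding this answer only to improve your code. (I believe it will be a good answer for a new Python learner.)

From the comment in your code it seems you want to improve your own code. I think you are able to read the file content as words (although I usually avoid read() and prefer code of the form for line in file_descriptor:).

Because words is a string, in the loop for i in words: the loop variable i is not a word but a single character. You are iterating over the characters of the string, not over its words. To see this, look at the following snippet: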

>>> for i in "Hi, h r u?":
...  print i
... 
H
i
,
 
h
 
r
 
u
?
>>> 

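Iterating character by character rather than word by word is not what you want. To iterate word by word, use the split method of Python's string class.
str.split([sep[, maxsplit]]) returns a list of the words in the string, using sep as the delimiter (splitting on any whitespace if it is not given) and optionally limiting the number of splits to maxsplit.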
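As a small illustration of the separator and split-limit arguments described above, here is a sketch assuming Python 3. The sample strings and variable names are made up for the example.

```python
text = "Hi, how are you?"

# No arguments: split on any run of whitespace and drop empty strings.
default_split = text.split()
print(default_split)   # ['Hi,', 'how', 'are', 'you?']

# Explicit separator: split on every comma, keeping empty pieces.
comma_split = "a,b,,c".split(",")
print(comma_split)     # ['a', 'b', '', 'c']

# A split limit of 1 performs at most one split.
limited = text.split(" ", 1)
print(limited)         # ['Hi,', 'how are you?']
```

Note that an explicit separator keeps empty strings, while the no-argument form does not. That is why the no-argument form is usually the more convenient one for word counting.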

Look at the code examples below:

Split:

>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?']

Loop with split:

>>> for i in "Hi, how are you?".split():
...  print i
... 
Hi,
how
are
you?

This looks close to what you need, except for the word "Hi,". By default split() splits on whitespace, so "Hi," is kept as a single string (obviously), and you don't want that.

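To count word frequencies in the file, one good solution is to use a regular expression. But first, to keep the answer simple, I will use the replace() method. str.replace(old, new[, count]) returns a copy of the string in which occurrences of old have been replaced with new, optionally limiting the number of replacements to count.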
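To make the optional third argument of replace() concrete, here is a short sketch assuming Python 3. The sample sentence is invented for the example.

```python
text = "Hi, how are you, really?"

# Replace every comma with a space, then split on whitespace.
all_replaced = text.replace(",", " ")
print(all_replaced.split())   # ['Hi', 'how', 'are', 'you', 'really?']

# The optional third argument caps the number of replacements.
first_only = text.replace(",", " ", 1)
print(first_only.split())     # ['Hi', 'how', 'are', 'you,', 'really?']

# Strings are immutable, so the original is unchanged.
print(text)                   # Hi, how are you, really?
```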

Now check the code examples below to see what I am suggesting:

>>> "Hi, how are you?".split()
['Hi,', 'how', 'are', 'you?'] # it has , with Hi
>>> "Hi, how are you?".replace(',', ' ').split()
['Hi', 'how', 'are', 'you?'] # , replaced by space then split

Loop:

>>> for word in "Hi, how are you?".replace(',', ' ').split():
...  print word
... 
Hi
how
are
you?

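Now, how to count the frequency:

One way is to use Counter as @Michael suggested, but you can also keep your approach of starting from an empty dict. See the code example below: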

words = f.read()
wordfreq = {}
for word in words.replace(',', ' ').split():
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
    #                ^^ add 1 to 0 or old value from dict 
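Because wordfreq is empty at the start, you cannot simply write wordfreq[word] = wordfreq[word] + 1 the first time a word appears; that raises a KeyError. So I used the dict method setdefault. dict.setdefault(key, default=None) is similar to get(), but it also sets dict[key] = default if the key is not already in the dict. So the first time a new word appears, I use setdefault to set it to 0, then add 1 and assign the result back to the same dict.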
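The difference between get() and setdefault() can be seen in a short sketch, assuming Python 3. The word list is made up for the example.

```python
wordfreq = {}

# get() returns the default but leaves the dictionary untouched.
print(wordfreq.get("hi", 0))         # 0
print(wordfreq)                      # {}

# setdefault() returns the default and also stores it under the key.
print(wordfreq.setdefault("hi", 0))  # 0
print(wordfreq)                      # {'hi': 0}

# The counting idiom: read-or-initialise, then add 1.
for word in ["hi", "there", "hi"]:
    wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print(wordfreq)                      # {'hi': 2, 'there': 1}
```

Since the result is assigned back to wordfreq[word] anyway, wordfreq.get(word, 0) + 1 would work equally well here.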
I have written equivalent code using with open instead of a bare open:
with open('file') as f:
    words = f.read()
    wordfreq = {}
    for word in words.replace(',', ' ').split():
        wordfreq[word] = wordfreq.setdefault(word, 0) + 1
print wordfreq

It runs like this:

$ cat file  # file is 
this is the textfile, and it is used to take words and count
$ python work.py  # indented manually 
{'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 
 'it': 1, 'to': 1, 'take': 1, 'words': 1, 
 'the': 1, 'textfile': 1}

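Using re.split(pattern, string, maxsplit=0, flags=0)

Just change the for loop to for i in re.split(r"[,\s]+", words): and it produces the correct output.

Edit: it is better to find all alphanumeric characters, because there may be more than one kind of punctuation symbol.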

>>> re.findall(r'[\w]+', words) # manually indent output  
['this', 'is', 'the', 'textfile', 'and', 
  'it', 'is', 'used', 'to', 'take', 'words', 'and', 'count']

Use it in the for loop as: for word in re.findall(r'[\w]+', words):

How I would write the code without read():

The file is:
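To show why re.findall is the more robust of the two, here is a brief comparison, sketched under the assumption of Python 3. The sample strings are invented for the example.

```python
import re

words = "Hi, how are you?"

# Splitting on commas/whitespace leaves other punctuation attached...
split_result = re.split(r"[,\s]+", words)
print(split_result)     # ['Hi', 'how', 'are', 'you?']

# ...and yields an empty string when the text ends with a separator.
trailing = re.split(r"[,\s]+", "a, b,\n")
print(trailing)         # ['a', 'b', '']

# Matching runs of word characters ignores every kind of punctuation.
found = re.findall(r"\w+", words)
print(found)            # ['Hi', 'how', 'are', 'you']
```

Keep in mind that \w also matches digits and the underscore, so tokens such as "file_1" are treated as a single word.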

$ cat file
This is the text file, and it is used to take words and count. And multiple
Lines can be present in this file.
It is also possible that Same words repeated in with capital letters.

The code is:

$ cat work.py
import re
wordfreq = {}
with open('file') as f:
    for line in f:
        for word in re.findall(r'[\w]+', line.lower()):
            wordfreq[word] = wordfreq.setdefault(word, 0) + 1
  
print wordfreq

lower() is used to convert upper-case letters to lower case.

Output:

$python work.py  # manually strip output  
{'and': 3, 'letters': 1, 'text': 1, 'is': 3, 
 'it': 2, 'file': 2, 'in': 2, 'also': 1, 'same': 1, 
 'to': 1, 'take': 1, 'capital': 1, 'be': 1, 'used': 1, 
 'multiple': 1, 'that': 1, 'possible': 1, 'repeated': 1, 
 'words': 2, 'with': 1, 'present': 1, 'count': 1, 'this': 2, 
 'lines': 1, 'can': 1, 'the': 1}
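The transcripts above use Python 2 print statements. For readers on Python 3, the same line-by-line idea might look like the sketch below. The function name count_words and the temporary demo file are assumptions made for illustration and are not part of the original answer.

```python
import os
import re
import tempfile

def count_words(path):
    """Return a dict mapping each lower-cased word in the file to its count."""
    wordfreq = {}
    with open(path, encoding="utf-8") as f:
        for line in f:  # one line at a time, never the whole file at once
            for word in re.findall(r"\w+", line.lower()):
                wordfreq[word] = wordfreq.get(word, 0) + 1
    return wordfreq

# Demo on a throwaway file so the sketch runs on its own.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("This is the text file.\nIt is also THIS file.\n")

result = count_words(tmp.name)
os.remove(tmp.name)
print(result["this"], result["is"], result["file"])  # 2 2 2
```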

4
Your explanation of how all of this fits together is excellent; this should be the accepted answer. - himanshuxd

11
from collections import Counter
t = 'this is the textfile, and it is used to take words and count'

dict(Counter(t.split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}

It is even better to remove punctuation before counting:

dict(Counter(t.replace(',', '').replace('.', '').split()))
>>> {'and': 2, 'is': 2, 'count': 1, 'used': 1, 'this': 1, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile': 1}

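Chaining replace() calls handles only the punctuation characters you list. One possible alternative, sketched here for Python 3, is to feed Counter with the output of re.findall; this combination is a suggestion and not part of the answer above.

```python
import re
from collections import Counter

t = 'this is the textfile, and it is used to take words and count'

# \w+ keeps only runs of letters, digits and underscores.
counts = Counter(re.findall(r"\w+", t.lower()))

print(counts["textfile"])   # 1
print(counts["missing"])    # 0, a Counter returns 0 for unseen keys
top_two = counts.most_common(2)
print(top_two)              # the two words that occur twice: 'is' and 'and'
```

Counter is a dict subclass, so the dict(...) conversion is only needed if a plain dict is specifically required.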
2
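The following splits the string into a list with split(), loops over it, and uses Python's count() to count the frequency of each item in the sentence. Each word i and its frequency are placed as a tuple in the initially empty list ls, which dict() then converts into key-value pairs.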
sentence = 'this is the textfile, and it is used to take words and count'.split()
ls = []  
for i in sentence:

    word_count = sentence.count(i)  # Pythons count function, count()
    ls.append((i,word_count))       


dict_ = dict(ls)

print dict_

Output: {'and': 2, 'count': 1, 'used': 1, 'this': 1, 'is': 2, 'it': 1, 'to': 1, 'take': 1, 'words': 1, 'the': 1, 'textfile,': 1}


7
This is very inefficient, because for every word it traverses the whole list again instead of making a single pass. The complexity goes from O(n) to O(n²). - Michael
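The point about complexity can be illustrated by placing the two approaches side by side. This is a sketch assuming Python 3; the names slow and fast are illustrative.

```python
words = 'this is the textfile, and it is used to take words and count'.split()

# Quadratic: list.count() rescans the whole list for every word.
slow = dict((w, words.count(w)) for w in words)

# Linear: one pass, one dictionary update per word.
fast = {}
for w in words:
    fast[w] = fast.get(w, 0) + 1

print(slow == fast)   # True
```

Both produce the same dictionary; the difference only becomes noticeable on large inputs.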

1
# Open your text file and count word frequency
with open("Counter.txt", 'r') as file_obj:
    w_list = file_obj.read()
print(w_list.split())
di = dict()
for word in w_list.split():
    if word in di:
        di[word] = di[word] + 1
    else:
        di[word] = 1

# Find the most frequently used word
largest = -1
maxusedword = ''
for k, v in di.items():
    print(k, v)
    if v > largest:
        largest = v
        maxusedword = k

print(maxusedword, largest)

1

You can also use a defaultdict with int as the default type.

from collections import defaultdict
wordDict = defaultdict(int)
text = 'this is the textfile, and it is used to take words and count'.split(" ")
for word in text:
    wordDict[word] += 1

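Explanation: we initialize a defaultdict whose values are of type int, so the default value for any key is 0 and we don't need to check whether a key is already in the dict. We then split the text on spaces into a list of words, iterate over that list, and increment each word's count.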
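The default-value behaviour described above can be checked directly. The following is a minimal sketch assuming Python 3, with a made-up sample sentence.

```python
from collections import defaultdict

word_counts = defaultdict(int)

# Reading a missing key calls int(), stores the resulting 0, and returns it.
print(word_counts["new"])    # 0
print(dict(word_counts))     # {'new': 0}

for word in "to be or not to be".split():
    word_counts[word] += 1

print(word_counts["to"], word_counts["be"], word_counts["or"])  # 2 2 1
```

One consequence worth knowing: merely looking up a word inserts it with a count of 0, so use the in operator or get() when you only want to test for presence.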


1
wordList = 'this is the textfile, and it is used to take words and count'.split()
wordFreq = {}

# Logic: word not in the dict, give it a value of 1. if key already present, +1.
for word in wordList:
    if word not in wordFreq:
        wordFreq[word] = 1
    else:
        wordFreq[word] += 1

print(wordFreq)

1
sentence = "this is the textfile, and it is used to take words and count"

# split the sentence into words.
# iterate through every word

counter_dict = {}
for word in sentence.lower().split():
    # add the word to counter_dict, initialized with 0
    if word not in counter_dict:
        counter_dict[word] = 0
    # increase its count by 1
    counter_dict[word] += 1

0
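My approach is to do a few things from scratch:
1. Remove punctuation from the text input.
2. Make a list of words.
3. Remove empty strings.
4. Iterate over the list.
5. Insert each new word into the dictionary as a key with the value 1.
6. If a word already exists as a key, increment its value by one.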
text = '''this is the textfile, and it is used to take words and count'''
word = '' #This will hold each word

wordList = [] #This will be collection of words
for ch in text: #traversing through the text character by character
#if character is between a-z or A-Z or 0-9 then it's valid character and add to word string..
    if (ch >= 'a' and ch <= 'z') or (ch >= 'A' and ch <= 'Z') or (ch >= '0' and ch <= '9'): 
        word += ch
    elif ch == ' ': #if character is equal to single space means it's a separator
        wordList.append(word) # append the word in list
        word = '' #empty the word to collect the next word
wordList.append(word)  #the last word to append in list as loop ended before adding it to list
print(wordList)

wordCountDict = {} #empty dictionary which will hold the word count
for word in wordList: #traverse through the word list
    if wordCountDict.get(word.lower(), 0) == 0: #if word doesn't exist then make an entry into dic with value 1
        wordCountDict[word.lower()] = 1
    else: #if word exist then increment the value by one
        wordCountDict[word.lower()] = wordCountDict[word.lower()] + 1
print(wordCountDict)

Another approach:

text = '''this is the textfile, and it is used to take words and count'''
for ch in '.\'!")(,;:?-\n':
    text = text.replace(ch, ' ')
wordsArray = text.split(' ')
wordDict = {}
for word in wordsArray:
    if len(word) == 0:
        continue
    else:
        wordDict[word.lower()] = wordDict.get(word.lower(), 0) + 1
print(wordDict)
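The character-by-character replace loop could also be expressed with str.translate, which does all the substitutions in one pass. This is an alternative sketch, not the answerer's method; it assumes Python 3 and treats everything in string.punctuation as a separator.

```python
import string

text = 'this is the textfile, and it is used to take words and count'

# Map every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))
cleaned = text.translate(table)

word_dict = {}
# split() with no argument skips empty strings, so no length check is needed.
for word in cleaned.lower().split():
    word_dict[word] = word_dict.get(word, 0) + 1

print(word_dict["textfile"], word_dict["and"])  # 1 2
```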

0

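You can also go with this approach. After reading the file, you first need to store the content of the text file in a string variable. This way you don't need to use or import any external libraries.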

s = "this is the textfile, and it is used to take words and count"

s = s.split(" ")
d = dict()
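for i in s:
  c = ""
  if i.isalpha():
    if i not in d:
      d[i] = 1
    else:
      d[i] += 1
  else:
    # keep only the alphabetic characters of tokens such as "textfile,"
    for j in i:
      if j.isalpha():
        c += j
    if not c:
      continue  # the token contained no letters at all
    if c not in d:
      d[c] = 1
    else:
      d[c] += 1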


print(d)


