Python: a dictionary of lists

Question

3

def makecounter():
     return collections.defaultdict(int)

class RankedIndex(object):
  def __init__(self):
    self._inverted_index = collections.defaultdict(list)
    self._documents = []
    self._inverted_index = collections.defaultdict(makecounter)


def index_dir(self, base_path):
    num_files_indexed = 0
    allfiles = os.listdir(base_path)
    self._documents = os.listdir(base_path)
    num_files_indexed = len(allfiles)
    docnumber = 0
    self._inverted_index = collections.defaultdict(list)

    docnumlist = []
    for file in allfiles: 
            self.documents = [base_path+file] #list of all text files
            f = open(base_path+file, 'r')
            lines = f.read()

            tokens = self.tokenize(lines)
            docnumber = docnumber + 1
            for term in tokens:  
                if term not in sorted(self._inverted_index.keys()):
                    self._inverted_index[term] = [docnumber]
                    self._inverted_index[term][docnumber] +=1                                           
                else:
                    if docnumber not in self._inverted_index.get(term):
                        docnumlist = self._inverted_index.get(term)
                        docnumlist = docnumlist.append(docnumber)
            f.close()
    print '\n \n'
    print 'Dictionary contents: \n'
    for term in sorted(self._inverted_index):
        print term, '->', self._inverted_index.get(term)
    return num_files_indexed
    return 0

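When I run this code I get an index error: list index out of range.

The code above builds a dictionary index that stores each term as a key and, as the value, a list of the numbers of the documents in which that term appears. For example, if the word "cat" appears in 1.txt, 5.txt and 7.txt, the dictionary will have: cat <- [1,5,7]

Now I have to modify it to add term frequencies, so if the word "cat" appears twice in document 1, three times in document 5 and once in document 7, the expected result is: term <- [[docnumber, term freq], [docnumber, term freq]] <-- a list of lists inside a dict! cat <- [[1,2],[5,3],[7,1]]

I have tried changing the code, but nothing has worked. I don't know how to modify this data structure to achieve the above.

Thanks in advance for your help.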

- csguy11

3 Answers

1

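Here is a general algorithm you can use, but you will need to adapt parts of your code to fit it. It produces a dict that contains, for each file, a dict of word counts.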

filedicts = {}
for file in allfiles:
  filedict = filedicts[file] = {}

  for term in terms:  # terms: the tokens read from this file
    filedict.setdefault(term, 0)
    filedict[term] += 1
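
A runnable sketch of the idea above, assuming each file has already been tokenized. The name tokens_by_file and the sample data are made up for illustration and stand in for reading and tokenizing real files.

```python
# Hypothetical input: file name -> list of tokens (stands in for tokenized files).
tokens_by_file = {
    "1.txt": ["cat", "dog", "cat"],
    "5.txt": ["cat", "cat", "cat"],
}

filedicts = {}
for file, terms in tokens_by_file.items():
    filedict = filedicts[file] = {}
    for term in terms:
        # Start each unseen term at 0, then count this occurrence.
        filedict.setdefault(term, 0)
        filedict[term] += 1

print(filedicts["1.txt"])
```

Note that this groups counts by file first and by term second, which is the reverse of the inverted index the question asks for.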

- mikerobi

0

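Maybe you could create a simple class to hold (document name, frequency).

Your dict could then hold lists of this new data type. You could also use a list of lists, but a separate data type would be cleaner.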
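
One possible shape for such a type, sketched with collections.namedtuple from the standard library. The name Posting and its field names are assumptions chosen for illustration, and document numbers are used in place of document names to match the question's example.

```python
import collections

# Hypothetical record type: one posting = (document number, term frequency).
Posting = collections.namedtuple("Posting", ["docnumber", "freq"])

index = collections.defaultdict(list)
index["cat"].append(Posting(docnumber=1, freq=2))
index["cat"].append(Posting(docnumber=5, freq=3))
index["cat"].append(Posting(docnumber=7, freq=1))

for posting in index["cat"]:
    # Fields are readable by name instead of by position.
    print("%d -> %d" % (posting.docnumber, posting.freq))
```

A namedtuple still behaves like a tuple, so list(posting) gives back the plain [docnumber, freq] pair if that exact layout is required.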

- JoshD


- Alex Martelli · Accepted Answer

First, use a factory. Start with:

def makecounter():
    return collections.defaultdict(int)

and later use

self._inverted_index = collections.defaultdict(makecounter)

Then, in the for term in tokens: loop, just do:

for term in tokens:
    self._inverted_index[term][docnumber] += 1

This leaves each self._inverted_index[term] holding a dict such as

{1:2,5:3,7:1}

in your example case. Since you want each self._inverted_index[term] to be a list of lists instead, add the following right after the loop ends:

self._inverted_index = dict((t, [[d, v[d]] for d in sorted(v)])
                            for t, v in self._inverted_index.items())
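
For reference, a self-contained sketch of both steps (counting, then conversion) on made-up data. The docs dict and the names inverted_index and as_lists are assumptions that stand in for the tokenized files and for self._inverted_index in the class.

```python
import collections

def makecounter():
    return collections.defaultdict(int)

# Hypothetical tokenized documents, keyed by document number.
docs = {
    1: ["cat", "dog", "cat"],
    5: ["cat", "cat", "cat"],
    7: ["cat"],
}

# Counting step: term -> {docnumber: frequency}.
inverted_index = collections.defaultdict(makecounter)
for docnumber, tokens in docs.items():
    for term in tokens:
        inverted_index[term][docnumber] += 1

# Conversion step: term -> [[docnumber, frequency], ...], sorted by docnumber.
as_lists = dict((t, [[d, v[d]] for d in sorted(v)])
                for t, v in inverted_index.items())

print(as_lists["cat"])  # [[1, 2], [5, 3], [7, 1]]
```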

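Once it is built (this way or any other; I'm just showing one simple way to build it!), this data structure will of course be as awkward to use as you are unnecessarily making it to build (a dict of dicts is much more useful, and easier both to use and to build), but, hey, one man's meat is another man's poison.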
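
To illustrate the usability point, here is a small comparison of looking up one term frequency in each layout. The helper names freq_nested and freq_listed are hypothetical, and the data is the example from the question.

```python
# Same data in both shapes (values taken from the example in the question).
nested = {"cat": {1: 2, 5: 3, 7: 1}}
listed = {"cat": [[1, 2], [5, 3], [7, 1]]}

def freq_nested(index, term, docnumber):
    # Two dict lookups; missing terms or documents count as 0.
    return index.get(term, {}).get(docnumber, 0)

def freq_listed(index, term, docnumber):
    # Has to scan the postings list for the matching document number.
    for d, freq in index.get(term, []):
        if d == docnumber:
            return freq
    return 0

print(freq_nested(nested, "cat", 5))
print(freq_listed(listed, "cat", 5))
```

Both functions return the same answers; the nested form simply gets there without a loop.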