Finding all unique words in a list using a loop

Question

3

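I am trying to build a list of unique words from a list of all the words extracted from a text file. My only problem is the algorithm used to iterate over the two lists.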

def getUniqueWords(allWords):
    uniqueWords = []
    uniqueWords.append(allWords[0])
    for i in range(len(allWords)):
        for j in range(len(uniqueWords)):
            if allWords[i] == uniqueWords[j]:
                pass
            else:
                uniqueWords.append(allWords[i])
                print uniqueWords[j]
    print uniqueWords
    return uniqueWords

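As you can see, I create an empty list and start iterating over both lists. I also append the first item of the list up front, because otherwise it never tries to match any words: in an empty list, list[0] does not exist. If anyone could help me work out how to iterate correctly so that I can produce the list of unique words, that would be great.

The print of uniqueWords[j] is only there for debugging, so that I can see the output while the list is being processed.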

- impactblu

4 Answers

2

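I don't like homework questions, because they push you toward poor algorithms. A set or a trie would be a better choice.

Only two small changes are needed to fix your program.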

def getUniqueWords(allWords):
    uniqueWords = []
    uniqueWords.append(allWords[0])
    for i in range(len(allWords)):
        for j in range(len(uniqueWords)):
            if allWords[i] == uniqueWords[j]:
                break
        else:
            uniqueWords.append(allWords[i])
            print uniqueWords[j]
    print uniqueWords
    return uniqueWords

First, you need to stop the inner loop as soon as you see that the word is already present.

        for j in range(len(uniqueWords)):
            if allWords[i] == uniqueWords[j]:
                break  # break out of the loop since you found a match

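The second change is to use a for/else construct instead of if/else, so that the word is appended only when the inner loop finishes without finding a match.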

        for j in range(len(uniqueWords)):
            if allWords[i] == uniqueWords[j]:
                break
        else:
            uniqueWords.append(allWords[i])
            print uniqueWords[j]
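
To make the for/else behaviour concrete, here is a minimal sketch in Python 3 syntax. The name `get_unique_words` and the sample list are my own, not from the question. The `else` branch belongs to the `for` loop and runs only when the loop ends without hitting `break`, so there is no need to seed the list with the first word:

```python
def get_unique_words(all_words):
    """Return the words of all_words without duplicates, in first-seen order."""
    unique_words = []
    for word in all_words:
        for seen in unique_words:
            if word == seen:
                break  # already recorded, stop scanning
        else:
            # Reached only if the inner loop never hit `break`,
            # i.e. no match was found (also the case when unique_words is empty).
            unique_words.append(word)
    return unique_words


print(get_unique_words(["a", "b", "c", "a", "b"]))  # ['a', 'b', 'c']
```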

- John La Rooy

Originally all I did was uniqueWords = set(allWords), followed by uniqueWords = list(uniqueWords). He also never covered tuples, which had let me cut x lines of code down to 3 in another assignment. - impactblu

1

Maybe you could use the collections.Counter class? (Especially if you also want to count how many times each word occurs in the source document.)

http://docs.python.org/2/library/collections.html?highlight=counter#collections.Counter

from collections import Counter

def getUniqueWords(allWords):
    uniqueWords = Counter()

    for word in allWords:
        uniqueWords[word] += 1  # count each occurrence
    return list(uniqueWords.keys())
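
As a small illustration of what the Counter gives you beyond the unique words, here is a hedged sketch in Python 3 syntax. The sample list is made up; `Counter` can also be built directly from the iterable instead of incrementing in a loop:

```python
from collections import Counter

words = ["a", "b", "c", "a", "b", "a"]
counts = Counter(words)  # counts every word in one pass

print(sorted(counts))         # the unique words: ['a', 'b', 'c']
print(counts["a"])            # occurrences of 'a': 3
print(counts.most_common(1))  # most frequent word: [('a', 3)]
```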

On the other hand, if you only need the unique words and not their counts, just use a set:

def getUniqueWords(allWords):
    uniqueWords = set()

    for word in allWords:
        uniqueWords.add(word)
    return uniqueWords  # the unique words as a set
    # or, if you want a list instead:
    # return list(uniqueWords)

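If you are restricted to loops and the input is fairly large, a loop plus binary search is better than a plain loop (note that the resulting list comes out sorted), like this: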

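def getUniqueWords(allWords):
    uw = []  # kept sorted at all times
    for word in allWords:
        (lo, hi) = (0, len(uw) - 1)
        m = -1
        # binary search for word in the sorted list uw
        while hi >= lo and m == -1:
            mid = lo + (hi - lo) // 2  # integer division
            if uw[mid] == word:
                m = mid
            elif uw[mid] < word:
                lo = mid + 1
            else:
                hi = mid - 1
        if m == -1:
            # not found: lo is the insertion point that keeps uw sorted
            m = lo
            uw = uw[:m] + [word] + uw[m:]
    return uw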

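If your input has around 100,000 words, the difference between this approach and the plain loop is that your computer won't make noise while running the program :)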

- Ashalynd

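I can only use loops. I know that using a set would make this 1000 times easier, but yeah. - impactblu

I see. In that case a sorted list is probably the best approach. - Ashalynd

Can't you use the bisect module? - John La Rooy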
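
The bisect suggestion in the comment above could look roughly like this. It is only a sketch in Python 3 syntax, and the function name `get_unique_words_sorted` is made up for illustration. `bisect.bisect_left` does the binary search from the standard library; as with the hand-written version, the result is sorted rather than in first-seen order, and each `insert` still shifts list elements:

```python
import bisect


def get_unique_words_sorted(all_words):
    """Return the unique words of all_words as a sorted list."""
    unique_words = []  # kept sorted at all times
    for word in all_words:
        # Leftmost position where word could be inserted to keep the list sorted.
        pos = bisect.bisect_left(unique_words, word)
        # If that position already holds word, it is a duplicate.
        if pos == len(unique_words) or unique_words[pos] != word:
            unique_words.insert(pos, word)
    return unique_words


print(get_unique_words_sorted(["b", "a", "c", "a", "b"]))  # ['a', 'b', 'c']
```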

0

You can use a set to get the unique words:

def getUniqueWords(allWords):
    uniqueWords = list(set(allWords))
    return uniqueWords

print(getUniqueWords(['a','b','c','a','b']))

Result (a set has no defined order, so the order may differ): ['c', 'a', 'b']
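
Because a set does not keep insertion order, the words can come back in any order. If first-seen order matters, one possible sketch (Python 3 syntax; the name `get_unique_words_ordered` is hypothetical) keeps a set only for fast membership checks and builds the list in a loop:

```python
def get_unique_words_ordered(all_words):
    """Return unique words in the order they first appear."""
    seen = set()        # fast membership test
    unique_words = []   # preserves first-seen order
    for word in all_words:
        if word not in seen:
            seen.add(word)
            unique_words.append(word)
    return unique_words


print(get_unique_words_ordered(["a", "b", "c", "a", "b"]))  # ['a', 'b', 'c']
```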

- Sidrah Madiha Siddiqui

- fabio · Accepted Answer

I'm not a Python expert, but I think this should work:

def getUniqueWords(allWords):
    uniqueWords = []
    for i in allWords:
        if i not in uniqueWords:  # linear scan of the list built so far
            uniqueWords.append(i)
    return uniqueWords

Edit:

I tested it and it works, returning only the unique words from the list:

def getUniqueWords(allWords):
    uniqueWords = []
    for i in allWords:
        if i not in uniqueWords:
            uniqueWords.append(i)
    return uniqueWords

print(getUniqueWords(['a','b','c','a','b']))

['a', 'b', 'c']