Sorting tuples alphabetically

Question


4

I am trying to sort a list of bigrams alphabetically using Python. My output currently looks like this:

('hello', 'how')
('how', 'are')
('are', 'you')
('you', '?')
('Are', 'you')
('you', 'okay')
('okay', '?')

I would like the output sorted alphabetically, with each bigram appearing only once, ideally along with a frequency count.

('are', 'you'), 2
('hello', 'how'), 1
('how', 'are'), 1
('okay', '?'), 1
('you', 'okay'), 1
('you', '?'), 1

My code looks like this:

def bigram(x):
    with open (x, 'r', encoding='utf-8') as f:
        mylist = f.read()
        n = 2
        grams = ngrams(nltk.word_tokenize(mylist), n)
        for bigrams in grams:
            return bigrams

I would appreciate any help, thanks!

- S.H

To sort tuples, use the sorted function with a key argument: sorted(list_of_tuples, key=lambda x: x[0]). - Arco Bast

Is there a particular reason you expect ('you', 'okay') to be printed before ('you', '?')? That does not match ASCII order. - Alfe

1

@Alfe Maybe the first element is sorted alphabetically and the second element in reverse alphabetical order? - Nick stands with Ukraine

@NickA Yes, that must be it. How could I forget? - Alfe
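
The ordering suggested in the comments above (first word ascending, second word descending) can be sketched with two stable sorts. This is only an illustration under the assumption that the asker really wants that ordering; the names `by_second_desc` and `ordered` are made up for the example.

```python
from collections import Counter

bigrams = [('hello', 'how'), ('how', 'are'), ('are', 'you'), ('you', '?'),
           ('Are', 'you'), ('you', 'okay'), ('okay', '?')]

# Lowercase every word so that ('Are', 'you') and ('are', 'you') count as one bigram.
counts = Counter(tuple(word.lower() for word in gram) for gram in bigrams)

# Python's sort is stable, so sort by the secondary key first
# (second word, descending) and then by the primary key (first word, ascending).
by_second_desc = sorted(counts.items(), key=lambda item: item[0][1], reverse=True)
ordered = sorted(by_second_desc, key=lambda item: item[0][0])

for gram, count in ordered:
    print(f"{gram}, {count}")
```

With this ordering ('you', 'okay') comes before ('you', '?'), as in the expected output in the question.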

3 Answers

2

After reading grams, you need to take the following steps:

First, convert everything to lowercase so that duplicates are easier to find:

grams = [ (a.lower(), b.lower()) for (a, b) in grams ]

Second, group and count the grams:

import collections
counted = collections.Counter(grams)

Third, sort the counted items:

for gram, count in sorted(counted.items()):
    print(gram, count)
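
The three steps above can be combined into one small function. The sketch below assumes the text is already tokenized into one list of words per sentence (it does not call nltk), and the function name `sorted_bigram_counts` is hypothetical. Note that it returns the whole sorted list, whereas a `return` inside the loop, as in the question's code, hands back only the first bigram.

```python
from collections import Counter

def sorted_bigram_counts(sentences):
    """Return (bigram, count) pairs in alphabetical order, ignoring case."""
    counts = Counter()
    for tokens in sentences:
        lowered = [token.lower() for token in tokens]
        # zip pairs each token with its successor, producing the bigrams.
        counts.update(zip(lowered, lowered[1:]))
    return sorted(counts.items())

# Sample input assumed for illustration: two pre-tokenized sentences.
sentences = [["hello", "how", "are", "you", "?"],
             ["Are", "you", "okay", "?"]]

result = sorted_bigram_counts(sentences)
for gram, count in result:
    print(gram, count)
```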

- Alfe

1

Yes, +1, or in one line: sorted(collections.Counter((a.lower(), b.lower()) for a, b in grams).items()). - Chris_Rands

@Alfe Thanks! This works when I split the list myself, but when I tokenize it with nltk.word_tokenize it does not work... I get an error: File "/Users/sohe/Desktop/bigram.py", line 16, in <listcomp> grams = [(a.lower(), b.lower()) for (a,b) in grams] ValueError: too many values to unpack (expected 2). Do you have any idea what causes this error? - S.H

This is caused by a tuple of the wrong size in the list (I don't know how it got there). You can use the following code to lowercase tuples of any size: grams = [ tuple(q.lower() for q in gram) for gram in grams ]. - Alfe
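
A small sketch of the situation described in this comment: unpacking into exactly two names fails on a longer tuple, while the size-agnostic comprehension does not. The three-word tuple here is invented for illustration, since the thread never establishes where the odd-sized tuple came from.

```python
# Hypothetical oversized tuple, standing in for whatever ended up in grams.
trigram = ('Are', 'you', 'okay')

try:
    a, b = trigram  # exactly two targets, three values
except ValueError as exc:
    message = str(exc)  # begins with "too many values to unpack"

grams = [('Hello', 'how'), trigram]
# Lowercases each word without assuming how many words a tuple holds.
lowered = [tuple(word.lower() for word in gram) for gram in grams]
print(lowered)
```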

0

Take a look at Counter and sorted. Use Counter to count the occurrences of each bigram, and sorted to order the bigrams and their counts alphabetically.

- Felix


- RaminNietzsche · Accepted Answer

First, you must convert all the data to lowercase:

L = [('hello', 'how'), ('how', 'are'), ('are', 'you') ,('you', '?'), ('Are', 'you') ,('you', 'okay') ,('okay', '?')]
L = [tuple(s.lower() for s in x) for x in L]

Then count the frequencies:

import collections
counter=collections.Counter(L)

Then you can sort the result:

print(collections.OrderedDict(sorted(counter.items())))
#OrderedDict([(('are', 'you'), 2), (('hello', 'how'), 1), (('how', 'are'), 1), (('okay', '?'), 1), (('you', '?'), 1), (('you', 'okay'), 1)])
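
To print one bigram per line in the "tuple, count" layout shown in the question, rather than a single OrderedDict, one possible sketch is the following. It reuses the sample list from this answer; the name `lines` is introduced only for the example.

```python
from collections import Counter

L = [('hello', 'how'), ('how', 'are'), ('are', 'you'), ('you', '?'),
     ('Are', 'you'), ('you', 'okay'), ('okay', '?')]

# Lowercase, count, then sort the (bigram, count) pairs by bigram.
counter = Counter(tuple(s.lower() for s in x) for x in L)
lines = [f"{gram}, {count}" for gram, count in sorted(counter.items())]

print("\n".join(lines))
```

The list returned by sorted is already in order, so wrapping it in an OrderedDict is only needed when a mapping is wanted afterwards.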