Count all elements in a nested list

6

I have a list of lists and want to build a DataFrame containing the count of every unique element. Here is my test data:

test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

I can get per-list counts with Counter and a for loop, like this:

from collections import Counter
for item in test:
     print(Counter(item))

But how do I combine the results of this loop into a new DataFrame?
The expected output is a DataFrame:
P1 P2 P3 P4
15 4  1  2
4 Answers

6

Here is one way to do it.

from collections import Counter
from itertools import chain

test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

c = Counter(chain.from_iterable(test))

for k, v in c.items():
    print(k, v)

# P1 15
# P2 4
# P3 1
# P4 2    

Output as a DataFrame:

import pandas as pd

df = pd.DataFrame.from_dict(c, orient='index').transpose()

#    P1 P2 P3 P4
# 0  15  4  1  2
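
If you would rather keep the for loop from the question, one option is to accumulate into a single Counter with update() instead of printing a new Counter per sublist. This is only a sketch; the names total, columns, and row are illustrative and not part of the answer above. Sorting the labels gives the column order shown in the expected output.

```python
from collections import Counter

test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

# One running Counter; update() adds each sublist's counts in place.
total = Counter()
for item in test:
    total.update(item)

# Sort the labels so they line up as P1, P2, P3, P4.
columns = sorted(total)
row = [total[label] for label in columns]

print(columns)  # ['P1', 'P2', 'P3', 'P4']
print(row)      # [15, 4, 1, 2]
```

The resulting total can be passed to the same DataFrame call as c above.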

3
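There is already a function you can import that does what yours does. It is from itertools import chain.from_iterable as concat - Ma0
2
@Ev.Kounis Actually that is not quite right; from itertools import chain as concat is possible. I agree their current one-liner is ugly, but otherwise it is a nice answer. (I made an edit, hope that is OK) - Chris_Rands
1
You don't need a loop to turn it into a DataFrame: pd.DataFrame.from_dict(c, orient='index').transpose() or, shorter: pd.DataFrame(c, index=[0]) - CodeZero
@StefanPochmann Well, originally they were renaming itertools.chain.from_iterable, a name I have always thought was too long. Anyway, I think they were referring to http://toolz.readthedocs.io/en/latest/api.html#toolz.itertoolz.concat. - Chris_Rands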

5

For better performance, you should use one of the following:

  • collections.Counter with itertools.chain.from_iterable as:

    >>> from collections import Counter
    >>> from itertools import chain
    
    >>> Counter(chain.from_iterable(test))
    Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
    
  • Or use collections.Counter with a list comprehension (one fewer import, since itertools is not needed, with similar performance) as:

    >>> from collections import Counter
    
    >>> Counter([x for a in test for x in a])
    Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
    

Read on for more alternatives and a performance comparison. Skip the rest if you don't need it.


Approach 1: Concatenate the sublists into a single list and find the counts using collections.Counter.

  • Solution 1: Concatenate list using itertools.chain.from_iterable and find the count using collections.Counter as:

    test = [
        ["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]
    ]
    
    from itertools import chain 
    from collections import Counter
    
    my_counter = Counter(chain.from_iterable(test)) 
    
  • Solution 2: Combine list using list comprehension as:

    from collections import Counter
    
    my_counter = Counter([x for a in test for x in a])
    
  • Solution 3: Concatenate list using sum

    from collections import Counter
    
    my_counter = Counter(sum(test, []))
    

Approach 2: Count the elements of each sublist using collections.Counter, then sum the resulting Counter objects.

  • Solution 4: Count objects of each sublist using collections.Counter and map as:

    from collections import Counter
    
    my_counter = sum(map(Counter, test), Counter())
    
  • Solution 5: Count objects of each sublist using list comprehension as:

    from collections import Counter
    
    my_counter = sum([Counter(t) for t in test], Counter())
    
In all of the solutions above, my_counter will hold the value:
>>> my_counter
Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
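
As a quick sanity check, the five solutions can be run side by side and compared against that value. This is a small sketch; the results dict and its labels are only for illustration.

```python
from collections import Counter
from itertools import chain

test = [
    ["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
    ["P1", "P1", "P1"],
    ["P1", "P1", "P1", "P2"],
    ["P4"],
    ["P1", "P4", "P2"],
    ["P1", "P1", "P1"]
]

# Each entry is one of the five solutions described above.
results = {
    "chain.from_iterable": Counter(chain.from_iterable(test)),
    "list comprehension": Counter([x for a in test for x in a]),
    "sum of lists": Counter(sum(test, [])),
    "sum of map(Counter)": sum(map(Counter, test), Counter()),
    "sum of Counter list": sum([Counter(t) for t in test], Counter()),
}

expected = Counter({'P1': 15, 'P2': 4, 'P4': 2, 'P3': 1})
for name, counter in results.items():
    # Counter equality compares the label -> count mapping.
    print(name, counter == expected)
```

Every line should print True, since the solutions differ only in how they get to the same mapping.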

Performance comparison

Below is the timeit comparison on Python 3 for 1000 sublists with 100 elements in each sublist:

  1. Fastest using chain.from_iterable (17.1 msec)

    mquadri$ python3 -m timeit "from collections import Counter; from itertools import chain; my_list = [list(range(100)) for i in range(1000)]" "Counter(chain.from_iterable(my_list))"
    100 loops, best of 3: 17.1 msec per loop 
    
  2. Second on the list is using list comprehension to combine the list and then do the Count (similar result as above but without the additional import of itertools) (18.36 msec)

    mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "Counter([x for a in my_list for x in a])"
    100 loops, best of 3: 18.36 msec per loop
    
  3. Third in terms of performance is using Counter on sublists within list comprehension : (162 msec)

    mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "sum([Counter(t) for t in my_list], Counter())"
    10 loops, best of 3: 162 msec per loop
    
  4. Fourth on the list is via using Counter with map (results are quite similar to the one using list comprehension above) (176 msec)

    mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "sum(map(Counter, my_list), Counter())"
    10 loops, best of 3: 176 msec per loop
    
  5. Solution using sum to concatenate the list is too slow (526 msec)

    mquadri$ python3 -m timeit "from collections import Counter; my_list = [list(range(100)) for i in range(1000)]" "Counter(sum(my_list, []))"
    10 loops, best of 3: 526 msec per loop
    
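The same comparison can be sketched with the timeit module instead of the command line. Unlike the commands above, where building my_list is passed as a statement, this version puts it in setup so that only the counting is timed. Absolute numbers depend on the machine and Python version, so no expected timings are shown; the statements and timings names are illustrative.

```python
import timeit

# Runs once per measurement and is not included in the timing.
setup = (
    "from collections import Counter\n"
    "from itertools import chain\n"
    "my_list = [list(range(100)) for i in range(1000)]"
)

statements = {
    "chain.from_iterable": "Counter(chain.from_iterable(my_list))",
    "list comprehension": "Counter([x for a in my_list for x in a])",
    "sum of Counter list": "sum([Counter(t) for t in my_list], Counter())",
    "sum of map(Counter)": "sum(map(Counter, my_list), Counter())",
    "sum of lists": "Counter(sum(my_list, []))",
}

timings = {}
for name, stmt in statements.items():
    # Best of 3 single runs keeps the sketch quick to execute.
    timings[name] = min(timeit.repeat(stmt, setup=setup, repeat=3, number=1))

# Print from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1000:.1f} msec")
```

The slow cases have a common cause: sum(my_list, []) builds a new, longer list at every step, and summing Counter objects builds a new Counter at every step, whereas the first two approaches make a single pass over the elements.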

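Yes, but then it does not sum the "Counters"; it runs a single Counter over the combined list. Still, I think I should mention it in the answer as well (a nice way to skip the itertools import with the same performance). - Moinuddin Quadri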

1

Here is another way to do it, using itertools.groupby:

>>> from itertools import groupby, chain

>>> out = [(k,len(list(g))) for k,g in groupby(sorted(chain(*test)))]
>>> out
[('P1', 15), ('P2', 4), ('P3', 1), ('P4', 2)]

To convert it to a dict:
>>> dict(out)
{'P2': 4, 'P3': 1, 'P1': 15, 'P4': 2}

To convert it to a DataFrame, use:
>>> import pandas as pd

>>> pd.DataFrame(dict(out), index=[0])
   P1  P2  P3  P4
0  15   4   1   2
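
The sorted() call matters here: groupby only groups consecutive equal items, so on unsorted input the same label shows up in several runs. The sketch below illustrates this; run_lengths is a hypothetical helper name, and it counts with sum(1 for _ in group) rather than len(list(g)) to avoid building a list per group.

```python
from itertools import chain, groupby

test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

def run_lengths(items):
    """Return (key, length) for each run of consecutive equal items."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(items)]

flat = list(chain.from_iterable(test))

unsorted_runs = run_lengths(flat)        # one entry per run, labels repeat
sorted_runs = run_lengths(sorted(flat))  # one entry per label

print(unsorted_runs[:3])  # [('P1', 3), ('P2', 2), ('P1', 2)]
print(sorted_runs)        # [('P1', 15), ('P2', 4), ('P3', 1), ('P4', 2)]
```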

0
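The function "set" keeps only the unique elements of a list. So with "len(set(mylist))" you get the number of unique elements in a list. Then you just need to iterate over the sublists.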
dict_nb_item = {}
i = 0
for test_item in test:
    dict_nb_item[i] = len(set(test_item))
    i += 1
print(dict_nb_item)

How does this produce the result the OP wants? - Ma0
This outputs {0: 3, 1: 1, 2: 2, 3: 1, 4: 3, 5: 1} (Python 3), which is clearly not what the OP expects. - rollstuhlfahrer
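
To make the difference concrete: a set can tell you how many distinct labels there are, but not how often each one occurs. A short sketch, using illustrative variable names, comparing it with Counter on the flattened list:

```python
from collections import Counter
from itertools import chain

test = [["P1", "P1", "P1", "P2", "P2", "P1", "P1", "P3"],
        ["P1", "P1", "P1"],
        ["P1", "P1", "P1", "P2"],
        ["P4"],
        ["P1", "P4", "P2"],
        ["P1", "P1", "P1"]]

flat = list(chain.from_iterable(test))

# A set discards duplicates, so only the number of distinct labels survives.
distinct_labels = set(flat)
print(len(distinct_labels))  # 4

# A Counter keeps the number of occurrences of each label.
per_label = Counter(flat)
print(sorted(per_label.items()))  # [('P1', 15), ('P2', 4), ('P3', 1), ('P4', 2)]
```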
