Find the most common string in a 2D list

Question

7

I have a 2D list:

arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]

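I want to find the most common element in the 2D list. In the example above, the most common string is 'Mohit'.

I know I could brute-force this with two for loops and a dictionary, but is there a more efficient way using numpy or any other library?

The nested lists may have different lengths.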
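
For reference, the brute-force baseline mentioned above might look like the following minimal sketch. The function name most_common_bruteforce is made up for illustration and does not appear in the original post; it handles nested lists of different lengths because it never indexes by position.

```python
def most_common_bruteforce(nested):
    """Tally every string with two loops and a plain dict, then pick the top one."""
    counts = {}
    for sublist in nested:          # outer loop: each inner list, any length
        for item in sublist:        # inner loop: each string in that list
            counts[item] = counts.get(item, 0) + 1

    best_item, best_count = None, 0
    for item, count in counts.items():
        if count > best_count:      # strict '>' keeps the first-seen item on ties
            best_item, best_count = item, count
    return best_item                # None when there are no strings at all


arr = [['Mohit', 'shini', 'Manoj', 'Mot'],
       ['Mohit', 'shini', 'Manoj'],
       ['Mohit', 'Vis', 'Nusrath']]
print(most_common_bruteforce(arr))  # Mohit
```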

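Could people also add timings for their methods, so the fastest one can be found? Please also mention any caveats under which a method may be less efficient.

Edit

Here are the timings of the different methods on my system:

import collections
from itertools import chain

#timgeb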
%%timeit
collections.Counter(chain.from_iterable(arr)).most_common(1)[0][0]
5.91 µs ± 115 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Kevin Fang and Curious Mind
%%timeit
flat_list = [item for sublist in arr for item in sublist]
collections.Counter(flat_list).most_common(1)[0]
6.42 µs ± 501 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
c = collections.Counter(item for sublist in arr for item in sublist).most_common(1)
c[0][0]
6.79 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Mayank Porwal
def most_common(lst):
    return max(set(lst), key=lst.count)
%%timeit
ls = list(chain.from_iterable(arr))
most_common(ls)
2.33 µs ± 42.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#U9-Forward
%%timeit
l=[x for i in arr for x in i]
max(l,key=l.count)
2.6 µs ± 68.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Mayank Porwal's method runs fastest on my system.

- Mohit Motwani

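The most occurrences across the entire 2D array. - Mohit Motwani

Is there any constraint relating the number of elements in a nested array (n) to the number of nested arrays (m)? That is, m >> n or n << m? - bigdata2

@bigdata2 Nothing very large. The 2D list is unlikely to be very large, and it won't contain many elements either. - Mohit Motwani

I'd advise against the name arr, because this is a list and the name arr usually suggests array.array or numpy.array. - timgeb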

1

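@MohitMotwani The timings depend somewhat on the length of the list and the number of unique elements in it. The max(set...) solution is very fast for lists with few unique elements. - timgeb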
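
A rough way to see this effect, offered only as a sketch: max(set(lst), key=lst.count) makes one full pass over the list per distinct value, whereas a Counter tallies everything in a single pass. The helper names and the synthetic test data below are illustrative assumptions, and the absolute numbers will differ from machine to machine.

```python
import timeit
from collections import Counter


def most_common_max(flat):
    # Calls flat.count once per distinct value: roughly O(n * k) work.
    return max(set(flat), key=flat.count)


def most_common_counter(flat):
    # Tallies every value in a single pass: roughly O(n) work.
    return Counter(flat).most_common(1)[0][0]


# Two lists of equal length (3001 items) with very different numbers of distinct values.
few_unique = ['a', 'b', 'c'] * 1000 + ['a']                   # 3 distinct values
many_unique = [str(i) for i in range(1000)] * 3 + ['0']       # 1000 distinct values

for label, data in [('few unique', few_unique), ('many unique', many_unique)]:
    for func in (most_common_max, most_common_counter):
        seconds = timeit.timeit(lambda: func(data), number=3)
        print(f'{label:12} {func.__name__:20} {seconds:.4f} s')
```

On the small list from the question there are only six distinct strings, which is the regime where the max(set(...)) approach tends to do well.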

5 Answers

4

Flatten the list with itertools.chain.from_iterable
Apply a Counter.

Demo:

>>> from itertools import chain
>>> from collections import Counter
>>> 
>>> lst = [['Mohit', 'shini','Manoj','Mot'],
...        ['Mohit', 'shini','Manoj'],
...        ['Mohit', 'Vis', 'Nusrath']]
>>> Counter(chain.from_iterable(lst)).most_common(1)[0][0]
'Mohit'

Details:

>>> list(chain.from_iterable(lst))
['Mohit',
 'shini',
 'Manoj',
 'Mot',
 'Mohit',
 'shini',
 'Manoj',
 'Mohit',
 'Vis',
 'Nusrath']
>>> Counter(chain.from_iterable(lst))
Counter({'Manoj': 2, 'Mohit': 3, 'Mot': 1, 'Nusrath': 1, 'Vis': 1, 'shini': 2})
>>> Counter(chain.from_iterable(lst)).most_common(1)
[('Mohit', 3)]
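
Note that when several strings share the highest count, most_common(1) reports only one of them (on Python 3.7+, the one encountered first). If every tied string is wanted, something along these lines could work; all_most_common is a hypothetical helper name, not part of collections.

```python
from collections import Counter
from itertools import chain


def all_most_common(nested):
    """Return every string that reaches the highest count, in first-seen order."""
    counts = Counter(chain.from_iterable(nested))
    if not counts:                      # no strings at all
        return []
    top = max(counts.values())          # the highest count observed
    return [item for item, n in counts.items() if n == top]


tied = [['Mohit', 'shini'], ['shini', 'Mohit'], ['Vis']]
print(all_most_common(tied))  # ['Mohit', 'shini']
```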

Some timings:

>>> lst = lst*100
>>> %timeit Counter(chain.from_iterable(lst)).most_common(1)[0][0] # timgeb
53.7 µs ± 411 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit max([x for i in lst for x in i], key=l.count) # U9-Forward
207 µs ± 389 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit Counter([x for sublist in lst for x in sublist]).most_common(1)[0][0] # Curious_Mind/Kevin Fang #1
75.2 µs ± 2.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit Counter(item for sublist in lst for item in sublist).most_common(1)[0][0] # Kevin Fang #2
95.2 µs ± 2.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit flat = list(chain.from_iterable(lst)); max(set(flat), key=flat.count) # Mayank Porwal
98.4 µs ± 178 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

(Note that Kevin Fang's second solution is slightly slower than his first, but more memory-efficient.)
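
The memory difference can be observed with the standard-library tracemalloc module. The sketch below is an assumption-laden illustration, not a benchmark from the original thread: peak_bytes, with_list and with_generator are made-up names, and the input is the example list repeated 10,000 times so the intermediate flat list is large enough to show up.

```python
import tracemalloc
from collections import Counter

nested = [['Mohit', 'shini', 'Manoj', 'Mot'],
          ['Mohit', 'shini', 'Manoj'],
          ['Mohit', 'Vis', 'Nusrath']] * 10_000


def peak_bytes(func):
    """Run func once and return (result, peak bytes allocated while it ran)."""
    tracemalloc.start()
    try:
        result = func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def with_list():
    # Builds the full flat list in memory before counting.
    return Counter([x for sub in nested for x in sub]).most_common(1)[0][0]


def with_generator():
    # Feeds items to Counter one at a time; no flat list is materialized.
    return Counter(x for sub in nested for x in sub).most_common(1)[0][0]


list_result, list_peak = peak_bytes(with_list)
gen_result, gen_peak = peak_bytes(with_generator)
print(list_result, list_peak, 'bytes peak (list comprehension)')
print(gen_result, gen_peak, 'bytes peak (generator expression)')
```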

- timgeb

1

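Yes, chain looks like the fastest way to flatten for this problem. - Kevin Fang

@timgeb Thanks, this works. Could you add the time your method takes? That would make it easier to compare and find the fastest one. - Mohit Motwani

1

@MohitMotwani All timings need to be run on the same machine to be comparable. I'll run some timings in a minute. - timgeb

@timgeb Much appreciated. Thanks. - Mohit Motwani

@MohitMotwani Sorry, the timings were a bit messed up... I've edited again. - timgeb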

2

Something like this:

In [920]: from itertools import chain
In [923]: arr = list(chain.from_iterable(arr)) ## flatten into 1-D array
In [922]: def most_common(lst):
     ...:     return max(set(lst), key=lst.count)

In [924]: most_common(arr)
Out[924]: 'Mohit'
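
One caveat with max(set(lst), key=lst.count): when several strings tie for the highest count, the winner depends on set iteration order, which for strings can differ between interpreter runs because of hash randomization. A possible variant, sketched here as an assumption rather than as part of the original answer, iterates over dict.fromkeys(lst) so that the first-seen string wins ties (this relies on dicts preserving insertion order, Python 3.7+). The name most_common_stable is hypothetical.

```python
def most_common_stable(lst):
    # dict.fromkeys(lst) keeps one key per distinct value, in first-seen order;
    # max() returns the first maximal item, so ties go to the earliest value.
    return max(dict.fromkeys(lst), key=lst.count)


flat = ['shini', 'Mohit', 'Mohit', 'shini', 'Vis']
print(most_common_stable(flat))  # shini
```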

Timing:

from itertools import chain
import time
start_time = time.time()

arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]


arr = list(chain.from_iterable(arr))
arr = arr*100

def most_common(lst):
    return max(set(lst), key=lst.count)

print(most_common(arr))
print("--- %s seconds ---" % (time.time() - start_time))

mayankp@mayank:~$ python t1.py 
Mohit
--- 0.000154972076416 seconds ---

- Mayank Porwal

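Thanks. This works. Could you add the time your method takes? That makes it easier to check which is fastest. - Mohit Motwani

@MohitMotwani I've edited in the timing. Also note the arr = arr*100 in my code. - Mayank Porwal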

2

Here is one way to do it:

import collections
import time
start_time = time.time()
arr = [['Mohit', 'shini','Manoj','Mot'],
      ['Mohit', 'shini','Manoj'],
      ['Mohit', 'Vis', 'Nusrath']]

c = collections.Counter([x for sublist in arr for x in sublist])
print(c.most_common(1) )
print("--- %s seconds ---" % (time.time() - start_time))

Time taken: 0.00016713142395 seconds

Demo: http://tpcg.io/NH3zjm

- A l w a y s S u n n y

1

@Curios_Mind Thanks. This works. Could you add the time your method takes? That makes it easier to check which is fastest. - Mohit Motwani

1

@MohitMotwani Edited - A l w a y s S u n n y

1

Or why not do this:

l=[x for i in arr for x in i]
max(l,key=l.count)

Code example:

>>> arr = [['Mohit', 'shini','Manoj','Mot'],
...       ['Mohit', 'shini','Manoj'],
...       ['Mohit', 'Vis', 'Nusrath']]
>>> l=[x for i in arr for x in i]
>>> max(l,key=l.count)
'Mohit'

- U13-Forward

- Kevin Fang · Accepted Answer

I'd suggest flattening the 2D array and then using a Counter to find the most common element.

flat_list = [item for sublist in arr for item in sublist]
from collections import Counter
Counter(flat_list).most_common(1)[0]
# ('Mohit', 3)
Counter(flat_list).most_common(1)[0][0]
# 'Mohit'

I'm not sure whether this is the fastest approach.

Edit:

@timgeb's answer uses itertools.chain to flatten the list faster.

@schwobaseggl suggested a more space-efficient way:

from collections import Counter
c = Counter(item for sublist in arr for item in sublist).most_common(1)
# [('Mohit', 3)]
c[0][0]
# 'Mohit'
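
Both snippets above index into the result of most_common(1), which raises IndexError when the 2D list contains no strings at all (an empty Counter returns an empty list). If that case can occur, a small wrapper along these lines may help; most_common_string and its default parameter are hypothetical names introduced here for illustration.

```python
from collections import Counter
from itertools import chain


def most_common_string(nested, default=None):
    """Return the most frequent string in a list of lists, or default if there is none."""
    counts = Counter(chain.from_iterable(nested))
    if not counts:                       # empty outer list, or only empty inner lists
        return default
    return counts.most_common(1)[0][0]


arr = [['Mohit', 'shini', 'Manoj', 'Mot'],
       ['Mohit', 'shini', 'Manoj'],
       ['Mohit', 'Vis', 'Nusrath']]
print(most_common_string(arr))        # Mohit
print(most_common_string([[], []]))   # None
```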