How to efficiently get the "maximum values" of a matrix

Question

Tags: python-3.x, pandas, performance, matrix, iteration

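I have the following problem: I used the pandas module to open a matrix in which every cell holds a number between -1 and 1. I want to find the largest "possible" value in a row that is not also the maximum of another row.

For example, if 2 rows have their maximum in the same column, I compare the two values and keep the larger one; then, for the row whose maximum is smaller than the other row's, I take its second-largest value (and run the same analysis again).

To explain myself better, consider my code.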

import numpy as np
import pandas as pd

matrix = pd.read_csv("matrix.csv")
# this matrix has an id (or name) for each column
# ... and the first column has the id of each row
results = pd.DataFrame(np.empty((len(matrix), 3), dtype=object), columns=['id1', 'id2', 'max_pos'])

l = len(matrix.columns)  # number of columns

again = True
while again:
    again = False
    for i in range(0, len(matrix)):
        max_column = str(1)  # start at the first data column; column '0' holds the row id
        for j in range(2, l):
            if matrix[max_column][i] < matrix[str(j)][i]:
                max_column = str(j)
        results.loc[i, 'id1'] = str(i)  # I could also put matrix['0'][i] here
        results.loc[i, 'id2'] = max_column
        results.loc[i, 'max_pos'] = matrix[max_column][i]

    for i in range(0, len(results)):  # now check whether two or more rows have the same max column
        for ii in range(0, len(results)):
            # if two id1 have their max in the same column, keep the one with the biggest
            # ... max value and change the other to -1 to iterate again
            if (results['id2'][i] == results['id2'][ii]) and (results['max_pos'][i] < results['max_pos'][ii]):
                matrix.loc[i, results['id2'][i]] = -1
                again = True

An example:

#consider
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[4, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})

   a  b  c  d
0  1  4  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1

#at the first iteration I will have the following result

0  b  4 # this means that the row 0 has its maximum at column 'b' and its value is 4
1  b  5
2  a  5
3  c  2

#the problem is that column b is the maximum of row 0 and 1, but I know that the maximum of row 1 is bigger than row 0, so I take the second maximum of row 0, then:

0  c  3
1  b  5
2  a  5
3  c  2

#now I solved the problem for row 0 and 1, but I have that the column c is the maximum of row 0 and 3, so I compare them and take the second maximum in row 3 

0  c  3
1  b  5
2  a  5
3  d  1

#now I'm done. If two rows have their maximum in the same column and the values are also equal, nothing happens and I keep those values.

#what if the matrix would be 
pd.DataFrame({'a':[1, 2, 5, 0], 'b':[5, 5, 1, 0], 'c':[3, 3, 4, 2], 'd':[1, 0, 0, 1]})

   a  b  c  d
0  1  5  3  1
1  2  5  3  0
2  5  1  4  0
3  0  0  2  1

#then, at the first iteration the result will be:

0  b  5
1  b  5
2  a  5
3  c  2

#then, given that the max value of row 0 and 1 is at the same column, I should compare the maximum values
# ... but in this case the values are the same (both are 5), this would be the end of iterating 
# ... because I can't choose between row 0 and 1 and the other rows have their maximum at different columns...
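The procedure walked through in the two examples above can be sketched with NumPy array operations in place of the nested Python loops. This is only a sketch under stated assumptions: `resolve_row_maxima` is a hypothetical name, the frame is assumed to contain only the numeric columns (no id column), and a lost cell is masked with `-inf` where the original code overwrites it with -1.

```python
import numpy as np
import pandas as pd

def resolve_row_maxima(df):
    """Per row, return the column label and value left after resolving conflicts."""
    a = df.to_numpy(dtype=float, copy=True)  # work on a copy; cells are masked below
    rows = np.arange(a.shape[0])
    while True:
        col = a.argmax(axis=1)               # column of each row's current maximum
        val = a[rows, col]                   # the maximum itself
        best = np.full(a.shape[1], -np.inf)
        np.maximum.at(best, col, val)        # largest value claimed for each column
        # a row loses when another row claims the same column with a larger value;
        # rows with no unmasked cell left are not retried
        losers = (val < best[col]) & np.isfinite(val)
        if not losers.any():
            break
        a[losers, col[losers]] = -np.inf     # mask the lost cell, so the next maximum is used
    return pd.DataFrame({
        'id1': df.index.to_numpy(),
        'id2': df.columns.to_numpy()[col],
        'max_pos': val,
    })

example = pd.DataFrame({'a': [1, 2, 5, 0], 'b': [4, 5, 1, 0],
                        'c': [3, 3, 4, 2], 'd': [1, 0, 0, 1]})
print(resolve_row_maxima(example))
```

Each pass is vectorized, but the number of passes still depends on how many conflicts have to be resolved. Ties are left untouched, as in the second example.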

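This code works for small datasets such as the 100x100 matrices I use, but if the matrix grows to 50,000x50,000 it takes a very long time to finish. I know my code is probably the least efficient way to do this, but I don't know how to approach the problem.

I have been reading about threads in Python, but launching 50,000 threads would not make my computer any more efficient. I also tried functions such as .max(), but I couldn't get the column of the maximum and compare it with the other maxima...

If anyone could help me or give me some advice on making the code more efficient, I would be very grateful.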

- hllspwn

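"I want to find the largest 'possible' value in a row that is also not the maximum of another row." - What happens when multiple rows share the same maximum? - Peter Leimbigler

For example, if column 3 holds the maximum of both row 2 and row 4, compare the two values. Say the value in row 2 is larger than the one in row 4: then row 2 keeps that maximum and row 4 takes its second-largest value (so another column becomes its maximum). If the values in rows 2 and 4 are equal, nothing changes. - hllspwn

Done. Sorry if I wasn't clear before; I hope the example helps. Thanks for the suggestion, @Matt W. - hllspwn

No need to apologize! Thanks for the clarification, I understand it better now. I'll take a look. - Matt W.

One more question - what happens if several rows share the same maximum value in the same column? In your example, replace the 4 in column b with 5 and walk through your logic. - Matt W.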

1 Answer

Content provided by Stack Overflow.

- Matt W. · Accepted Answer

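I need more information about this. What are you trying to achieve?

This will get you part of the way there, but to get you all the way I need more context.

We'll import numpy and random, plus Counter from collections: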

import numpy as np
import random 
from collections import Counter

We'll create a random 50k x 50k matrix of numbers between -10M and +10M.

mat = np.random.randint(-10000000,10000000,(50000,50000))

Now getting the maximum of each row only takes the following list comprehension:

maximums = [max(mat[x,:]) for x in range(len(mat))]
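As a possible alternative to the list comprehension, NumPy can reduce along an axis directly, which avoids calling Python's `max` once per row. The sketch below uses a small stand-in matrix (the 1000 x 1000 size, the fixed seed, and the name `max_columns` are assumptions for illustration); `argmax` additionally reports the column in which each row maximum sits.

```python
import numpy as np

rng = np.random.default_rng(0)
# small stand-in for the 50k x 50k matrix
mat = rng.integers(-10_000_000, 10_000_000, size=(1000, 1000))

maximums = mat.max(axis=1)        # maximum of every row, computed in one call
max_columns = mat.argmax(axis=1)  # column index of each row's maximum
```

Here `maximums` is a NumPy array rather than a list; `maximums.tolist()` converts it if a list is needed for the later steps.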

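Now we want to find which of these are not a maximum in any other row. We can use Counter on the list of maximums to find how many times each one occurs. Counter returns a counter object, which behaves like a dictionary with the maximum as key and the number of times it appears as value. We then write a dictionary comprehension that keeps only the entries whose value equals 1, which gives us the maximums that appear exactly once. We call .keys() to get the numbers themselves and convert the result to a list.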

c = Counter(maximums)
{9999117: 15,
9998584: 2,
9998352: 2,
9999226: 22,
9999697: 59,
9999534: 32,
9998775: 8,
9999288: 18,
9998956: 9,
9998119: 1,
...}

k = list( {x: c[x] for x in c if c[x] == 1}.keys() )

[9998253,
 9998139,
 9998091,
 9997788,
 9998166,
 9998552,
 9997711,
 9998230,
 9998000,
...]
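If the maximums are held in a NumPy array, the same "appears exactly once" filter can be sketched with `np.unique` in place of Counter. The toy values below are made up for illustration and are not taken from the run above.

```python
import numpy as np

maximums = np.array([8, 6, 8, 4, 6, 2, 1])  # toy row maxima

# unique values (sorted) together with how often each one occurs
values, counts = np.unique(maximums, return_counts=True)
k = values[counts == 1]  # maxima that occur in exactly one row
print(k)
```

Unlike the dictionary comprehension, this returns the values in sorted order rather than in order of first appearance.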

Finally, the following list comprehension iterates over the original list of maximums to get the indices of those rows.

indices = [i for i, x in enumerate(maximums) if x in k]
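Because `k` is a list, each `x in k` test scans it linearly; converting it to a set first, or using `np.isin` on an array, avoids that repeated scan. A minimal sketch with made-up toy values:

```python
import numpy as np

maximums = np.array([8, 6, 8, 4, 6, 2, 1])  # toy row maxima
k = np.array([1, 2, 4])                     # the maxima that occur exactly once

# boolean mask of rows whose maximum is in k, turned into row indices
indices = np.flatnonzero(np.isin(maximums, k))
print(indices)
```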

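Depending on what else you want to do, we can go from here.

It isn't the fastest program, but on an already-loaded 50,000 by 50,000 matrix, finding the maximums, the Counter, and the indices took 182 seconds.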