高效地查找2D numpy数组中正值的索引范围

Question

高效地查找2D numpy数组中正值的索引范围

5

我是一名有用的助手，可以为您翻译文本。

我有一个很大的numpy数组（通常为500,000x1024的级别，但可能更大），我正在尝试执行一些依赖于数组中正值位置的过程。一个非常小的示例数组可能是：

  [[ 0., 0., 0., 0., 0.,-1.,-1., 0., 0.],
   [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [ 0., 1., 1., 0., 0., 1., 5., 0., 0.],
   [ 0., 1., 1., 0., 0., 0., 1., 0., 0.],
   [ 0., 3., 1., 0., 0., 2., 1., 0., 0.],
   [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [ 0., 1., 0., 0., 0., 1., 1., 0., 0.],
   [ 0., 0., 0., 0., 0., 0., 0., 0., 0.]]

首先，需要替换每行中相距不超过三列的正数之间的任何零。因此，如果我用50替换这些数字，我的示例输出将是：

 [[ 0., 0., 0., 0., 0.,-1.,-1., 0., 0.],
  [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
  [ 0., 1., 1.,50.,50., 1., 5., 0., 0.],
  [ 0., 1., 1., 0., 0., 0., 1., 0., 0.],
  [ 0., 3., 1.,50.,50., 2., 1., 0., 0.],
  [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
  [ 0., 1., 0., 0., 0., 1., 1., 0., 0.],
  [ 0., 0., 0., 0., 0., 0., 0., 0., 0.]]

我需要做的第二件事是根据正值范围为每行编写一些信息。例如，使用我的修改后的数组，我需要能够为第三行编写一个声明col[1:7]中有正整数的语句，并为第四行声明col[1:3]和col[6]中有正整数的两个语句。

我已经成功地利用了numpy向量化方法来完成第一个任务，但最终仍然不得不循环遍历列和行（尽管只是在整个数组的子集上）。否则，我会替换给定行中的所有零，而不仅仅是正值之间的零。

但是，似乎没有办法在不使用循环遍历整个数组的情况下完成第二个任务。

for col in arr:
  for row in arr:

我猜我的总体问题是，是否有一种方法可以利用numpy中的矢量化方法来定义每行不同且依赖于下一列值的列索引范围？任何帮助将不胜感激。

- Danielle Dowle

4个回答

0

您可以使用高效的numpy迭代器，如flatiter或nditer

例如，对于您的第二个任务

In [1]: x = array([[ 0., 0., 0., 0., 0.,-1.,-1., 0., 0.],
   ...:            [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   ...:            [ 0., 1., 1.,50.,50., 1., 5., 0., 0.],
   ...:            [ 0., 1., 1., 0., 0., 0., 1., 0., 0.],
   ...:            [ 0., 3., 1.,50.,50., 2., 1., 0., 0.],
   ...:            [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   ...:            [ 0., 1., 0., 0., 0., 1., 1., 0., 0.],
   ...:            [ 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [2]: islands = []
   ...: fl = x.flat
   ...: while fl.index < x.size:
   ...:     coord = fl.coords
   ...:     if fl.next() > 0:
   ...:         length = 1
   ...:         while fl.next() > 0:
   ...:             length +=1
   ...:         islands.append([coord, length])

In [3]: for (row, col), length in islands:
   ...:     print 'row:%d ; col[%d:%d]' %(row, col, col+length)
row:2 ; col[1:7]
row:3 ; col[1:3]
row:3 ; col[6:7]
row:4 ; col[1:7]
row:6 ; col[1:2]
row:6 ; col[5:7]

- Vrekrer

-1

关于第一个问题：创建一个变量来保存你遇到的第一个正数的索引，并且使用if语句重置位置，如果下一个值是正数并且计数器（从第一个正数开始计算位置的变量）小于3。

关于第二个问题：创建一个数组并添加正数值位置的索引。

 String[] indices = new String[];
 int pos = 0;
 for col in arr:
     for row in arr:
        if(index is positive){
             indices[pos] = "[" + col + ":" + row + "]";
             pos++;
         }

- blaqksilhouette

谢谢您的回答，但这仍然需要使用for循环来循环遍历每一列和行，这正是我试图避免做的。我的数组很大，这需要很长时间。我希望有一种方法可以使用内置函数来完成这个任务，而不需要循环遍历数组。 - Danielle Dowle

你是如何创建数组的？你可以在技术上创建一个包含索引、值以及它是否为正数的对象数组列表。然后，你可以使用for循环来获取并返回所有你想要的元素。这种解决方案的时间复杂度为O(N)。假设你没有在第一次创建数组时使用嵌套的for循环。 - blaqksilhouette

数组的创建是完全独立的，但它们实际上代表了一个掩码，用于实际数据，这些数据保存在一个相同形状的单独数组中。 - Danielle Dowle

各种进程在数据上运行，这个“掩码”数组会随之更新。一旦所有操作完成，它将被传递给将运行上述操作的进程。您有如何实现您的建议的示例吗？由于它们与数据的关系，我不确定是否可以按照您的建议实际创建原始数组。数据结构即2D数组也代表有关数据本身的信息。它是频率x时间，当我写出语句时，它将包括所需的频率和时间。 - Danielle Dowle

第二种方法将使数据创建对象，假设您有一个类： - blaqksilhouette

-1

第二种方法是让数据创建对象，假设您有一个类：

public class Matrix{
   int indicex;
   int indicey;
   double val;
   boolean positiveInt;

   //default constructor
   public Matrix(int indicex, int indicey, double val, boolean positiveInt){
   this.indicex = indicex;
   this.indicey = indicey;
   this.val = val;
   this.positiveInt = positiveInt;
   }    

   //getter
   public boolean isPositive(){
        if(positiveInt == true){
              return true;
        }else{
            return false;
        }

在您的驱动程序类中，您将读取数据并创建一个新的Matrix对象（indexx，indexy，val，true/false）...然后将其放入一个ArrayList中，您可以在其中搜索正数。

List<Matrix> storeObjects = new ArrayList<Matrix>();
some method(){
   Matrix matrixObject = new Matrix(indexx, indexy, val, trueOrFalse);
   storeObjects.add(matrixObject)
 }

 for every object in store objects 
    if(object.isPositive()){
         put object in a separate array of positive objects
     }
  }

- blaqksilhouette

这很有道理，但仍需要对最终数组进行相当多的操作，以获取每行中的列（即每个时间的频率）以便在我的最终语句中编写。我对该建议的主要疑问是，由于我无法使用初始数据数组创建此矩阵，因此据我所知，我仍将不得不对整个数组进行逐元素循环，以首先将其解释为矩阵列表？ - Danielle Dowle

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- JustinFisher · Accepted Answer

很不幸，Numpy在不生成更多数组的情况下无法进行大量处理，因此我担心任何解决方案都需要手动循环，就像您一直在使用的那样，或者创建一个或多个额外的大数组。您可以使用numexpr想出一种非常快速和内存高效的解决方案。

这里提供了一种方法，虽然不一定内存高效，但至少所有循环都将由Numpy完成，因此应该比您一直在做的要快得多，只要它适合您的内存。(通过重写一些内容为原地操作，内存效率可能会得到改善，但我不会担心这个问题。)

以下是步骤1：

positive = x>0 # a boolean array marking the positive values in x

positive0 = positive[:,0:-3] # all but last 3 columns 
positive1 = positive[:,1:-2] # all but 1st and last 2 columns; not actually used
positive2 = positive[:,2:-1] # all but first 2 and last 1 columns
positive3 = positive[:,3:  ] # all but first 3 columns

# In the following, the suffix 1 indicates that we're viewing things from the perspective
# of entries in positive1 above.  So, e.g., has_pos_1_to_left1 will be True at
# any position where an entry in positive1 would be preceded by a positive entry in x

has_pos_1_to_left1 = positive0
has_pos_1_or_2_to_right1 = positive2 | positive3
flanked_by_positives1 = has_pos_1_to_left1 & has_pos_1_or_2_to_right1

zeros = (x == 0)       # indicates everywhere x is 0
zeros1 = zeros[:,1:-2] # all but 1st and last 2 columns

x1 = x[:,1:-2]         # all but 1st and last 2 columns

x1[zeros1 & flanked_by_positives1] = 50 # fill in zeros that were flanked - overwrites x!

# The preceding didn't address the next to last column, b/c we couldn't
# look two slots to the right of it without causing error.  Needs special treatment:
x[:,-2][ zeros[:,-2] & positive[:,-1] & (positive[:,-4] or positive[:,-3])] = 50

这是第二步：

filled_positives = x>0 # assuming we just filled in x
diffs = numpy.diff(filled_positives) # will be 1 at first positive in any sequence,
                                     # -1 after last positive, zero elsewhere

endings = numpy.where(diffs==-1) # tuple specifying coords where positive sequences end 
                                 # omits final column!!!
beginnings = numpy.where(diffs==1) # tuple specifying coords where pos seqs about to start
                                   # omits column #0!!!

使用这些起始和结束坐标来提取每一行的信息应该很简单，但请记住，这种差异检测方法只能捕获从非正到正或反之间的转换，因此它不会提及从第零列开始或以最后一列结尾的正序列，所以如果您需要这些内容，您需要单独查找非转换部分。