Reshaping an array according to a key

3

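I don't know the exact technical term for what I want to do, so I'll try to demonstrate it with an example:

I have two vectors of the same length, a and b, like so: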

In [41]:a
Out[41]:
array([ 0.61689215,  0.31368813,  0.47680184, ...,  0.84857976,
    0.97026244,  0.89725481])

In [42]:b
Out[42]:
array([35, 36, 37, ..., 36, 37, 38])

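a contains N floats, and b contains N elements: keys with 10 distinct values: 35, 36, 37, ..., 43, 44

I would like to obtain a new matrix M with 10 columns. The first column of M contains all the rows of a whose corresponding key in b is 35; the second column of M contains all the rows of a whose corresponding key in b is 36, and so on up to the tenth column of M.

I hope that is clear, thanks.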


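Is this a pivot table? - askewchan
Does b have an equal frequency for every key? If not, the columns of your result M will not all have the same length, so it cannot be stored as a plain numpy array or matrix. - askewchan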
1
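@askewchan: Oops, I see. The key frequencies are not the same. - luffe
@askewchan, HYRY's answer works for me; it gives me NaN where the key frequencies do not match. But out of curiosity, is there a simple Numpy function that does this when the key frequencies do match? Thanks. - luffe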
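For the equal-frequency case raised in the comments above, a minimal sketch (not taken from any of the answers) is a stable argsort on b followed by a reshape. The small arrays here are made-up illustration data, and the approach assumes every key occurs exactly the same number of times; otherwise the reshape raises an error.

```python
import numpy as np

# Hypothetical data: 3 distinct keys, each appearing exactly twice.
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
b = np.array([36, 35, 37, 35, 36, 37])

keys = np.unique(b)                    # sorted distinct keys, i.e. the column labels
order = np.argsort(b, kind="stable")   # stable sort keeps the order of a within each key
M = a[order].reshape(len(keys), -1).T  # one row per key after reshape, transposed into columns

print(keys)
print(M)
```

Column j of M then holds the values of a whose key is keys[j], in their original order.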
3 Answers

1

itertools.groupby can be used to group the values (after sorting). Using numpy arrays is optional.

import numpy as np
import itertools
N=50
# a = np.random.rand(50)*100
a = np.random.randint(0,100,N) # int to make printing more compact
b = np.random.randint(35,45, N)

# make structured array to easily sort both arrays together
dtype = np.dtype([('a',float),('b',int)])
ab = np.ndarray(a.shape,dtype=dtype)
ab['a'] = a
ab['b'] = b
# ab = np.sort(ab,order=['b']) # sorts both 'b' and 'a'
I = np.argsort(b,kind='mergesort') # preserves order
ab = ab[I]

# now group, and extract lists of lists
gp = itertools.groupby(ab, lambda x: x['b'])
xx = [list(x[1]) for x in gp]
#print np.array([[y[0] for y in x] for x in xx]) # list of lists

def filled(x):
    # pad every sublist with NaN up to the length of the longest one
    M = max(len(z) for z in x)
    return np.array([z+[np.nan]*(M-len(z)) for z in x])
print(filled([[y[1] for y in x] for x in xx]).T)  # the keys (b)
print(filled([[y[0] for y in x] for x in xx]).T)  # the values (a)

which produces:
[[ 35.  36.  37.  38.  39.  40.  41.  42.  43.  44.]
 [ 35.  36.  37.  38.  39.  40.  41.  42.  43.  44.]
 [ nan  36.  37.  nan  39.  40.  41.  42.  43.  44.]
 [ nan  36.  37.  nan  39.  40.  41.  42.  43.  44.]
 ...]

[[ 54.  69.  34.  28.  71.  53.  33.  19.  64.  56.]
 [ 90.  52.  11.   9.  50.  53.  25.  37.  69.  56.]
 [ nan  97.  31.  nan  69.  35.   2.  80.  91.  54.]
 [ nan  33.  87.  nan  47.  90.  81.  45.  86.  57.]
 ...]

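I am using a mergesort argsort to preserve the order of a within the sublists. Contrary to what I expected from the order parameter, np.sort sorted lexicographically on both b and a.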
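The difference between the two sorts can be seen on a tiny made-up example (the values below are hypothetical, chosen so that the tie-breaking is visible). As far as I understand the NumPy documentation, when order names only some fields, the remaining fields are still used to break ties, which is why np.sort reorders a within each key. kind="stable" is used here as the equivalent of the mergesort option in the answer.

```python
import numpy as np

a = np.array([9.0, 1.0, 5.0, 3.0])
b = np.array([36, 35, 35, 36])

# Structured array holding both vectors side by side.
ab = np.empty(a.shape, dtype=[("a", float), ("b", int)])
ab["a"] = a
ab["b"] = b

# Stable argsort on b alone: equal keys keep their original order in a.
stable = ab[np.argsort(b, kind="stable")]
# np.sort with order=['b']: ties on 'b' are broken by the remaining field 'a'.
by_order = np.sort(ab, order=["b"])

print(stable["a"])    # [1. 5. 9. 3.]
print(by_order["a"])  # [1. 5. 3. 9.]
```

The two results differ only inside the group with key 36, where the stable argsort keeps 9 before 3 and np.sort does not.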
Another approach is to use a Python dictionary, which also preserves the order of a. It may be slower on large arrays, but it hides fewer details.
import collections
d = collections.defaultdict(list)
for k,v in zip(b,a):
    d[k].append(v)
values = [d[k] for k in sorted(d.keys())]
print(filled(values).T)

0

You can use pandas:

import numpy as np
import pandas as pd

a = np.random.rand(50)
b = np.random.randint(10, 15, 50)

s = pd.Series(a)
s.groupby(b).apply(pd.Series.reset_index, drop=True).unstack(level=0)

The output is:

          10        11        12        13        14
0   0.465079  0.041393  0.692856  0.634328  0.179690
1   0.934678  0.746048  0.060014  0.072626  0.824729
2   0.388190  0.510527  0.078662  0.077157  0.291183
3   0.972033  0.761159  0.017317  0.104768  0.278871
4   0.750713  0.430246  0.083407  0.262037  0.487742
5   0.216965  0.482364  0.820535  0.207008  0.276452
6   0.282038  0.607303  0.675856  0.994369  0.602059
7   0.897106  0.398808  0.312332  0.751388  0.878177
8   0.229121       NaN       NaN  0.061288  0.032066
9   0.810678       NaN       NaN       NaN  0.718237
10  0.571125       NaN       NaN       NaN  0.668292
11  0.410750       NaN       NaN       NaN  0.288145
12  0.984507       NaN       NaN       NaN       NaN

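Fantastic, it works! But I think I'll need to spend an hour studying the Pandas docs to understand what that line of code actually does. Out of curiosity, could you briefly explain why it works? Thanks. - luffe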
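One way to read that one-liner is to split it into its three steps and look at the intermediate objects. This is a sketch of my understanding rather than the answerer's own explanation, the small arrays are made-up, and the exact shape of the intermediate result may vary between pandas versions.

```python
import numpy as np
import pandas as pd

a = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
b = np.array([11, 10, 11, 10, 11])

s = pd.Series(a)

# Step 1: split s into one sub-Series per distinct key in b.
grouped = s.groupby(b)

# Step 2: renumber each group from 0, discarding the original positions.
# The result is one Series with a two-level index: (key, position within group).
stacked = grouped.apply(pd.Series.reset_index, drop=True)

# Step 3: move the key level (level 0) from the rows to the columns.
# Positions that a shorter group does not have are filled with NaN.
M = stacked.unstack(level=0)

print(stacked)
print(M)
```

Here key 10 has two values and key 11 has three, so the column for key 10 ends with one NaN.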

0
Here is a way that does not use Pandas (so you need to keep track of the column labels yourself):
import numpy as np
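from itertools import zip_longest  # called izip_longest in Python 2
from collections import defaultdict

a = np.random.rand(50)
b = np.random.randint(10, 15, 50)
d = defaultdict(list)

# collect the values of a under their key from b, in order of appearance
for key_val, value in zip(b, a):
    d[key_val].append(value)

# sort the keys so the columns come out in key order, then pad short columns with NaN
output = np.asarray(list(zip_longest(*(d[k] for k in sorted(d)),
                                     fillvalue=np.nan)))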

print (a)
print (b)
print (output)

This produces:

a:

array([ 0.98688273,  0.95584584,  0.91011945,  0.56402919,  0.86185936,
        0.09380343,  0.69290659,  0.97238284,  0.81297425,  0.73446398,
        0.25927151,  0.44622982,  0.20537961,  0.61665218,  0.90168399,
        0.58556404,  0.47017152,  0.32278718,  0.15044929,  0.07859976,
        0.26715756,  0.38281878,  0.30169241,  0.47785937,  0.15377038,
        0.93395325,  0.79099068,  0.92471442,  0.03154578,  0.0437627 ,
        0.31711433,  0.78550517,  0.77062104,  0.76002167,  0.1842867 ,
        0.52935392,  0.16038216,  0.46510856,  0.4311615 ,  0.73923847,
        0.45499238,  0.2630405 ,  0.67722848,  0.1391463 ,  0.50800704,
        0.50618842,  0.19540159,  0.38150066,  0.82831838,  0.3383787 ])

b:

array([14, 10, 13, 12, 12, 13, 13, 12, 11, 10, 10, 13, 14, 12, 11, 12, 14,
       12, 12, 14, 11, 10, 13, 13, 13, 10, 14, 11, 13, 11, 11, 11, 12, 10,
       11, 11, 14, 12, 12, 14, 13, 10, 11, 14, 13, 11, 10, 11, 12, 12])

output:

array([[ 0.95584584,  0.81297425,  0.56402919,  0.91011945,  0.98688273],
       [ 0.73446398,  0.90168399,  0.86185936,  0.09380343,  0.20537961],
       [ 0.25927151,  0.26715756,  0.97238284,  0.69290659,  0.47017152],
       [ 0.38281878,  0.92471442,  0.61665218,  0.44622982,  0.07859976],
       [ 0.93395325,  0.0437627 ,  0.58556404,  0.30169241,  0.79099068],
       [ 0.76002167,  0.31711433,  0.32278718,  0.47785937,  0.16038216],
       [ 0.2630405 ,  0.78550517,  0.15044929,  0.15377038,  0.73923847],
       [ 0.19540159,  0.1842867 ,  0.77062104,  0.03154578,  0.1391463 ],
       [        nan,  0.52935392,  0.46510856,  0.45499238,         nan],
       [        nan,  0.67722848,  0.4311615 ,  0.50800704,         nan],
       [        nan,  0.50618842,  0.82831838,         nan,         nan],
       [        nan,  0.38150066,  0.3383787 ,         nan,         nan]])
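Since this approach returns a bare array, the column labels have to be kept alongside it. A minimal sketch of one way to do that, assuming the columns are built in sorted key order; the name labels and the small arrays are hypothetical, introduced only for illustration.

```python
import numpy as np
from collections import defaultdict
from itertools import zip_longest

a = np.array([0.5, 0.1, 0.9, 0.3, 0.7])
b = np.array([14, 10, 14, 12, 10])

d = defaultdict(list)
for key, value in zip(b, a):
    d[key].append(value)

# Column labels, in the same order as the columns of output.
labels = sorted(d)
output = np.asarray(list(zip_longest(*(d[k] for k in labels),
                                     fillvalue=np.nan)))

# Look up the column for a given key through its position in labels.
col_14 = output[:, labels.index(14)]

print(labels)
print(output)
print(col_14)
```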
