R's expand.grid() function in Python

Question

76

Is there a Python function similar to R's expand.grid() function? Thanks.

(EDIT) Here is the description of this R function, with an example.

Create a Data Frame from All Combinations of Factors

Description:

     Create a data frame from all combinations of the supplied vectors
     or factors.  

> x <- 1:3
> y <- 1:3
> expand.grid(x,y)
  Var1 Var2
1    1    1
2    2    1
3    3    1
4    1    2
5    2    2
6    3    2
7    1    3
8    2    3
9    3    3

(EDIT2) Below is an example using the rpy package. I would like to get the same output object, but without using R:

>>> from rpy import *
>>> a = [1,2,3]
>>> b = [5,7,9]
>>> r.assign("a",a)
[1, 2, 3]
>>> r.assign("b",b)
[5, 7, 9]
>>> r("expand.grid(a,b)")
{'Var1': [1, 2, 3, 1, 2, 3, 1, 2, 3], 'Var2': [5, 5, 5, 7, 7, 7, 9, 9, 9]}

Edit 02/09/2012: I am really lost with Python. The code Lev Levitsky gave in his answer does not work for me:

>>> a = [1,2,3]
>>> b = [5,7,9]
>>> expandgrid(a, b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in expandgrid
NameError: global name 'itertools' is not defined

Yet the itertools module appears to be installed (typing from itertools import * does not return any error message).

- Stéphane Laurent

7

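The people most likely to be able to help are Python users. Since they may not be familiar with R, could you provide a summary of what expand.grid does? A small example would be even better. - GSee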

1

@DavidRobinson The objects handled by the pandas Python package are very close to R data frames. Ideally, I would like to get such an object. - Stéphane Laurent

1

It looks like it is basically a Cartesian product, so if no standard solution turns up, it should not be hard to implement with itertools.product. - Lev Levitsky

2

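One disappointing thing is that this question uses a two-variable example, whereas R's expand.grid is much more general. I use it to quickly generate large arrays of complex factor levels. As a result, some answers address only the (x, y) case rather than any n inputs. - Hendy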

1

@Hendy itertools.product also works with three or more vectors. See my example under @Ahmed's answer. - Paul Rougieux

11 Answers

31

The product function from the itertools library is the key here. It produces the Cartesian product of its input arguments.

from itertools import product
import pandas as pd

def expand_grid(dictionary):
   return pd.DataFrame([row for row in product(*dictionary.values())], 
                       columns=dictionary.keys())

dictionary = {'color': ['red', 'green', 'blue'], 
              'vehicle': ['car', 'van', 'truck'], 
              'cylinders': [6, 8]}

>>> expand_grid(dictionary)
    color  cylinders vehicle
0     red          6     car
1     red          6     van
2     red          6   truck
3     red          8     car
4     red          8     van
5     red          8   truck
6   green          6     car
7   green          6     van
8   green          6   truck
9   green          8     car
10  green          8     van
11  green          8   truck
12   blue          6     car
13   blue          6     van
14   blue          6   truck
15   blue          8     car
16   blue          8     van
17   blue          8   truck

- Alexander

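Compared with numpy's meshgrid this is nice but very slow, and no faster than a list comprehension. For 3000x3000, np.array(list(product(range(3000), range(3000)))) takes 4.7 seconds, whereas np.meshgrid(range(3000), range(3000)) takes only 81 milliseconds. The list comprehension takes 6.8 seconds. Still, it is at least consistent with linear algebra terminology, which is nice. - Thomas Browne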

By comparison, how long does [row for row in product(*dictionary.values())] take? - Alexander

d = {1: range(3000), 2: range(3000)}; %timeit [r for r in product(*d.values())] ..... the answer is 1.68 seconds. Very good, the winner among the non-numpy options! And as a bonus it works with non-numeric values. - Thomas Browne

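Another benefit is that R's expand.grid provides column names, and none of the other answers do. In fact, I had been trying to write a dict version of the accepted answer for that very reason. Then I scrolled down and found you had already done it! Getting column names for free is great. - Hendy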

22

The pandas documentation defines an expand_grid function:

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=data_dict.keys())

For this code to run, you need the following two imports:

import itertools
import pandas as pd

The output is a pandas.DataFrame, the closest Python equivalent of an R data.frame.

- Daniel Himmelstein

21

Here is an example that produces output similar to what you need:

import itertools
def expandgrid(*itrs):
   product = list(itertools.product(*itrs))
   return {'Var{}'.format(i+1):[x[i] for x in product] for i in range(len(itrs))}

>>> a = [1,2,3]
>>> b = [5,7,9]
>>> expandgrid(a, b)
{'Var1': [1, 1, 1, 2, 2, 2, 3, 3, 3], 'Var2': [5, 7, 9, 5, 7, 9, 5, 7, 9]}

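The difference between the two is that in itertools.product the rightmost element advances on every iteration, whereas in R's expand.grid the first variable varies fastest. If it matters, you can adjust the function by reordering the product list.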

Edit (by S. Laurent)

To get the same result as R:

def expandgrid(*itrs): # https://dev59.com/eGct5IYBdhLWcg3wZMfn#12131385
    """
    Cartesian product. The inputs are reversed for compatibility with R.
    
    """
    product = list(itertools.product(*reversed(itrs)))
    return [[x[i] for x in product] for i in range(len(itrs))][::-1]
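
A minimal sketch of the same reversal idea that also keeps column names, in case that is useful. The function name expand_grid_r_order and its keyword-argument interface are illustrative assumptions, not part of the answer above; only itertools.product from the standard library is used.

```python
import itertools

def expand_grid_r_order(**columns):
    """Return a dict of columns in which the first column varies fastest, as in R."""
    names = list(columns)
    # Feed the inputs to product in reverse order, so the first column ends up
    # in the rightmost (fastest-changing) position.
    rows = itertools.product(*(columns[name] for name in reversed(names)))
    # Each row comes out in reversed column order; flip it back.
    rows = [row[::-1] for row in rows]
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}

grid = expand_grid_r_order(Var1=[1, 2, 3], Var2=[5, 7, 9])
print(grid)
```

With the inputs from the question, this should reproduce the dictionary returned by the rpy call, and it accepts any number of named inputs.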

- Lev Levitsky

@StéphaneLaurent Did you run import itertools before using itertools.product? - Lev Levitsky
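
The NameError in the question is consistent with this: from itertools import * binds the module's public names (such as product) but not the name itertools itself. A small sketch, using exec on a scratch dictionary so the effect can be checked in isolation (the variable names are illustrative only):

```python
# Run the star import in an empty namespace and inspect what it defines.
namespace = {}
exec("from itertools import *", namespace)
has_product = "product" in namespace          # the function name is bound
has_module_name = "itertools" in namespace    # the module name is not
print(has_product, has_module_name)

# A plain import is what makes itertools.product(...) resolvable.
exec("import itertools", namespace)
has_module_name_after = "itertools" in namespace
print(has_module_name_after)
```

So a function body that refers to itertools.product needs import itertools to have been executed in the same session or module.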

18

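I had been wondering about this for a while, and none of the solutions proposed so far satisfied me, so I came up with my own, which is much simpler (though probably slower). The function uses numpy.meshgrid to build the grid, then flattens the grids into 1d arrays and combines them: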

def expand_grid(x, y):
    xG, yG = np.meshgrid(x, y) # create the actual grid
    xG = xG.flatten() # make the grid 1d
    yG = yG.flatten() # same
    return pd.DataFrame({'x':xG, 'y':yG}) # return a dataframe

For example:

import numpy as np
import pandas as pd

p, q = np.linspace(1, 10, 10), np.linspace(1, 10, 10)

def expand_grid(x, y):
    xG, yG = np.meshgrid(x, y) # create the actual grid
    xG = xG.flatten() # make the grid 1d
    yG = yG.flatten() # same
    return pd.DataFrame({'x':xG, 'y':yG})

print(expand_grid(p, q).head(n=20))

I know this is an old post, but I wanted to share my simple version!

- Nate

3

For an arbitrary number of arguments: def expand_grid(*args): mesh = np.meshgrid(*args); return pd.DataFrame(m.flatten() for m in mesh). That is, it expands a grid from the input arguments and returns the result as a pandas DataFrame. - Richard Border

12

Based on the solutions above, I did this:

import itertools
import pandas as pd

a = [1,2,3]
b = [4,5,6]
ab = list(itertools.product(a,b))
abdf = pd.DataFrame(ab,columns=("a","b"))

Here is the output

- Ahmed Attia

1

Thanks, itertools.product also works nicely with three vectors: numpy.array(list(itertools.product([0,1], [0,1], [0,1]))). - Paul Rougieux

Hi Paul, do you know how to use the last list to weight the edges between the first and second vectors? Thanks - BlindSide

6

The ParameterGrid function from scikit-learn does the same thing as expand.grid in R.

For example:

from sklearn.model_selection import ParameterGrid
param_grid = {'a': [1,2,3], 'b': [5,7,9]}
expanded_grid = ParameterGrid(param_grid)

You can access the contents by converting it to a list:

list(expanded_grid)

Output:

[{'a': 1, 'b': 5},
 {'a': 1, 'b': 7},
 {'a': 1, 'b': 9},
 {'a': 2, 'b': 5},
 {'a': 2, 'b': 7},
 {'a': 2, 'b': 9},
 {'a': 3, 'b': 5},
 {'a': 3, 'b': 7},
 {'a': 3, 'b': 9}]

Access elements by index:

list(expanded_grid)[1]

You will get something like this:

{'a': 1, 'b': 7}

Just to add some usage: you can pass each dictionary in a list like the one printed above to a function as **kwargs. For example:

def f(a,b): return((a+b, a-b))
list(map(lambda x: f(**x), list(expanded_grid)))

Output:

[(6, -4),
 (8, -6),
 (10, -8),
 (7, -3),
 (9, -5),
 (11, -7),
 (8, -2),
 (10, -4),
 (12, -6)]
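
If scikit-learn is not available, a similar list of dictionaries can be sketched with itertools.product alone. The helper name param_dicts is hypothetical, and sorting the keys is a choice made here for a deterministic order, not a claim about how ParameterGrid is implemented.

```python
from itertools import product

def param_dicts(param_grid):
    """Return one dict per combination of the values in param_grid (hypothetical helper)."""
    keys = sorted(param_grid)  # assumption of this sketch: fixed, sorted key order
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]

def f(a, b):
    return (a + b, a - b)

combos = param_dicts({'a': [1, 2, 3], 'b': [5, 7, 9]})
results = [f(**combo) for combo in combos]
print(combos[1])
print(results[:3])
```

Each dictionary can be unpacked with ** exactly as in the example above.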

- Vinícius .Lopes

Could you provide an example similar to the one in the original post? - Paul Rougieux

Hi Paul, it took me a while to edit, but I hope the example is now fully clear, with lists matching the original question. - Vinícius .Lopes

4

Here is another version that returns a pandas.DataFrame:

import itertools as it
import pandas as pd

def expand_grid(*args, **kwargs):
    columns = []
    lst = []
    if args:
        columns += range(len(args))
        lst += args
    if kwargs:
        columns += kwargs.keys()
        lst += kwargs.values()
    return pd.DataFrame(list(it.product(*lst)), columns=columns)

print(expand_grid([0,1], [1,2,3]))
print(expand_grid(a=[0,1], b=[1,2,3]))
print(expand_grid([0,1], b=[1,2,3]))

- snth

4

pyjanitor's expand_grid() is arguably the most natural solution, especially if you come from an R background.

To use it, set the others argument to a dictionary. The items in the dictionary can have different lengths and types. The return value is a pandas DataFrame.

import janitor as jn

jn.expand_grid(others = {
    'x': range(0, 4),
    'y': ['a', 'b', 'c'],
    'z': [False, True]
})

- Richie Cotton

0

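Have you tried product from itertools? In my view it is much easier than some of the other approaches (pandas and meshgrid excepted). Keep in mind that this setup pulls every item from the iterator into a list and then converts it to an ndarray, so be careful with higher-dimensional grids, or drop np.asarray(list(points)) to avoid running out of memory; you can then pull specific combinations from the iterator as needed. I highly recommend meshgrid for this, though: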

#Generate square grid from axis
from itertools import product
import numpy as np
a=np.array(list(range(3)))+1 # axis with offset for 0 base index to 1
points=product(a,repeat=2) # Cartesian product of a with itself: all ordered (i, j) pairs, including i == j
np.asarray(list(points))   #convert to ndarray

I get the following output from this:

array([[1, 1],
   [1, 2],
   [1, 3],
   [2, 1],
   [2, 2],
   [2, 3],
   [3, 1],
   [3, 2],
   [3, 3]])
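
To illustrate the memory remark above: the object returned by product is an iterator, so combinations can be consumed a few at a time instead of being collected into a list or array. A small sketch using itertools.islice (the variable names are illustrative):

```python
from itertools import islice, product

axis = range(1, 4)                     # 1, 2, 3
points = product(axis, repeat=2)       # lazy: combinations are produced on demand
first_four = list(islice(points, 4))   # consume only the first four combinations
print(first_four)

next_point = next(points)              # iteration resumes where islice stopped
print(next_point)
```

Note that product still stores its input iterables internally; only the output combinations are generated lazily.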

- ThisGuyCantEven


- Thomas Browne · Accepted Answer

Just use a list comprehension:

>>> [(x, y) for x in range(5) for y in range(5)]

[(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (1, 0), (1, 1), (1, 2), (1, 3), (1, 4), (2, 0), (2, 1), (2, 2), (2, 3), (2, 4), (3, 0), (3, 1), (3, 2), (3, 3), (3, 4), (4, 0), (4, 1), (4, 2), (4, 3), (4, 4)]
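
For comparison, a short sketch showing that this kind of nested comprehension yields the same pairs as itertools.product, and that swapping the two for clauses gives the ordering R's expand.grid uses (first variable varying fastest). The lists a and b are taken from the question; the other names are illustrative.

```python
from itertools import product

a = [1, 2, 3]
b = [5, 7, 9]

# Outer loop over a, inner loop over b: the second element changes fastest.
pairs = [(x, y) for x in a for y in b]
same_as_product = pairs == list(product(a, b))

# Outer loop over b, inner loop over a: the first element changes fastest, as in R.
r_order = [(x, y) for y in b for x in a]

print(same_as_product)
print(r_order[:4])
```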

If needed, convert it to a numpy array:

>>> import numpy as np
>>> x = np.array([(x, y) for x in range(5) for y in range(5)])
>>> x.shape
(25, 2)

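I have tested this up to 10000 x 10000, and Python's performance is comparable to that of expand.grid in R. In the list comprehension, using a tuple (x, y) is about 40% faster than using a list [x, y].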

Using np.meshgrid is roughly 3 times faster and uses less memory:

%timeit np.array(np.meshgrid(range(10000), range(10000))).reshape(2, 100000000).T
1 loops, best of 3: 736 ms per loop

In R:

> system.time(expand.grid(1:10000, 1:10000))
   user  system elapsed 
  1.991   0.416   2.424

Note that R arrays are indexed from 1, whereas Python indexes from 0.