pandas DataFrame concat / update ("upsert")?

Question

37

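I am looking for an elegant way to append all the rows from one DataFrame to another DataFrame (both DataFrames have the same index and column structure), but where the same index value appears in both DataFrames, the row from the second DataFrame should be used.

For example, if I start with: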

df1:
                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'A2'   'B2'
    '2015-10-03'  'A3'   'B3'

df2:
    date            A      B
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

I would like the result to be:

                    A      B
    date
    '2015-10-01'  'A1'   'B1'
    '2015-10-02'  'a1'   'b1'
    '2015-10-03'  'a2'   'b2'
    '2015-10-04'  'a3'   'b3'

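This is analogous to what some SQL systems call an "upsert", a combination of update and insert. Each row is taken from df2: if its key already exists in df1, it replaces the existing row; if its key does not exist in df1, it is inserted at the end of df1.

I have come up with the following: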

(
    pd.concat([df1, df2])     # concatenate the two DataFrames
    .reset_index()            # turn 'date' into a regular column
    .groupby('date')          # group rows by values in the 'date' column
    .tail(1)                  # take the last row in each group
    .set_index('date')        # restore 'date' as the index
)

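This seems to work, but it relies on the row order within each groupby group always matching the order in the original DataFrames, which I have not verified, and it seems unnecessarily complicated.

Is there a simpler, more direct solution?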

- embeepea

3 Answers

11

As of pandas 1.0.3, the desired functionality is provided directly by combine_first:

combined = df2.combine_first(df1)

print(combined)
#               A   B
# 2015-10-01    A1  B1
# 2015-10-02    a1  b1
# 2015-10-03    a2  b2
# 2015-10-04    a3  b3

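For this to work, the DataFrame whose data takes priority (df2 in this case) must be the one calling the method.

It essentially: (1) aligns the rows and columns of both DataFrames, (2) prefers non-NaN data, and (3) where a data point is defined in both DataFrames, prefers the value from df2, which is what you want.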
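To make the point about the calling DataFrame concrete, here is a minimal sketch using the sample data from the question. The variable names df2_wins and df1_wins are only illustrative, and the dates are assumed to be plain strings as displayed in the question.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

# df2 is the caller, so its values win wherever both frames define a cell.
df2_wins = df2.combine_first(df1)

# Swapping the roles keeps df1's values on the overlapping dates instead.
df1_wins = df1.combine_first(df2)

print(df2_wins)
print(df1_wins)
```

Both results cover all four dates; they differ only on the two dates that appear in both frames.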

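Edit: My understanding is that combine_first does satisfy the requested "update if present, insert if absent" behavior. However, as Vijchti points out in the comments (thanks), this does not strictly correspond to SQL's UPSERT operation, because the logic is applied value by value rather than to whole rows. I have removed all references to UPSERT from the answer.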

- billjoie

3

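An UPSERT is a row-wise insert or replace, whereas combine_first works value by value. The two operations are not equivalent. With UPSERT, the new row completely replaces the existing row. With combine_first, non-null values in the new row only replace null values in the existing row (and all existing non-null values are left unchanged). - Vijchti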
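The difference only shows up when missing values are involved, so here is a small sketch with made-up data (the frames existing and incoming, and the keys k1 to k3, are hypothetical). It compares combine_first with a whole-row replacement built from pd.concat, as used in the accepted answer.

```python
import numpy as np
import pandas as pd

existing = pd.DataFrame(
    {"A": ["A1", "A2"], "B": ["B1", "B2"]}, index=["k1", "k2"]
)
# The incoming row for k2 deliberately carries a missing value in column B.
incoming = pd.DataFrame(
    {"A": ["a2", "a3"], "B": [np.nan, "b3"]}, index=["k2", "k3"]
)

# Value-wise: the NaN in incoming is filled from existing, so B keeps "B2".
value_wise = incoming.combine_first(existing)

# Row-wise: the whole k2 row comes from incoming, NaN included.
row_wise = pd.concat([existing[~existing.index.isin(incoming.index)], incoming])

print(value_wise.loc["k2"].tolist())
print(row_wise.loc["k2"].tolist())
```

If the incoming data never contains NaN, the two approaches hold the same values, which is presumably why combine_first works for the data in the question.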

What is UPSERT? Is it the same as the accepted answer? - mike01010

4

In addition to the correct answer, also watch out for columns that are not present in both DataFrames:

    df1 = pd.DataFrame([['test',1, True], ['test2',2, True]]).set_index(0)
    df2 = pd.DataFrame([['test2',4], ['test3',3]]).set_index(0)

If you apply the simplified solution from the accepted answer directly, you get:

    >>>     1   2
    0       
    test    1   True
    test2   4   NaN
    test3   3   NaN

However, if you expect the following output:

    >>>     1   2
    0       
    test    1   True
    test2   4   True
    test3   3   NaN

simply change the statement to:

    df1 = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
    df1.update(df2)
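For anyone who wants to reproduce both outputs side by side, the following sketch runs the two variants on the frames defined above. The names replaced and merged are only illustrative; unlike the snippet above, this version leaves df1 itself untouched.

```python
import pandas as pd

df1 = pd.DataFrame([["test", 1, True], ["test2", 2, True]]).set_index(0)
df2 = pd.DataFrame([["test2", 4], ["test3", 3]]).set_index(0)

# Whole-row replacement: test2 loses its value in column 2,
# because df2 has no such column.
replaced = pd.concat([df1[~df1.index.isin(df2.index)], df2])

# Append only the new keys, then overwrite the overlapping cells in place.
# update() only touches columns that exist in df2, so column 2 survives.
merged = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
merged.update(df2)

print(replaced)
print(merged)
```

Note that update() skips NaN values in df2, so this variant behaves value-wise rather than row-wise.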

- MisterMonk

- Alexander · Accepted Answer

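One solution is to concatenate df1 with the new rows in df2 (i.e. the rows whose index does not match any index in df1), and then update the result with the values from df2.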

df = pd.concat([df1, df2[~df2.index.isin(df1.index)]])
df.update(df2)

>>> df
             A   B
2015-10-01  A1  B1
2015-10-02  a1  b1
2015-10-03  a2  b2
2015-10-04  a3  b3

Edit: Following @chrisb's suggestion, this can be simplified further as follows:

pd.concat([df1[~df1.index.isin(df2.index)], df2])

Thanks Chris!
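If this is needed in more than one place, the one-liner can be wrapped in a small helper. The sketch below is only an illustration: the function name upsert and its parameter names are hypothetical, and it assumes the index of the incoming frame contains no duplicate labels.

```python
import pandas as pd


def upsert(existing: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rows of `incoming` replace same-label rows of `existing`."""
    # Keep only the rows of `existing` whose labels are absent from `incoming` ...
    untouched = existing[~existing.index.isin(incoming.index)]
    # ... then append every row of `incoming` (updated and new rows alike).
    return pd.concat([untouched, incoming])


df1 = pd.DataFrame(
    {"A": ["A1", "A2", "A3"], "B": ["B1", "B2", "B3"]},
    index=pd.Index(["2015-10-01", "2015-10-02", "2015-10-03"], name="date"),
)
df2 = pd.DataFrame(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2", "b3"]},
    index=pd.Index(["2015-10-02", "2015-10-03", "2015-10-04"], name="date"),
)

result = upsert(df1, df2)
print(result)
```

Replaced rows end up in the position they have in the incoming frame rather than their original position, so call sort_index() on the result if the order matters.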