Strange behavior of sum on pandas DataFrame columns containing string values

Question

I have three pandas DataFrames of survey responses that look identical but were created in different ways:

import pandas as pd

df1 = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

df2 = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]])
df2.loc[1,2] = 'hey'

df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        if (i,j) != (1,2):
            df3.loc[i,j] = i*3 + j + 1
        else:
            df3.loc[i,j] = 'hey'

# df1, df2, df3 look the same as below
   0  1    2
0  1  2    3
1  4  5  hey
2  7  8    9

Now, when I compute the sum along the columns, all three give the same result.

sumcol1 = df1.sum()
sumcol2 = df2.sum()
sumcol3 = df3.sum()

# sumcol1, sumcol2, sumcol3 look the same as below
0    12
1    15
dtype: int64

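However, when I sum along the rows, df3 gives a different result from df1 and df2.

Moreover, it seems that with axis=0 the sum of a column containing a string is not computed, whereas with axis=1 every row sum is computed, skipping the elements that belong to the column containing a string.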

sumrow1 = df1.sum(axis=1)
sumrow2 = df2.sum(axis=1)
sumrow3 = df3.sum(axis=1)

#sumrow1
0     3
1     9
2    15
dtype: int64

#sumrow2
0     3
1     9
2    15
dtype: int64

#sumrow3
0    0.0
1    0.0
2    0.0
dtype: float64

I have three questions.

Why do sumcol1 and sumrow1 behave differently?
Why do sumrow1 and sumrow3 behave differently?
Is there a proper way to get the same result as sumrow1, but using df3?

Addendum:

Is there a smart way to add only the numerical values while keeping the strings?

My current workaround (thanks to jpp's kind answer):

df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])
df_c = df.copy()
for col in df.select_dtypes(['object']).columns:
    df_c[col] = pd.to_numeric(df_c[col], errors='coerce')
df['sum'] = df_c.sum(axis=1)

#result
   0  1    2   sum
0  1  2    3   6.0
1  4  5  hey   9.0
2  7  8    9  24.0

I am using Python 3.6.6 and pandas 0.23.4.

- jiao

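Interesting. What happens if you try df.apply(func, axis=1) with func = sum and with func = np.sum? - smci

@smci Both raise an error ("unsupported operand type(s) for +: 'int' and 'str'", occurred at index 1). - jiao

Do you have a legitimate use case for silently coercing or suppressing non-numeric values, or is this just curiosity? (Mixing categoricals with integers? Why not replace 'hey' with np.nan?) Also, df1.info() and df3.info() show that the column dtypes differ, as jpp diagnosed. Hence df1.equals(df3) fails. I imagine there is some other DataFrame comparison method that points out in more detail that the dtypes differ; UPDATE: pandas.testing.assert_frame_equal(df1, df3) does exactly that. - smci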

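@smci Yes, I do have a use case where the non-numeric values need to be kept. "pandas.testing.assert_frame_equal" is indeed a very useful method, thanks for the information! - jiao

OK, I spent time investigating this even after you had already accepted an answer, so if you find my answer useful you could upvote it. Also, please tell us your actual use case. Mixing strings with numbers and expecting numeric operations to respond sensibly feels a bit odd. - smci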

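@smci, I am extracting information from survey forms that were filled in with messy formatting. I want to compute statistics on the numbers extracted from the entries that contain numbers, while keeping the details of the entries that contain none for reference. I have upvoted your answer, thanks a lot! - jiao

2 Answers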

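Further to your question and jpp's diagnosis: the DataFrames look identical, but they differ in their column dtypes. The first two columns are int64 in df1 and df2 but object in df3.

Here are some comparison methods that expose the difference: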

>>> df1.equals(df3)
False # not so useful, doesn't tell you why they differ

What you really want is pandas.testing.assert_frame_equal:

>>> import pandas.testing
>>> pandas.testing.assert_frame_equal(df1, df3)

AssertionError: Attributes are different

Attribute "dtype" are different
[left]:  int64
[right]: object

pandas.testing.assert_frame_equal() has the following useful parameters, which you can customize as needed:

check_dtype : bool, default True    
Whether to check the DataFrame dtype is identical.

check_index_type : bool / string {‘equiv’}, default False    
Whether to check the Index class, dtype and inferred_type are identical.

check_column_type : bool / string {‘equiv’}, default False    
Whether to check the columns class, dtype and inferred_type are identical.

check_frame_type : bool, default False    
Whether to check the DataFrame class is identical.

check_less_precise : bool or int, default False    
Specify comparison precision. Only used when check_exact is False. 5 digits (False) or 3 digits (True) after decimal points are compared. If int, then specify the digits to compare

check_names : bool, default True    
Whether to check the Index names attribute.

by_blocks : bool, default False    
Specify how to compare internal data. If False, compare by columns. If True, compare by blocks.

check_exact : bool, default False    
Whether to compare number exactly.

check_datetimelike_compat : bool, default False    
Compare datetime-like which is comparable ignoring dtype.

check_categorical : bool, default True    
Whether to compare internal Categorical exactly.

check_like : bool, default False    
If true, ignore the order of rows & columns
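
As a minimal sketch of how check_dtype changes the outcome, the snippet below rebuilds df1 and df3 from the question and runs the comparison twice. The variable names strict_match and values_match are only illustrative, and the behavior is assumed to hold for reasonably recent pandas versions.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame([[1, 2, 3], [4, 5, 'hey'], [7, 8, 9]])

# Rebuild df3 as in the question: an all-object frame filled cell by cell.
df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        df3.loc[i, j] = 'hey' if (i, j) == (1, 2) else i * 3 + j + 1

# Strict comparison: fails because the column dtypes differ.
try:
    assert_frame_equal(df1, df3)
    strict_match = True
except AssertionError:
    strict_match = False

# Relaxed comparison: ignore dtypes and compare the values only.
try:
    assert_frame_equal(df1, df3, check_dtype=False)
    values_match = True
except AssertionError:
    values_match = False

print(strict_match, values_match)
```

If the relaxed comparison passes while the strict one fails, the frames hold the same values and differ only in how the columns are typed.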

- smci

- jpp · Accepted Answer

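There are a couple of issues:

The main issue is that all three series in df3, as you constructed it, have object dtype, whereas the first two series in df1 and df2 have dtype=int.
Data in a pandas DataFrame is organized and stored by series [column]. Type casting is therefore performed per series. As a consequence, the logic for summing "by row" and "by column" is necessarily different, and it is not necessarily consistent when it comes to mixed types.

To understand what is happening with the first issue, you have to appreciate that pandas does not continually check that the most appropriate dtype has been selected after each operation. Doing so would be prohibitively expensive.

You can check the dtypes yourself: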

print({'df1': df1.dtypes, 'df2': df2.dtypes, 'df3': df3.dtypes})

{'df1': 0     int64
        1     int64
        2    object
      dtype: object,

 'df2': 0     int64
        1     int64
        2    object
      dtype: object,

 'df3': 0    object
        1    object
        2    object
      dtype: object}
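
To illustrate the point that pandas does not re-infer a dtype after each operation, here is a small sketch on a single Series. The names s, dtype_mixed, dtype_after and s_numeric are made up for this example and do not come from the question.

```python
import pandas as pd

# A series that holds a string must use object dtype.
s = pd.Series([3, 'hey', 9])
dtype_mixed = s.dtype

# Replace the string with a number: every value is now an integer,
# but pandas does not go back and re-infer the dtype.
s.loc[1] = 6
dtype_after = s.dtype

# An explicit conversion is needed to get a numeric dtype.
s_numeric = pd.to_numeric(s)

print(dtype_mixed, dtype_after, s_numeric.dtype)
```

The same thing happens in df3: each cell assignment stores a value in an object column, and nothing triggers a later conversion to an integer dtype.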

You can selectively apply a conversion to df3 with an operation that checks whether the conversion would produce any null values:

for col in df3.select_dtypes(['object']).columns:
    col_num = pd.to_numeric(df3[col], errors='coerce')
    if not col_num.isnull().any():  # check if any null values
        df3[col] = col_num          # assign numeric series

print(df3.dtypes)

0     int64
1     int64
2    object
dtype: object

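You should then see consistent treatment. At this point it is worth discarding your original df3: nothing in the documentation says that continual series type checking can or should be applied after each operation.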
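
As an alternative sketch that is not part of the original answer, DataFrame.infer_objects() asks pandas to re-infer better dtypes for object columns in one call. Selecting the numeric columns explicitly with select_dtypes then makes the row sum independent of how a given pandas version treats object columns. The names df3_fixed and numeric_row_sum are illustrative.

```python
import pandas as pd

# Rebuild df3 as in the question: every column starts out as object dtype.
df3 = pd.DataFrame(index=range(3), columns=range(3))
for i in range(3):
    for j in range(3):
        df3.loc[i, j] = 'hey' if (i, j) == (1, 2) else i * 3 + j + 1

# Re-infer dtypes: columns 0 and 1 become int64, column 2 stays object.
df3_fixed = df3.infer_objects()
print(df3_fixed.dtypes)

# Sum across rows using only the numeric columns, stated explicitly.
numeric_row_sum = df3_fixed.select_dtypes(include='number').sum(axis=1)
print(numeric_row_sum)
```

This gives the same values as sumrow1 in the question, since only the first two columns take part in the sum.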

If you want to ignore non-numeric values when summing over rows or columns, you can force the conversion with pd.to_numeric and errors='coerce':

df = pd.DataFrame([[1,2,3],[4,5,'hey'],[7,8,9]])

col_sum = df.apply(pd.to_numeric, errors='coerce').sum()
row_sum = df.apply(pd.to_numeric, errors='coerce').sum(axis=1)

print(col_sum)

0    12.0
1    15.0
2    12.0
dtype: float64

print(row_sum)

0     6.0
1     9.0
2    24.0
dtype: float64
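
The float64 results above follow from the coercion step: 'hey' becomes NaN, which forces a float column, and sum skips NaN by default. A short sketch of that mechanism, using the illustrative names coerced, row_sum_skip and row_sum_strict:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 'hey'], [7, 8, 9]])

# 'hey' cannot be parsed, so it is replaced by NaN and column 2 becomes float64.
coerced = df.apply(pd.to_numeric, errors='coerce')
print(coerced.dtypes)

# Default behavior: NaN values are skipped when summing.
row_sum_skip = coerced.sum(axis=1)

# With skipna=False, any row containing NaN sums to NaN instead.
row_sum_strict = coerced.sum(axis=1, skipna=False)

print(row_sum_skip)
print(row_sum_strict)
```

Passing skipna=False can be useful when a row with an unparseable entry should be flagged instead of being summed silently.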