Pandas DataFrame iloc spoils the data type

Question

11

I am using pandas 0.19.2.

Here is an example:

import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})
testdf.dtypes

Output:

A      int64
B    float64
dtype: object

Everything looks fine so far, but here is what I don't like (note that the first call is pd.Series.iloc and the second is pd.DataFrame.iloc):

print(type(testdf.A.iloc[0]))
print(type(testdf.iloc[0].A))

Output:

<class 'numpy.int64'>
<class 'numpy.float64'>

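I found this while trying to understand why a pd.DataFrame.join() operation returned almost no matches between two int64 columns when there should have been many. My guess was that a type inconsistency connected to this behaviour might be the cause, but I'm not sure. My short investigation turned up the result above, and now I'm a bit confused.

If anyone knows how to solve it, I would be very grateful for any hints!

Update

Thanks to @EdChum for the comments. Here is an example with my generated data showing the join/merge behaviour: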

testdf.join(testdf, on='A', rsuffix='3')

    A   B   A3  B3 
0   1   1.0 2.0 2.0
1   2   2.0 3.0 3.0
2   3   3.0 4.0 4.0
3   4   4.0 NaN NaN

whereas the following code

pd.merge(left=testdf, right=testdf, on='A')

which I expected to be equivalent, returns

    A   B_x B_y
0   1   1.0 1.0
1   2   2.0 2.0
2   3   3.0 3.0
3   4   4.0 4.0

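Update 2: Following @EdChum's comment about join and merge behaviour. The problem is that A.join(B, on='C') matches the values of column C in A against the index of B, because join uses the other frame's index by default. In my case I simply used merge to get the desired result.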

- ghastly_kitten

2

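iloc returns a Series of your row; since there is no dtype that can satisfy both int and float, object is displayed. But what issue do you have if your row has mixed dtypes? - EdChum

If the columns you are trying to match are int64, the value comparison should work as expected; if they are float, you may run into precision issues, which is unrelated to what you have shown above. - EdChum

@EdChum OK, thanks, your point explains my example. My columns are not floats, so the problem is somewhere else. For example, I can look up a given value in both tables by hand, yet the join fails. - ghastly_kitten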

2

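Note that by default join tries to join on the index, whereas merge tries to merge on columns. They are semantically different, but depending on the parameters you pass you can get the same result. - EdChum

Oh, I see. Yes, that must be the problem. - ghastly_kitten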
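
The join/merge distinction discussed in the comments can be sketched as follows. This is a minimal illustration using the question's testdf; the variable names joined_on_index, joined_on_values and merged are hypothetical and not from the original post. It assumes that setting 'A' as the index of the right-hand frame is an acceptable way to make join match on values.

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# join(on='A') looks up the values of left['A'] in the *index* of the
# right frame (0..3), so A=4 finds no partner and produces NaN.
joined_on_index = testdf.join(testdf, on='A', rsuffix='3')

# Making 'A' the index of the right frame turns the lookup into a
# match on the values of A.
joined_on_values = testdf.join(testdf.set_index('A'), on='A', rsuffix='3')

# merge(on='A') matches the 'A' columns of both frames directly.
merged = pd.merge(left=testdf, right=testdf, on='A')

print(joined_on_index)
print(joined_on_values)
print(merged)
```

With the right-hand index set to 'A', the join pairs each row with itself, just as merge does; only the suffixes of the overlapping column names differ.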


1 Answer


- piRSquared · Accepted Answer

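This is the expected result. pandas tracks dtypes per column. When you call testdf.iloc[0], you are asking for a row, and pandas has to convert the whole row into a Series. That row contains a float, so the row as a Series must be float.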

However, when pandas uses loc or iloc, it makes this conversion even when you use a single __getitem__.
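
The upcast described above can be seen directly by comparing the per-column dtypes with the dtype of a single row. This is a small sketch on the question's testdf; the names col_dtypes, row and row_dtype are introduced here only for illustration.

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# Each column keeps its own dtype.
col_dtypes = testdf.dtypes

# A single row has to fit into one Series, which has exactly one dtype,
# so the int from column A is upcast to float alongside column B.
row = testdf.iloc[0]
row_dtype = row.dtype

print(col_dtypes)
print(row_dtype)
```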

Here are some interesting test cases, starting with a testdf that has a single int column.

testdf = pd.DataFrame({'A': [1, 2, 3, 4]})

print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))

<class 'numpy.int64'>
<class 'numpy.int64'>

Now change it to the OP's test case.

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

print(type(testdf.iloc[0].A))
print(type(testdf.A.iloc[0]))

<class 'numpy.float64'>
<class 'numpy.int64'>

print(type(testdf.loc[0, 'A']))
print(type(testdf.iloc[0, 0]))
print(type(testdf.at[0, 'A']))
print(type(testdf.iat[0, 0]))
# Note: get_value existed in pandas 0.19.2 but was removed in pandas 1.0.
print(type(testdf.get_value(0, 'A')))

<class 'numpy.float64'>
<class 'numpy.float64'>
<class 'numpy.int64'>
<class 'numpy.int64'>
<class 'numpy.int64'>

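So it appears that when pandas uses loc or iloc, it makes some conversion across rows that I still don't fully understand. I believe it has to do with the nature of loc and iloc being different from at, iat and get_value: iloc and loc let you access the DataFrame with index arrays and boolean arrays, whereas at, iat and get_value only access a single cell at a time.

Despite that,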

testdf.loc[0, 'A'] = 10

print(type(testdf.at[0, 'A']))

when we assign to that location via loc, pandas makes sure the dtype stays consistent.
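
For readers who want to check this, or who need to read a row without the float upcast, here is a hedged sketch. Selecting with a list (iloc[[0]]) is a suggestion added here rather than part of the original answer, and the variable names are hypothetical. It relies on a list selection returning a one-row DataFrame instead of a Series.

```python
import pandas as pd

testdf = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [1.0, 2.0, 3.0, 4.0]})

# Assigning an int through loc leaves column A as int64.
testdf.loc[0, 'A'] = 10
dtype_after_assign = testdf['A'].dtype

# Selecting with a list returns a one-row DataFrame rather than a Series,
# so every column keeps its own dtype.
one_row = testdf.iloc[[0]]
value_from_frame = one_row['A'].iloc[0]

# at reads a single cell and returns the column's own scalar type,
# in line with the test output in the answer.
value_from_at = testdf.at[0, 'A']

print(dtype_after_assign)
print(one_row.dtypes)
print(type(value_from_frame), type(value_from_at))
```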