Convert an entire pandas DataFrame to integers (in pandas 0.17.0)

Question


74

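My question is very similar to this one, but I need to convert my entire dataframe instead of just a series. The to_numeric function only works on one series at a time and is not a good replacement for the deprecated convert_objects command. Is there a way to get results similar to convert_objects(convert_numeric=True) in the new pandas release?

Thank you, Mike Müller, for your example. df.apply(pd.to_numeric) works very well if all the values can be converted to integers. What if my dataframe contains strings that cannot be converted to integers? For example: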

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
df.dtypes
Out[59]: 
Words    object
ints     object
dtype: object

Then I could run the deprecated function and get:

df = df.convert_objects(convert_numeric=True)
df.dtypes
Out[60]: 
Words    object
ints      int64
dtype: object

Running the apply command gives me errors, even when I wrap it in try and except.

- Bobe Kryant

4 Answers

5

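The accepted answer, which uses pd.to_numeric(), converts to float wherever that is needed. Read closely, though, the question is about converting every numeric column to integer.

That is why the accepted answer would need a loop over all columns that converts the numbers to int at the end.

Just for completeness, this is even possible without pd.to_numeric(); of course, this is not recommended: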

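import pandas as pd

df = pd.DataFrame({'a': ['1', '2'],
                   'b': ['45.8', '73.9'],
                   'c': [10.5, 3.7]})

for col in df.columns:
    try:
        # Go through float first so strings like '45.8' parse, then truncate to int.
        df[col] = df[col].astype(float).astype(int)
    except (ValueError, TypeError):
        # Leave columns that cannot be parsed as numbers untouched.
        pass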

print(df.dtypes)

Output:

a    int32
b    int32
c    int32
dtype: object

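Edit: Note that this not-recommended solution is unnecessarily complicated; pd.to_numeric() can simply be given the keyword argument downcast='integer' to force integer output. Thank you for the comment. That is still missing from the accepted answer, though.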
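As a rough illustration of the downcast='integer' suggestion, here is a minimal sketch that reuses the example frame from this answer. The variable name converted is only for illustration. Note that downcasting only yields an integer dtype for columns whose values are all whole numbers, so b and c are expected to stay floating point here.

```python
import pandas as pd

df = pd.DataFrame({'a': ['1', '2'],
                   'b': ['45.8', '73.9'],
                   'c': [10.5, 3.7]})

# Extra keyword arguments given to apply() are forwarded to pd.to_numeric.
converted = df.apply(pd.to_numeric, downcast='integer')

# Column a holds only whole numbers, so it can be downcast to an integer dtype.
# Columns b and c contain fractional values and remain floating point.
print(converted.dtypes)
```

If truncated integers are wanted for b and c as well, an explicit cast such as .astype(int) is still needed after the conversion.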

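Update again: Following the comment from user Gary, it turns out that "as of pandas 2.0.1, if the input series contains empty strings or None, the resulting dtype will still be float, even with downcast='integer'". So if you want to be sure you get only integers, the .astype(float).astype(int) approach shown above is useful again.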

- questionto42

2

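If all the "numbers" are formatted as integers (i.e. '5', not '5.0'), then the keyword argument downcast='integer' can be used in the to_numeric function to force the integer type: in this example, df.apply(pd.to_numeric, downcast='integer') will return column a as integer. - JJL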

1

Note that as of pandas 2.0.1, if the input series contains empty strings or None, the resulting dtype will still be float, even with downcast='integer'. - Gary
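The behaviour described in this comment can be reproduced with a small sketch. The series s and the variable names are made up for illustration. The second half uses pandas' nullable 'Int64' extension dtype, which is not mentioned anywhere in this thread; it is shown only as one possible way to keep integers alongside missing values.

```python
import pandas as pd

s = pd.Series(['1', None, '3'])

# None is parsed as NaN, and NaN cannot be stored in a plain NumPy integer
# column, so the result stays float even though downcast='integer' is given.
as_float = pd.to_numeric(s, downcast='integer')
print(as_float.dtype)

# Assumption: the nullable 'Int64' dtype (capital I) is acceptable downstream.
# It stores the missing entry as <NA> instead of forcing the column to float.
as_nullable_int = pd.to_numeric(s).astype('Int64')
print(as_nullable_int.dtype)
```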

1

You can use df.astype() to convert a series to the desired data type.

For example: my_str_df = pd.DataFrame({'column_name': ['20', '30', '40']})

Then: my_int_df = my_str_df['column_name'].astype(int)  # this will be of type int

- P.R.

2

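Downvote. The question is about a dataframe, not a series. You do not explain how to change the whole dataframe, which also contains strings holding float values, such as '45.8'. - questionto42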
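For the dataframe-wide case raised in this comment, astype() can also be called on the whole frame. The following is a sketch under the assumption that every column is known to hold numeric strings; the frame and the names typed and truncated are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': ['20', '30'],
                   'b': ['45.8', '73.9']})

# A dict maps column names to target dtypes, so each column gets its own type.
typed = df.astype({'a': 'int64', 'b': 'float64'})
print(typed.dtypes)

# A string such as '45.8' cannot be cast to int directly. Casting the whole
# frame to float first and then to int truncates the fractional part.
truncated = df.astype(float).astype('int64')
print(truncated)
```

Unlike pd.to_numeric with error handling, astype() raises as soon as one value cannot be parsed, so this only fits frames without text columns.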

0

Apply pd.to_numeric to the DataFrame with the errors='ignore' argument, and assign the result back to the DataFrame:

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
print ("Orig: \n",df.dtypes)

df.apply(pd.to_numeric, errors='ignore')
print ("\nto_numeric: \n",df.dtypes)

df = df.apply(pd.to_numeric, errors='ignore')
print ("\nto_numeric with assign: \n",df.dtypes)

Output:

Orig: 
 ints     object
Words    object
dtype: object

to_numeric: 
 ints     object
Words    object
dtype: object

to_numeric with assign: 
 ints      int64
Words    object
dtype: object

- Alon Lavian

No doubt, if you want to keep the changes, you need to reassign df. This should just be a comment under the accepted solution. - questionto42


- Mike Müller · Accepted Answer

All columns convertible

You can apply the function to all columns:

df.apply(pd.to_numeric)

Example:

>>> df = pd.DataFrame({'a': ['1', '2'], 
                       'b': ['45.8', '73.9'],
                       'c': [10.5, 3.7]})

>>> df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a    2 non-null object
b    2 non-null object
c    2 non-null float64
dtypes: float64(1), object(2)
memory usage: 64.0+ bytes

>>> df.apply(pd.to_numeric).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 3 columns):
a    2 non-null int64
b    2 non-null float64
c    2 non-null float64
dtypes: float64(2), int64(1)
memory usage: 64.0 bytes

Not all columns convertible

pd.to_numeric has the keyword argument errors:

  Signature: pd.to_numeric(arg, errors='raise')
  Docstring:
  Convert argument to a numeric type.

  Parameters
  ----------
  arg : list, tuple or array of objects, or Series
  errors : {'ignore', 'raise', 'coerce'}, default 'raise'
      - If 'raise', then invalid parsing will raise an exception
      - If 'coerce', then invalid parsing will be set as NaN
      - If 'ignore', then invalid parsing will return the input
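To make the difference between the options in this docstring concrete, here is a small sketch of the default 'raise' behaviour next to 'coerce'. The series mixed is a made-up example.

```python
import pandas as pd

mixed = pd.Series(['3', 'Kobe'])

# Default errors='raise': a value that cannot be parsed raises ValueError.
try:
    pd.to_numeric(mixed)
    raised = False
except ValueError:
    raised = True

# errors='coerce': the unparseable value becomes NaN, which makes the
# resulting column float rather than int.
coerced = pd.to_numeric(mixed, errors='coerce')
print(coerced)
```

With 'coerce' the original text is lost, so it is a poor fit for columns such as Words that should stay as they are.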

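Setting it to ignore returns the column unchanged if it cannot be converted to a numeric type. As pointed out by Anton Protopopov, the most elegant way is to supply ignore as a keyword argument to apply():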

>>> df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})
>>> df.apply(pd.to_numeric, errors='ignore').info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words    2 non-null object
ints     2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes

My previously suggested way, using partial from the functools module, is more verbose:

>>> from functools import partial
>>> df = pd.DataFrame({'ints': ['3', '5'], 
                       'Words': ['Kobe', 'Bryant']})
>>> df.apply(partial(pd.to_numeric, errors='ignore')).info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2 entries, 0 to 1
Data columns (total 2 columns):
Words    2 non-null object
ints     2 non-null int64
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes
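A note for newer pandas releases: as far as I can tell, errors='ignore' has been deprecated since pandas 2.2, with the recommendation to catch exceptions explicitly instead. The following is a minimal sketch of that idea, not part of the original answer; to_numeric_or_keep is a hypothetical helper name.

```python
import pandas as pd

def to_numeric_or_keep(column):
    """Hypothetical helper: convert a column if possible, else return it unchanged."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        # The column holds values that are not numbers, so leave it as it is.
        return column

df = pd.DataFrame({'ints': ['3', '5'], 'Words': ['Kobe', 'Bryant']})

# apply() calls the helper once per column and reassembles the results.
converted = df.apply(to_numeric_or_keep)
print(converted.dtypes)
```

This keeps the all-or-nothing behaviour per column that errors='ignore' had: a column is either converted completely or returned as it was.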