Pandas: converting a mixed-type column to strings

Question

6

Given the following DataFrame:

import pandas as pd

DF = pd.DataFrame({'COL1': ['A', 'B', 'C', 'D', 'D', 'D'],
                   'mixed': [2016.0, 2017.0, 'sweatervest', 20, 209, 21]})
DF

    COL1    mixed
0   A       2016.0
1   B       2017.0
2   C       sweatervest
3   D       20
4   D       209 
5   D       21

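I would like to convert the 'mixed' column to object dtype such that all numbers become integer strings and all strings stay unchanged. The desired output is: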

    COL1    mixed
0   A       2016
1   B       2017
2   C       sweatervest
3   D       20
4   D       209 
5   D       21

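Background:

'mixed' originally came from a DataFrame read from a CSV; it consists mostly of numbers, with the occasional string. When I try to convert it to strings, some of the numbers end up with a trailing '.0'.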

- Dance Party

1

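Your original data has numbers ending in .0; those are floats. Do you want to convert them to int first? - TigerhawkT3

I believe so. The CSV file I'm using has no such decimals; they only appear after I convert the values to strings with astype(str). So if it's easier, maybe I should start from the step before that. Otherwise, I'd like to first convert the numeric values to floats. - Dance Party

It looks like some values have a decimal point and some don't. So I think what's needed is: convert floats to integers and then to strings, convert integers to strings, and leave strings unchanged... but I don't know how to do that. - Dance Party

My main reason for doing this is to be able to join data files. Right now it looks like the key field in one table is formatted differently from the other, which I guess is why the join (pd.merge) fails. I'll try comparing integer strings against scientific notation (see the comment below) and see whether that works. - Dance Party

@DanceParty. I ran into a problem similar to the one you describe (a CSV file containing mixed types). Passing low_memory=False to read_csv() helped. Without it, floats (as strings) were read even though the CSV file actually contained no decimal points (only integer values mixed with other strings, but many (> 10k) rows). - orange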
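
The merge problem described in the comments above can be sketched roughly as follows. This is only an illustration under assumptions: the frames `left` and `right`, the column name `key`, and the helper `normalize_key` are hypothetical and do not come from the thread. The idea is to bring both key columns to the same string form before calling pd.merge.

```python
import pandas as pd


def normalize_key(value):
    """Hypothetical helper: integral floats become integer strings, everything else goes through str()."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


# Two made-up tables whose keys are formatted differently (floats vs. strings).
left = pd.DataFrame({"key": [2016.0, 2017.0, "sweatervest"], "a": [1, 2, 3]})
right = pd.DataFrame({"key": ["2016", "2017", "sweatervest"], "b": [10, 20, 30]})

# Normalize both key columns to the same string representation, then merge.
left["key"] = left["key"].map(normalize_key)
right["key"] = right["key"].map(normalize_key)
merged = pd.merge(left, right, on="key")
```

Note that with this helper a missing value (NaN) would turn into the string 'nan', which may or may not be what you want for a join key.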

3 Answers

3

df.mixed = df.mixed.apply(lambda elt: str(int(elt)) if isinstance(elt, float) else str(elt))

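This line calls the function lambda elt: str(int(elt)) if isinstance(elt, float) else str(elt) on each element of the 'mixed' column.

Note: this assumes that all floats can be converted to integers, as you mentioned in the comments on the question.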

- gbrener

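I tried this, but got an error: ValueError: cannot convert float NaN to integer. - Dance Party

If you want to find the cause of the problem, you can break the lambda out into a separate function (defined with def) and wrap the expression in a try-except block with some print statements. - gbrener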
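
One possible way to get past the NaN error is to split the lambda into a named function, as suggested, and check for missing values explicitly instead of using try-except. This is a sketch, not gbrener's code: the name `to_int_str` and the sample data are made up, and leaving NaN untouched is an assumption about the desired behavior.

```python
import math

import pandas as pd


def to_int_str(elt):
    """Hypothetical helper: integral floats become integer strings, NaN is passed through."""
    if isinstance(elt, float):
        if math.isnan(elt):
            return elt  # assumption: a missing value should stay missing
        if elt.is_integer():
            return str(int(elt))
    # Non-integral floats, ints and strings all go through plain str().
    return str(elt)


# Made-up sample column containing a NaN and a non-integral float.
df = pd.DataFrame({"mixed": [2016.0, float("nan"), "sweatervest", 20, 2.5]})
df["mixed"] = df["mixed"].apply(to_int_str)
```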

1

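This approach builds on gbrener's answer. It iterates over the DataFrame looking for mixed-dtype columns. For each such column, it first replaces all nan values with pd.NA, then safely converts the values to strings. It can be used in place, as in unmix_dtypes(df). It was tested with Pandas 1 under Python 3.8.

Note that this answer uses assignment expressions, which require Python 3.8 or newer. It can easily be modified to not use them, however.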

from typing import Union

import pandas as pd

def _to_str(val: Union[type(pd.NA), float, int, str]) -> Union[type(pd.NA), str]:
    """Return a string representation of the given integer, rounded float, or otherwise a string.

    `pd.NA` values are returned as is.

    It can be useful to call `df[col].fillna(value=pd.NA, inplace=True)` before calling this function.
    """
    if val is pd.NA:
        return val
    if isinstance(val, float) and (val % 1 == 0.0):
        return str(int(val))
    if isinstance(val, int):
        return str(val)
    assert isinstance(val, str)
    return val

def unmix_dtypes(df: pd.DataFrame) -> None:
    """Convert mixed dtype columns in the given dataframe to strings.

    Ref: https://dev59.com/LpDea4cB1Zd3GeqPbGiH#61826020/
    """
    for col in df.columns:
        if not (orig_dtype := pd.api.types.infer_dtype(df[col])).startswith("mixed"):
            continue
        df[col] = df[col].fillna(value=pd.NA)  # assign back; in-place fillna on a selected column may not update df
        df[col] = df[col].apply(_to_str)
        if (new_dtype := pd.api.types.infer_dtype(df[col])).startswith("mixed"):
            raise TypeError(f"Unable to convert {col} to a non-mixed dtype. Its previous dtype was {orig_dtype} and new dtype is {new_dtype}.")
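
As a rough illustration of the remark about assignment expressions, the column check can be written without them by binding the result of pd.api.types.infer_dtype to a name first. The function `find_mixed_columns` and the frame `example` are hypothetical names used only for this sketch.

```python
import pandas as pd


def find_mixed_columns(df: pd.DataFrame) -> list:
    """Return the labels of columns whose inferred dtype starts with "mixed"."""
    mixed = []
    for col in df.columns:
        inferred = pd.api.types.infer_dtype(df[col])  # plain assignment, no := needed
        if inferred.startswith("mixed"):
            mixed.append(col)
    return mixed


# Made-up frame: one all-string column and one column mixing floats, strings and ints.
example = pd.DataFrame({"COL1": ["A", "B", "C"], "mixed": [2016.0, "sweatervest", 20]})
mixed_columns = find_mixed_columns(example)
```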

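Note: one danger of not specifying an explicit dtype is that a column such as ["012", "0034", "4"] may be read by pd.read_csv as an integer column, irrecoverably losing the leading zeros. Worse, if DataFrames are concatenated, this loss of leading zeros can happen inconsistently, leading to column values such as ["012", "12", "34", "0034"].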
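
The leading-zero problem can be reproduced with a small in-memory CSV. In this sketch the column name `code` and the CSV text are made up; passing a dtype for the column to pd.read_csv keeps the values as text.

```python
import io

import pandas as pd

# Made-up CSV with values that look numeric but carry leading zeros.
csv_text = "code\n012\n0034\n4\n"

# Default type inference parses the column as integers and drops the zeros.
inferred = pd.read_csv(io.StringIO(csv_text))

# Declaring the column as str keeps the original text.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})
```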

- Asclepius


- JAB · Accepted Answer

3

Try:

DF['mixed']=DF.mixed.astype(object)

This results in:

DF['mixed']

0           2016
1           2017
2    sweatervest
3             20
4            209
5             21
Name: mixed, dtype: object

- JAB

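I just tried this, and it converted the decimals to scientific notation. - Dance Party

The join worked, though, so for my purposes this solves the problem. Thanks to both of you. - Dance Party

By the way, is there any difference between DF['mixed'] = DF.mixed.astype(object) and DF.mixed = DF.mixed.astype(object)? - Dance Party

No difference, but whether you can use dot notation depends on the label of the series, for example if the field name contains a space. - JAB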
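
A small sketch of the point about dot notation, using a made-up frame with one plain label and one label containing a space (`frame` and `mixed col` are hypothetical names):

```python
import pandas as pd

# Made-up frame: "mixed" is a valid Python identifier, "mixed col" is not.
frame = pd.DataFrame({"mixed": [1.0, "a"], "mixed col": [2.0, "b"]})

# For an identifier-like label, dot and bracket access return the same column.
same = frame.mixed.equals(frame["mixed"])

# A label with a space can only be reached with brackets;
# writing frame.mixed col would be a syntax error.
spaced = frame["mixed col"]
```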