How to replace "any strings" with NaN in a pandas DataFrame using a boolean mask?

Question

19

I have a 227x4 DataFrame with country names and numerical values that needs cleaning (wrangling?).

Here is an abstraction of the DataFrame:

import pandas as pd
import random
import string
import numpy as np
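pdn = pd.DataFrame(["".join(random.choice(string.ascii_letters) for i in range(3)) for j in range(6)], columns=['Country Name'])
# np.random.random_integers is deprecated; randint(1, 11) draws from the same 1..10 range.
# astype(object) lets the measure columns hold the strings assigned below on recent pandas.
measures = pd.DataFrame(np.random.randint(1, 11, size=(6, 2)), columns=['Measure1', 'Measure2']).astype(object)
df = pdn.merge(measures, how='inner', left_index=True, right_index=True)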

df.iloc[4,1] = 'str'
df.iloc[1,2] = 'stuff'
print(df)

  Country Name Measure1 Measure2
0          tua        6        3
1          MDK        3    stuff
2          RJU        7        2
3          WyB        7        8
4          Nnr      str        3
5          rVN        7        4

How can I replace the string values with np.nan in all columns without touching the country names?

I tried using a boolean mask:

mask = df.loc[:,measures.columns].applymap(lambda x: isinstance(x, (int, float))).values
print(mask)

[[ True  True]
 [ True False]
 [ True  True]
 [ True  True]
 [False  True]
 [ True  True]]

# I thought the following would replace the False positions with np.nan in place, but it didn't
df.loc[:,measures.columns].where(mask, inplace=True)
print(df)

  Country Name Measure1 Measure2
0          tua        6        3
1          MDK        3    stuff
2          RJU        7        2
3          WyB        7        8
4          Nnr      str        3
5          rVN        7        4


# this gives good output, but unfortunately the country names are missing
print(df.loc[:,measures.columns].where(mask))

  Measure1 Measure2
0        6        3
1        3      NaN
2        7        2
3        7        8
4      NaN        3
5        7        4

I looked at several questions related to mine ([1], [2], [3], [4], [5], [6], [7], [8]), but could not find one that answered my question.

- Malik Koné

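A meta question: is it normal that it takes me more than 3 hours to formulate a question here (including research)?

Yes. The success of Stack Overflow and the whole Stack Exchange network is built on the high quality of its content, both questions and answers. You can't dash off a high-quality question in a few minutes. Personally, I would put the required effort on the order of days rather than hours. I have certainly spent a full day or more answering a question, and I would expect the asker to spend at least an order of magnitude more effort, since they are the one who benefits. - Jörg W Mittag

Side note: meta questions should be asked on Meta. - Jörg W Mittag

@JörgWMittag I was only counting the time I spent writing the question after I gave up trying. If I had to count everything, it would indeed come to days. I'll ask a question on Meta when I have more time. Taking so long to ask a question made me feel stupid. But now I feel much better: the quality of the answers proves the effort was well worth it. Thanks! - Malik Koné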

3 Answers

13

Assign only to the columns of interest:

cols = ['Measure1','Measure2']
mask = df[cols].applymap(lambda x: isinstance(x, (int, float)))

df[cols] = df[cols].where(mask)
print (df)
  Country Name Measure1 Measure2
0          uFv        7        8
1          vCr        5      NaN
2          qPp        2        6
3          QIC       10       10
4          Suy      NaN        8
5          eFS        6        4
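
For comparison, here is a minimal sketch of the same replacement written with DataFrame.mask, which is the inverse of where: it blanks the cells where the condition is True. The sample values are copied from the frame printed in the question, the name is_number is made up for illustration, and the per-column Series.map is only a stand-in for applymap so that the snippet does not depend on the pandas version.

```python
import pandas as pd

# Sample data copied from the question's printed frame (mixed int/str columns).
df = pd.DataFrame({
    "Country Name": ["tua", "MDK", "RJU", "WyB", "Nnr", "rVN"],
    "Measure1": [6, 3, 7, 7, "str", 7],
    "Measure2": [3, "stuff", 2, 8, 3, 4],
})
cols = ["Measure1", "Measure2"]

# True where the cell holds a number; astype(bool) guarantees a plain boolean frame.
is_number = df[cols].apply(
    lambda col: col.map(lambda x: isinstance(x, (int, float)))
).astype(bool)

kept = df[cols].where(is_number)     # keep where True, NaN elsewhere
blanked = df[cols].mask(~is_number)  # NaN where the inverted condition is True

# Assign back to the selected columns only; "Country Name" is never touched.
df[cols] = kept
print(df)
```

Both calls return a new frame, which is why the result has to be assigned back to df[cols].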

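A meta question: is it normal that it takes me more than 3 hours to formulate a question here (including research)?

I think yes, creating a good question is really hard.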

- jezrael

I like your answer, but why doesn't df2 = df.loc[:, measures.columns].where(mask, inplace=True) do the replacement, while df.loc[:, measures.columns].where(mask) prints the right thing? - Malik Koné

Because a call with inplace=True always returns None, so df2 is None. - jezrael

I have edited the question. I don't understand why df.loc[:,measures.columns].where(mask, inplace=True) does not modify df? - Malik Koné

1

I think the problem is that the assignment goes to a copy of df, like with fillna in this one. If you change your code to df[measures.columns].where(mask), you get a warning. - jezrael
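
The point about working on a copy can be illustrated with a small sketch. It assumes the sample values printed in the question; the names subset and cleaned are hypothetical. Selecting a list of columns with .loc returns a new object, so an in-place where changes that temporary object and never reaches df.

```python
import pandas as pd

# Sample data copied from the question's printed frame.
df = pd.DataFrame({
    "Country Name": ["tua", "MDK", "RJU", "WyB", "Nnr", "rVN"],
    "Measure1": [6, 3, 7, 7, "str", 7],
    "Measure2": [3, "stuff", 2, 8, 3, 4],
})
cols = ["Measure1", "Measure2"]
mask = df[cols].apply(
    lambda col: col.map(lambda x: isinstance(x, (int, float)))
).astype(bool)

# Selecting a list of columns yields a separate object, not a view into df.
subset = df.loc[:, cols]
subset.where(mask, inplace=True)  # modifies subset only

print(subset)  # strings replaced by NaN
print(df)      # still contains 'str' and 'stuff'

# Assigning the result back is what actually updates a frame.
cleaned = df.copy()
cleaned[cols] = cleaned[cols].where(mask)
print(cleaned)
```

In the question's one-liner the temporary object is discarded immediately, so the replacement is lost.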

9

cols = ['Measure1','Measure2']
df[cols] = df[cols].applymap(lambda x: x if not isinstance(x, str) else np.nan)

or

df[cols] = df[cols].applymap(lambda x: np.nan if isinstance(x, str) else x)

Result:

In [22]: df
Out[22]:
  Country Name  Measure1  Measure2
0          nBl      10.0       9.0
1          Ayp       8.0       NaN
2          diz       4.0       1.0
3          aad       7.0       3.0
4          JYI       NaN      10.0
5          BJO       9.0       8.0
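
A side note on versions: pandas 2.1 added DataFrame.map as the new name for applymap and deprecated the old one. A minimal sketch of a wrapper that works with either name follows; the helper elementwise is hypothetical, and the sample values are copied from the question.

```python
import numpy as np
import pandas as pd


def elementwise(frame, func):
    """Apply func to every cell, using DataFrame.map when it exists."""
    # Look the method up on the class so a column named "map" cannot shadow it.
    method = getattr(pd.DataFrame, "map", None) or pd.DataFrame.applymap
    return method(frame, func)


df = pd.DataFrame({
    "Country Name": ["tua", "MDK", "RJU", "WyB", "Nnr", "rVN"],
    "Measure1": [6, 3, 7, 7, "str", 7],
    "Measure2": [3, "stuff", 2, 8, 3, 4],
})
cols = ["Measure1", "Measure2"]

# Same lambda as in the answer: strings become NaN, everything else is kept.
df[cols] = elementwise(df[cols], lambda x: np.nan if isinstance(x, str) else x)
print(df)
```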

- MaxU - stand with Ukraine

Why use the negation x if not isinstance(x, str) rather than x if isinstance(x, (int, float)) else np.nan? - Malik Koné

1

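Without the negation it would replace all the numbers with NaN. You can write it like this instead: lambda x: np.nan if isinstance(x, str) else x - Bharath M Shetty

I don't want to replace the numbers.. I want to replace the non-numbers with NaN - Malik Koné

@MalikKoné, I think you want to use Bharath Shetty's solution. - MaxU - stand with Ukraine

All three answers are very interesting to me... My focus is on understanding; I don't need to optimize physical resources yet. :o) - Malik Koné

@MalikKoné, if your goal is to clean up numeric columns, that is, to convert every value to a numeric dtype and replace whatever cannot be converted with NaN, then Bharath Shetty's solution is the most idiomatic way. If you want to replace cells of a specific data type with NaN, you can choose between jezrael's solution and mine... - MaxU - stand with Ukraine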
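
The distinction drawn in the comment above shows up as soon as a column contains a numeric string. The sketch below uses a made-up Series (not data from the question) to contrast the two behaviours: a type-based test drops every string, while pd.to_numeric with errors='coerce' parses the strings it can and only blanks the rest.

```python
import numpy as np
import pandas as pd

# Hypothetical column: an int, a numeric string, a non-numeric string, a float.
s = pd.Series([6, "7", "str", 2.5], dtype=object)

# Type-based: anything that is a str becomes NaN, even if it looks like a number.
by_type = s.map(lambda x: np.nan if isinstance(x, str) else x)

# Parse-based: "7" is converted to 7.0, only "str" becomes NaN.
by_parse = pd.to_numeric(s, errors="coerce")

print(by_type)
print(by_parse)
```

Which one is right depends on whether strings such as "7" should count as data or as noise.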

- Bharath M Shetty · Accepted Answer

Convert the data with pd.to_numeric and errors='coerce', i.e.

cols = ['Measure1','Measure2']
df[cols] = df[cols].apply(pd.to_numeric,errors='coerce')

  Country Name  Measure1  Measure2
0          PuB       7.0       6.0
1          JHq       2.0       NaN
2          opE       4.0       3.0
3          pxl       3.0       6.0
4          ouP       NaN       4.0
5          qZR       4.0       6.0
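
When using this approach it can be useful to know which cells were coerced. Here is a minimal sketch, assuming the sample values printed in the question; the names converted and coerced are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question's printed frame.
df = pd.DataFrame({
    "Country Name": ["tua", "MDK", "RJU", "WyB", "Nnr", "rVN"],
    "Measure1": [6, 3, 7, 7, "str", 7],
    "Measure2": [3, "stuff", 2, 8, 3, 4],
})
cols = ["Measure1", "Measure2"]

# Convert column by column; values that cannot be parsed become NaN.
converted = df[cols].apply(pd.to_numeric, errors="coerce")

# Cells that were present before but are NaN after conversion were coerced.
coerced = converted.isna() & df[cols].notna()
print(coerced.sum())  # number of coerced cells per column

df[cols] = converted
print(df.dtypes)      # the measure columns are now float64
```

The columns end up as float64 because NaN is a floating-point value, which matches the 7.0-style output shown above.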