Pandas: replace NaN values with the mode of a column, grouped by a second column

Question

3

I have a pandas data frame with two columns, city and country. Both columns contain missing values. Consider this data frame:

import numpy as np
import pandas as pd
temp = pd.DataFrame({"country": ["country A", "country A", "country A", "country A", "country B","country B","country B","country B", "country C", "country C", "country C", "country C"],
                     "city": ["city 1", "city 2", np.nan, "city 2", "city 3", "city 3", np.nan, "city 4", "city 5", np.nan, np.nan, "city 6"]})

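I now want to fill the NaN values in the "city" column with the mode of the cities for that country in the rest of the data frame. For example, for country A, city 1 is mentioned once and city 2 is mentioned twice, so the "city" column at index 2 should be filled with "city 2". This is what I have done so far: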

cities = [city for city in temp["country"].value_counts().index]
modes = temp.groupby(["country"]).agg(pd.Series.mode)
dict_locations = modes.to_dict(orient="index")
new_dict_locations = {}  # must exist before the loop assigns into it
for k in dict_locations.keys():
    new_dict_locations[k] = dict_locations[k]["city"]
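For reference, the per-country counts and modes in the sample frame can be inspected with a short sketch like the one below. It is only an illustration: it rebuilds the same data in a more compact form and calls `Series.mode()` once per group instead of the `agg` call above, and the names `counts` and `modes` are made up for this example.

```python
import numpy as np
import pandas as pd

# Same data as in the question, written more compactly.
temp = pd.DataFrame({
    "country": ["country A"] * 4 + ["country B"] * 4 + ["country C"] * 4,
    "city": ["city 1", "city 2", np.nan, "city 2",
             "city 3", "city 3", np.nan, "city 4",
             "city 5", np.nan, np.nan, "city 6"],
})

# How often each city appears within each country (NaN is not counted).
counts = temp.groupby("country")["city"].value_counts()
print(counts)

# Series.mode() returns every value tied for the highest count,
# so a bimodal country yields a list with two entries.
modes = {country: cities.mode().tolist()
         for country, cities in temp.groupby("country")["city"]}
print(modes)
# {'country A': ['city 2'], 'country B': ['city 3'], 'country C': ['city 5', 'city 6']}
```

Storing the modes as plain lists keeps every entry the same type, so later code does not need to distinguish a single string from an array.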

Now that I have the countries and the corresponding city modes, I face two problems:

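First: the case 'country C' is bimodal, so its key holds two entries. I want the key to refer to each entry with equal probability. The real data set has groups with more than two modes, so the entry can be a list of length greater than 2.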

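Second: I am stuck on replacing the NaN values in "city" with the value in "new_dict_locations" that corresponds to the "country" cell in the same row. In pseudocode: "Loop over the 'city' column; if you find a missing value at position 'temp[i, city]', take the 'country' value in that row (-> 'country_tmp') and use it as the key into the dictionary 'new_dict_locations'; if the dictionary entry for the key 'country_tmp' is a list, pick a random item from that list; take the returned value (-> 'city_tmp') and fill the missing cell (temp[i, city]) with it."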
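The pseudocode above can be written out almost literally. The following is a minimal sketch, not a recommended solution: it assumes every country has at least one non-missing city, stores the modes as lists (a stand-in for the dictionary built earlier), and uses a seeded NumPy generator `rng` purely so the run is repeatable.

```python
import numpy as np
import pandas as pd

temp = pd.DataFrame({
    "country": ["country A"] * 4 + ["country B"] * 4 + ["country C"] * 4,
    "city": ["city 1", "city 2", np.nan, "city 2",
             "city 3", "city 3", np.nan, "city 4",
             "city 5", np.nan, np.nan, "city 6"],
})

# Stand-in for new_dict_locations: country -> list of tied modes.
new_dict_locations = {
    country: cities.mode().tolist()
    for country, cities in temp.groupby("country")["city"]
}

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Visit only the rows whose city is missing.
for i in temp.index[temp["city"].isna()]:
    country_tmp = temp.at[i, "country"]
    candidates = new_dict_locations[country_tmp]
    # Uniform draw; with a single mode this always returns that mode.
    city_tmp = str(rng.choice(candidates))
    temp.at[i, "city"] = city_tmp

print(temp)
```

A row-by-row loop like this is easy to follow but slow on large frames; the `map`-based answer further down does the same thing in a vectorized way.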

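I have tried different combinations of .fillna() and .replace() (and read this and other questions), but to no avail. Could someone give me a pointer?

Many thanks in advance.

(Note: the referenced question replaces values in a cell based on a dict; my reference values, however, are in another column.)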

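**EDIT** Executing temp["city"].fillna(temp['country'], inplace=True) followed by temp.replace({'city': dict_locations}) gives me an error: TypeError: unhashable type: 'dict'. (For the original data set the error is TypeError: unhashable type: 'numpy.ndarray', but I cannot reproduce that with the example. If someone knows where the difference comes from, I would be more than happy to hear their thoughts.)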

- Ivo

What do you mean by "I want the key to refer to each entry with equal probability"? Can you give the expected output for the given case? - Parth

When I look up "country C" in the dictionary, I want it to pick randomly from ["city 5", "city 6"]. - Ivo

2 Answers

2

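def get_mode(d):
    # Where a country has several tied modes (stored as an ndarray), pick one at random.
    for k, v in d.items():
        if isinstance(v, np.ndarray) and len(v) > 1:
            # Uniform by default; a fixed p of 0.5 per entry only sums to 1 for exactly two modes.
            d[k] = np.random.choice(v)
    return d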

The following dictionary is then used for the filling:

new_dict_locations=get_mode(new_dict_locations)
keys=list(new_dict_locations.keys())
values=list(new_dict_locations.values())

# Filling happens here
temp.city=temp.city.fillna(temp.country).replace(keys, values)

This produces the desired output:

      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 5
10  country C  city 5
11  country C  city 6

- Parth

- Andy L. · Accepted Answer

Use the dictionary `new_dict_locations` with `map` to create a new series `s`, then `map` again with `np.random.choice` to pick a value wherever the entry is an array. Finally, call `fillna` with `s`.

s = (temp.country.map(new_dict_locations)
                 .map(lambda x: np.random.choice(x) if isinstance(x, np.ndarray) else x))

temp['city'] = temp.city.fillna(s)    

Out[247]:
      country    city
0   country A  city 1
1   country A  city 2
2   country A  city 2
3   country A  city 2
4   country B  city 3
5   country B  city 3
6   country B  city 3
7   country B  city 4
8   country C  city 5
9   country C  city 6
10  country C  city 5
11  country C  city 6

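Note: I thought the two maps could be merged into one with a dict comprehension. However, doing so loses the randomness: the comprehension draws once per country, so every missing row of that country would get the same city.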
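The difference described in the note can be seen in a small sketch. This is an illustration under assumptions rather than the answer's exact code: the `modes` dictionary is a hand-written stand-in that stores lists instead of arrays, the lookup uses a plain function passed to `map`, and `rng` is a seeded NumPy generator added for repeatability.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the country -> modes dictionary.
modes = {
    "country A": ["city 2"],
    "country B": ["city 3"],
    "country C": ["city 5", "city 6"],
}
rng = np.random.default_rng(0)  # arbitrary seed

# Many rows from the bimodal country, to make the effect visible.
country = pd.Series(["country C"] * 1000)

# Dict comprehension: the random draw runs once per key,
# so all rows of a country receive the same city.
once = {k: str(rng.choice(v)) for k, v in modes.items()}
per_country = country.map(once)

# Function applied through map: the random draw runs once per row.
per_row = country.map(lambda c: str(rng.choice(modes[c])))

print(per_country.nunique())  # 1
print(per_row.nunique())      # 2 for any realistic draw over 1000 rows
```

Which behavior is preferable depends on the goal: one draw per country keeps filled values consistent within a country, while one draw per row spreads the tied modes evenly, which is what the question asks for.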