Problem
I currently have a pandas DataFrame of property information from this Kaggle dataset. Here is a sample of the DataFrame:
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Annadale | 5 | 5425 | 2015 | ... |
| Woodside | 4 | 2327 | 1966 | ... |
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 405 | 1996 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
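What I want to do is take each row with a zero in the "year built" column and replace that zero with the mean of the "year built" values from rows that share the same neighborhood, borough, and block. In some cases a {neighborhood, borough, block} set contains several rows with a zero in "year built", as the sample DataFrame above shows.
To illustrate the problem, here are those two rows from the sample DataFrame: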
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 0 | ... |
| Alphabet City | 1 | 396 | 0 | ... |
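To fix these rows, I want to fill in each zero with the mean "year built" of all other rows that have the same neighborhood, borough, and block. For the example rows, the neighborhood is Alphabet City, the borough is 1, and the block is 396, so I would compute the mean from the following matching rows of the sample DataFrame: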
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1985 | ... |
| Alphabet City | 1 | 396 | 1986 | ... |
| Alphabet City | 1 | 396 | 1992 | ... |
| Alphabet City | 1 | 396 | 1990 | ... |
| Alphabet City | 1 | 396 | 1984 | ... |
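I would take the mean of the "year built" column over those rows (1987.4) and replace the zeros with it. The rows that originally held zeros would then look like this: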
| neighborhood | borough | block | year built | ... |
------------------------------------------------------
| Alphabet City | 1 | 396 | 1987.4 | ... |
| Alphabet City | 1 | 396 | 1987.4 | ... |
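To make the worked example above concrete, here is a minimal sketch that rebuilds the sample rows and fills the zeros for this single set by hand. The frame name `sample`, the helper masks, and the upper-case column names (chosen to match my code further down) are assumptions made for illustration only; this is not the general solution I am asking about.

```python
import pandas as pd

# Sample rows from the tables above; column names are assumed to be upper-case.
sample = pd.DataFrame({
    "NEIGHBORHOOD": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "BOROUGH": [5, 4] + [1] * 8,
    "BLOCK": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "YEAR BUILT": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

# Rows that belong to the {Alphabet City, 1, 396} set.
in_group = (
    (sample["NEIGHBORHOOD"] == "Alphabet City")
    & (sample["BOROUGH"] == 1)
    & (sample["BLOCK"] == 396)
)
# Rows whose year is actually known (non-zero).
known = sample["YEAR BUILT"] > 0

# Mean over the five matching rows with a known year.
group_mean = sample.loc[in_group & known, "YEAR BUILT"].mean()
print(group_mean)  # 1987.4

# Fill the zero rows of this one set by hand.
filled = sample.copy()
filled["YEAR BUILT"] = filled["YEAR BUILT"].astype(float)
filled.loc[in_group & ~known, "YEAR BUILT"] = group_mean
```

Doing this by hand for one set is easy; what I need is the same thing for every set at once.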
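My current code
All I have done so far is drop the rows with a zero in the "year built" column and compute the mean year for each {neighborhood, borough, block} set. The original DataFrame is stored in raw_data and looks like the sample DataFrame at the top of this post. The code is: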
# create a copy of the data
temp_data = raw_data.copy()
# remove all rows with zero in the "year built" column
mean_year_by_location = temp_data[temp_data["YEAR BUILT"] > 0]
# group the rows into {neighborhood, borough, block} sets and take the mean of the "year built" column in those sets
mean_year_by_location = mean_year_by_location.groupby(["NEIGHBORHOOD", "BOROUGH", "BLOCK"], as_index=False)["YEAR BUILT"].mean()
The output looks like this:
| neighborhood | borough | block | year built |
------------------------------------------------
| .... | ... | ... | ... |
| Alphabet City | 1 | 390 | 1985.342 |
| Alphabet City | 1 | 391 | 1986.76 |
| Alphabet City | 1 | 392 | 1992.8473 |
| Alphabet City | 1 | 393 | 1990.096 |
| Alphabet City | 1 | 394 | 1984.45 |
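For anyone who wants to reproduce this without downloading the dataset, here is a self-contained sketch that runs the same filter-and-group step on the sample rows from the top of the post. The hand-built `raw_data` below is a stand-in for the real frame, and the upper-case column names are assumed to match the real data.

```python
import pandas as pd

# Stand-in for the real raw_data, built from the sample rows above.
raw_data = pd.DataFrame({
    "NEIGHBORHOOD": ["Annadale", "Woodside"] + ["Alphabet City"] * 8,
    "BOROUGH": [5, 4] + [1] * 8,
    "BLOCK": [5425, 2327, 396, 405, 396, 396, 396, 396, 396, 396],
    "YEAR BUILT": [2015, 1966, 1985, 1996, 1986, 1992, 0, 1990, 1984, 0],
})

# Drop rows with a zero year, then average per {neighborhood, borough, block} set.
known_years = raw_data[raw_data["YEAR BUILT"] > 0]
mean_year_by_location = known_years.groupby(
    ["NEIGHBORHOOD", "BOROUGH", "BLOCK"], as_index=False
)["YEAR BUILT"].mean()

print(mean_year_by_location)
```

On this sample the result has one row per set, with 1987.4 for {Alphabet City, 1, 396}. Note that a set whose rows are all zero would not appear in this result at all, because those rows are filtered out before grouping.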
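So how can I use the mean "year built" values in the mean_year_by_location DataFrame to replace the zeros in the original raw_data DataFrame?
Apologies for the long post. I just wanted to be as clear as possible.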