How to find the value in an entire column that is closest to a given element in Pandas?

Question

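I got a table from here: https://deepnote.com/project/vacunacion-en-Espana-vs-covid-19-UhxqL4bdSfGJjeyb1WDM6A/%2Fnotebook.ipynb. It is a Jupyter Notebook that downloads Spain's vaccination data every day and converts it into several evolution tables. The table in question is:

Day	admin doses	complete dosis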
210104	82834	0
210107	207323	0
210108	277976	0
210111	406091	0
210112	488041	0
210113	581638	0
210114	676186	0
210115	768950	0
210118	897942	4630
210119	966097	18682
210120	1025937	31284
210121	1103301	98112
210122	1165825	136912
210125	1237593	177396
210126	1291216	247394
210127	1356461	346132
210128	1395618	385518
210129	1474189	503732
210201	1609261	715784
210202	1673054	837038
210203	1764778	997956
210204	1865342	1172244
210205	1988160	1365818
210208	2105033	1572814
210209	2167241	1677564
210210	2233249	1779366
210211	2320507	1886556
210212	2423045	2000970
210215	2561608	2140182
210216	2624512	2193844
210217	2690457	2238360
210218	2782751	2289112
210219	2936011	2342052
210222	3090351	2394122
210223	3165191	2416610
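210224

For each "complete dosis" value, I want to find the closest value in the whole "admin doses" column, in order to know how many days it takes to complete a vaccination. For example, on 210129 there were 503732 complete doses, and the closest value is the 488041 administered doses on 210112, so 17 days passed between administering 488041 doses and completing 503732 doses. I have tried many options, but neither plain pandas nor pandas combined with numpy worked. In Excel it can be done like this: `{=INDEX(A$2:A$56;MATCH(MIN(ABS(B$2:B$56-C7));ABS(B$2:B$56-C7);0))}` but I have not been able to translate it into Pandas. Thanks in advance for any help.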

- Juan Luis Chulilla

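Why are we using C7? If I load this into Excel, C7 would be 0. For your example, I assume we would be looking at C18? - Joe Ferndz

Also, for a numpy version of this problem, see this answer. - Umar.H

2 Answers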

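import numpy as np
import pandas as pd

# assumes df holds the table above with columns "Day", "admin doses",
# "complete dosis" and a default integer index

def find_nearest(series):
    # series is one row of the reset_index() frame: labels "index" and "complete dosis"
    current_idx = series.loc["index"]
    # position of the "admin doses" value closest to this row's "complete dosis"
    nearest_idx = np.abs(df["admin doses"] - series.loc["complete dosis"]).argmin()
    # days elapsed between the matched day and the current day
    day_diff = (days_in_dt[current_idx] - days_in_dt[nearest_idx]).days
    return day_diff

# convert Day column to Timestamps
days_in_dt = pd.to_datetime(df.Day, format="%y%m%d")

# the result
df["complete dosis"].reset_index().apply(find_nearest, axis=1)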

0      0
1      3
2      4
3      7
4      8
5      9
6     10
7     11
8     14
9     15
10    16
11    17
12    18
13    18
14    18
15    16
16    17
17    17
18    18
19    15
20    14
21    13
22     9
23     7
24     7
25     7
26     7
27     7
28     6
29     7
30     7
31     7
32     8
33    10
34    11
35    12
36    13
37    14
38    14
39    15
40    16
41    16
42    16
43    18
44    19
45    19
46    20
47    17
48    18
49    19
50    19
51    17
52    18
53    19
54    19
dtype: int64

We apply a function that finds the difference in days to the nearest value. The value passed to the function has this form:

index                8
complete dosis    4630
Name: 8, dtype: int64

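This lets the function read each row's index label and use it later to compute the difference. Calling reset_index before apply is what makes this index information part of the series passed to the function.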

- Mustafa Aydın
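
As an aside, the same lookup can be sketched without apply by comparing every "complete dosis" value against every "admin doses" value at once, which is close in spirit to the Excel array formula in the question. This is only a sketch: it uses the first 18 rows of the table (210104 through 210129), the names `diff`, `nearest_pos` and `day_diff` are illustrative, and the pairwise matrix needs n × n memory, so it suits small tables like this one.

```python
import numpy as np
import pandas as pd

# First 18 rows of the question's table (210104 through 210129).
df = pd.DataFrame({
    "Day": [210104, 210107, 210108, 210111, 210112, 210113, 210114, 210115, 210118,
            210119, 210120, 210121, 210122, 210125, 210126, 210127, 210128, 210129],
    "admin doses": [82834, 207323, 277976, 406091, 488041, 581638, 676186, 768950,
                    897942, 966097, 1025937, 1103301, 1165825, 1237593, 1291216,
                    1356461, 1395618, 1474189],
    "complete dosis": [0, 0, 0, 0, 0, 0, 0, 0, 4630, 18682, 31284, 98112, 136912,
                       177396, 247394, 346132, 385518, 503732],
})

# Parse the yymmdd integers as dates (cast to str first so the format applies).
days = pd.to_datetime(df["Day"].astype(str), format="%y%m%d")
admin = df["admin doses"].to_numpy()
complete = df["complete dosis"].to_numpy()

# diff[i, j] = |complete[i] - admin[j]| for every pair of rows.
diff = np.abs(complete[:, None] - admin[None, :])

# For each row i, the position j of the closest "admin doses" value.
nearest_pos = diff.argmin(axis=1)

# Days between each row's date and the date of its closest match.
day_diff = (days - days.iloc[nearest_pos].to_numpy()).dt.days
print(day_diff.iloc[17])  # row for 210129
```

For the 210129 row this picks the 210112 row (488041 doses), matching the 17-day example in the question.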

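Now, when I test it, I get this error: TypeError: reduction operation 'argmin' not allowed for this dtype. This message is supposed to appear when argmin is used on non-numeric values, but as far as I can tell that is not the case here. Where is the problem? Thanks again for your help! - Juan Luis Chulilla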

@JuanLuisChulilla Perhaps try making sure the columns are integers before applying the function, with df[["admin doses", "complete dosis"]] = df[["admin doses", "complete dosis"]].astype(int)? - Mustafa Aydın
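
A quick way to check whether dtypes are the culprit is to print them before and after the conversion. The sketch below uses a hypothetical three-row miniature of the table in which the numbers arrived as strings (an assumption for illustration; the thread does not show how the real columns were stored).

```python
import pandas as pd

# Hypothetical miniature of the table: numbers stored as strings,
# which gives the columns the generic object dtype.
df = pd.DataFrame(
    {
        "admin doses": ["82834", "207323", "277976"],
        "complete dosis": ["0", "0", "4630"],
    },
    dtype=object,
)

print(df.dtypes)  # both columns report object here

# Convert both columns to integers, as suggested in the comment above.
cols = ["admin doses", "complete dosis"]
df[cols] = df[cols].astype(int)

print(df.dtypes)  # both columns are now an integer dtype
```

If some cells hold values that cannot be parsed as integers, astype(int) raises an error, which at least points to the offending data.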

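Thank you very much! It works like a charm. Now I'm trying to understand how series.loc points to each individual element of "complete dosis". Obviously df.column refers to the whole column, but even after reading the Pandas documentation I haven't worked out how series.loc behaves in this case. - Juan Luis Chulilla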

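@JuanLuisChulilla We call .apply on a DataFrame with axis=1. (df["complete dosis"] is a Series by itself, but after reset_index it becomes a DataFrame; you can run that part on its own to see what it looks like.) axis=1 means pandas passes the rows of this DataFrame to find_nearest one at a time. As the last part of the answer says, the function sees a Series of that form each time, that is, the DataFrame's column names become the Series' index labels. .loc is then used to access the corresponding values, e.g. series.loc["index"] is 8 for the example shown above, and so on. - Mustafa Aydın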
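
To see this row-by-row behaviour in isolation, here is a minimal sketch with three made-up rows; `describe` is a hypothetical helper that only reports what it receives.

```python
import pandas as pd

complete = pd.Series([0, 4630, 18682], name="complete dosis")

# reset_index turns the Series into a two-column DataFrame:
# "index" (the old index labels) and "complete dosis" (the values).
frame = complete.reset_index()
print(list(frame.columns))

def describe(row):
    # With axis=1, row is a Series whose index labels are the column names,
    # so .loc["index"] and .loc["complete dosis"] pick single values.
    return f"row {row.loc['index']} holds {row.loc['complete dosis']}"

result = frame.apply(describe, axis=1)
print(result.tolist())
```

Each call receives one row, so series.loc["complete dosis"] is a single number rather than the whole column.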

Thank you!! Many thanks!! Sukran!! - Juan Luis Chulilla

- zipa · Accepted Answer

To find matches like this, you can use merge_asof:

result = pd.merge_asof(df, df.sort_values('admin doses'), left_on='complete dosis', right_on='admin doses')

This only sets up the matched values for you; from there you can compute whatever you need.
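
Note that merge_asof matches backward by default, that is, it takes the last "admin doses" value that does not exceed each "complete dosis" value. A possible way to get the closest value in either direction and then the day difference is sketched below. It assumes the first 18 rows of the table, passes direction="nearest", and introduces the helper columns "date", "nearest date" and "days" purely for illustration.

```python
import pandas as pd

# First 18 rows of the question's table (210104 through 210129).
df = pd.DataFrame({
    "Day": [210104, 210107, 210108, 210111, 210112, 210113, 210114, 210115, 210118,
            210119, 210120, 210121, 210122, 210125, 210126, 210127, 210128, 210129],
    "admin doses": [82834, 207323, 277976, 406091, 488041, 581638, 676186, 768950,
                    897942, 966097, 1025937, 1103301, 1165825, 1237593, 1291216,
                    1356461, 1395618, 1474189],
    "complete dosis": [0, 0, 0, 0, 0, 0, 0, 0, 4630, 18682, 31284, 98112, 136912,
                       177396, 247394, 346132, 385518, 503732],
})
df["date"] = pd.to_datetime(df["Day"].astype(str), format="%y%m%d")

# merge_asof needs both sides sorted by their join keys.
# A stable sort keeps the original order among the tied zero rows.
left = df[["date", "complete dosis"]].sort_values("complete dosis", kind="stable")
right = (
    df[["date", "admin doses"]]
    .rename(columns={"date": "nearest date"})
    .sort_values("admin doses")
)

# direction="nearest" picks the closest key on either side.
result = pd.merge_asof(
    left,
    right,
    left_on="complete dosis",
    right_on="admin doses",
    direction="nearest",
)

# Days between each row's date and the date of its matched "admin doses" row.
result["days"] = (result["date"] - result["nearest date"]).dt.days
print(result[["date", "complete dosis", "admin doses", "days"]].tail())
```

Renaming the right-hand date column before the merge avoids the automatic _x and _y suffixes that appear when both sides share column names.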