Pandas - Improving the performance of the apply method

Question

4

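I have a scenario where I need to transform the values of one column based on the value of another column in the same row and on values in a second DataFrame.

Example: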

print(parent_df)
       school         location      modifed_date
0      school_1       New Delhi     2020-04-06
1      school_2       Kolkata       2020-04-06
2      school_3       Bengaluru     2020-04-06
3      school_4       Mumbai        2020-04-06
4      school_5       Chennai       2020-04-06

print(location_df)
       school          location     
0      school_10       New Delhi
1      school_20       Kolkata     
2      school_30       Bengaluru
3      school_40       Mumbai       
4      school_50       Chennai

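For this use case, I need to transform the school names in parent_df based on the location column of the same DataFrame and the location attribute in location_df. To do this, I wrote the following method.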

def transform_school_name(row, location_df):
    name_alias = location_df[location_df['location'] == row['location']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']

This is how I call the method:

parent_df['school'] = parent_df.apply(UtilityMethods.transform_school_name, args=(self.location_df,), axis=1)

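The problem is that for only 46,000 records the whole transformation takes about 2 minutes, which is far too slow. How can I improve the performance of this solution?

Edit

Below is the actual scenario I am dealing with, where a small transformation is needed before the value in the original column can be replaced. I am not sure whether this can be done inside the replace() method mentioned in one of the answers below.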

print(parent_df)
       school         location                  modifed_date    type
0      school_1       _pre_New Delhi_post       2020-04-06      Govt
1      school_2       _pre_Kolkata_post         2020-04-06      Private
2      school_3       _pre_Bengaluru_post       2020-04-06      Private
3      school_4       _pre_Mumbai_post          2020-04-06      Govt
4      school_5       _pre_Chennai_post         2020-04-06      Private

print(location_df)
       school          location     type
0      school_10       New Delhi    Govt
1      school_20       Kolkata      Private
2      school_30       Bengaluru    Private

The custom method:

def transform_school_name(row, location_df):
    location_values = row['location'].split('_')
    name_alias = location_df[location_df['location'] == location_values[1]]
    name_alias = name_alias[name_alias['type'] == location_df['type']]
    if len(name_alias) > 0:
        return location_df.school.iloc[0]
    else:
        return row['school']


This is the actual scenario I need to handle, so the replace() method does not help here.

- Mitaksh Gupta

4 Answers

2

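As far as I can tell, this is more of a regex problem, because the values do not match exactly. First extract the required pattern, build a mapping from the locations in parent_df to location_df, and then map the values.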

pat =  '.*?' + '(' + '|'.join(location_df['location']) + ')' + '.*?' 

mapping = parent_df['location'].str.extract(pat)[0].map(location_df.set_index('location')['school'])

parent_df['school'] = mapping.combine_first(parent_df['school'])
parent_df


    school      location            modifed_date    type
0   school_10   _pre_New Delhi_post 2020-04-06      Govt
1   school_20   _pre_Kolkata_post   2020-04-06      Private
2   school_30   _pre_Bengaluru_post 2020-04-06      Private
3   school_4    _pre_Mumbai_post    2020-04-06      Govt
4   school_5    _pre_Chennai_post   2020-04-06      Private
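
The pattern above joins the location names directly into a regular expression, so a name containing characters such as "." or "(" would be interpreted as regex syntax. A minimal sketch of the same approach with each name escaped first, assuming the sample data from the question; the variable names city and result are illustrative only, and like the code above it ignores the type column:

```python
import re

import pandas as pd

parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3", "school_4", "school_5"],
    "location": ["_pre_New Delhi_post", "_pre_Kolkata_post",
                 "_pre_Bengaluru_post", "_pre_Mumbai_post", "_pre_Chennai_post"],
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20", "school_30"],
    "location": ["New Delhi", "Kolkata", "Bengaluru"],
})

# Escape each city so that regex metacharacters are matched literally.
pat = "(" + "|".join(re.escape(c) for c in location_df["location"]) + ")"

# extract() returns one column per capture group; column 0 holds the city.
city = parent_df["location"].str.extract(pat)[0]

# Look up the replacement school per city; rows without a match become NaN.
mapping = city.map(location_df.set_index("location")["school"])

# Fall back to the original school wherever no replacement was found.
result = mapping.combine_first(parent_df["school"])
print(result.tolist())
```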

- Vaishali

2

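As I understand it, the task is to perform the following update:

For each row in parent_df,
find the row in location_df that matches on location (the city part of the location column) and on type;
if one is found, overwrite school in parent_df with school from the row just found.

To do this, proceed as follows:

Step 1: Generate a MultiIndex to locate school names by city and school type: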

ind = pd.MultiIndex.from_arrays([parent_df.location.str
    .split('_', expand=True)[2], parent_df.type])

For your sample data, the result is:

MultiIndex([('New Delhi',    'Govt'),
            (  'Kolkata', 'Private'),
            ('Bengaluru', 'Private'),
            (   'Mumbai',    'Govt'),
            (  'Chennai', 'Private')],
           names=[2, 'type'])

Don't worry about the odd name of the first level (2); it will disappear soon.

Step 2: Generate the list of "new" school names:

locList = location_df.set_index(['location', 'type']).school[ind].tolist()

The result is:

['school_10', 'school_20', 'school_30', nan, nan]

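A replacement was found for the first three schools, but not for the last two.

Step 3: Perform the actual update with the list above, using a "not-null" mask: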

parent_df.school = parent_df.school.mask(pd.notnull(locList), locList)
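
Steps 1 to 3 can be sketched end to end as follows. This is a variant under stated assumptions rather than the exact code above: it uses reindex() for the lookup, because depending on the pandas version, selecting with [...] using labels that are absent from the index may raise a KeyError instead of returning NaN, whereas reindex() returns NaN for missing labels. The names lookup, found and new_school are illustrative only.

```python
import pandas as pd

parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3", "school_4", "school_5"],
    "location": ["_pre_New Delhi_post", "_pre_Kolkata_post",
                 "_pre_Bengaluru_post", "_pre_Mumbai_post", "_pre_Chennai_post"],
    "type": ["Govt", "Private", "Private", "Govt", "Private"],
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20", "school_30"],
    "location": ["New Delhi", "Kolkata", "Bengaluru"],
    "type": ["Govt", "Private", "Private"],
})

# Step 1: one (city, type) key per row of parent_df.
ind = pd.MultiIndex.from_arrays(
    [parent_df["location"].str.split("_", expand=True)[2], parent_df["type"]]
)

# Step 2: look up every key; keys missing from location_df give NaN.
lookup = location_df.set_index(["location", "type"])["school"]
found = lookup.reindex(ind).to_numpy()

# Step 3: overwrite school only where a replacement was found.
new_school = parent_df["school"].mask(pd.notna(found), found)
print(new_school.tolist())
```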

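Execution speed

Because it relies on vectorized operations and index lookups, my code runs much faster than calling apply on each row.

For example, I replicated your parent_df 10,000 times and used %timeit to measure the execution time of your code (actually a slightly modified version, described below) and of mine.

To allow repeated runs, I changed both versions so that they set a school_2 column and leave school unchanged.

Your code ran in 34.9 s, whereas mine took only 161 ms, roughly 217 times faster.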
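
A comparison of this kind can be reproduced outside IPython with the standard timeit module. The sketch below is only an illustration: the replication factor of 200 and the function names with_apply and vectorized are arbitrary choices, the row-wise function already includes the corrections described later in this answer, and the timings will differ from machine to machine.

```python
import timeit

import pandas as pd

base = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3", "school_4", "school_5"],
    "location": ["_pre_New Delhi_post", "_pre_Kolkata_post",
                 "_pre_Bengaluru_post", "_pre_Mumbai_post", "_pre_Chennai_post"],
    "type": ["Govt", "Private", "Private", "Govt", "Private"],
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20", "school_30"],
    "location": ["New Delhi", "Kolkata", "Bengaluru"],
    "type": ["Govt", "Private", "Private"],
})

# Repeat the five sample rows; 200 copies keeps the apply version quick to run.
parent_df = pd.concat([base] * 200, ignore_index=True)


def transform_school_name(row, location_df):
    """Row-wise lookup on city and type, falling back to the original school."""
    city = row["location"].split("_")[2]
    name_alias = location_df[location_df["location"].eq(city)
                             & location_df["type"].eq(row["type"])]
    return name_alias["school"].iloc[0] if len(name_alias) > 0 else row["school"]


def with_apply():
    return parent_df.apply(transform_school_name, args=(location_df,), axis=1)


def vectorized():
    keys = pd.MultiIndex.from_arrays(
        [parent_df["location"].str.split("_", expand=True)[2], parent_df["type"]]
    )
    found = location_df.set_index(["location", "type"])["school"].reindex(keys)
    found.index = parent_df.index  # same row order as parent_df
    return found.combine_first(parent_df["school"])


t_apply = timeit.timeit(with_apply, number=3)
t_vec = timeit.timeit(vectorized, number=3)
print(f"apply: {t_apply:.3f} s, vectorized: {t_vec:.3f} s")
```

Both functions return a new Series instead of assigning to school, so they can be run repeatedly on the same data.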

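A faster version

If parent_df has a default index (consecutive numbers starting from 0), the whole operation can be performed with a single instruction: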

parent_df.school = location_df.set_index(['location', 'type']).school[
    pd.MultiIndex.from_arrays(
        [parent_df.location.str.split('_', expand=True)[2],
         parent_df.type])
    ]\
    .reset_index(drop=True)\
    .combine_first(parent_df.school)

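Steps:

location_df.set_index(...) - set the index to the two "criterion" columns.
.school - keep only the school column (with the index above).
[...] - retrieve the elements indicated by the MultiIndex defined inside.
pd.MultiIndex.from_arrays( - create the MultiIndex.
parent_df.location.str.split('_', expand=True)[2] - the first level of the MultiIndex is the "city" part of location.
parent_df.type - the second level of the MultiIndex is type.
reset_index(...) - change the MultiIndex to the default index (the index is now the same as in parent_df).
combine_first(...) - overwrite the NaN values in the result generated so far with the original values from school.
parent_df.school = - save the result back to the school column. For testing purposes, to check the execution speed, you can change it to parent_df['school_2'].

By my measurement, the execution time is 9% shorter than for my original solution.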

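Corrections to your code

Take a look at location_values[1]. It retrieves the pre segment, whereas it should retrieve the next segment (the city name).
There is no need to create a temporary result based on the first condition and then narrow it down by filtering on the second. Both conditions (equality of location and of type) can be applied in a single instruction, so the execution time is slightly shorter.
The value returned in the "positive" case should come from name_alias, not from location_df.

So if for some reason you want to keep your code, change the relevant fragment to: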

name_alias = location_df[location_df['location'].eq(location_values[2]) &
    location_df['type'].eq(row.type)]
if len(name_alias) > 0:
    return name_alias.school.iloc[0]
else:
    return row['school']

- Valdi_Bo

0

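If I understand the problem correctly, what you are implementing with apply is a kind of join. Pandas excels at vectorized operations, and its C-based join ("merge") is almost certainly faster than a Python/apply-based one. I therefore suggest trying the following solution: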

parent_df["location_short"] = parent_df.location.str.split("_", expand=True)[2]
parent_df = pd.merge(parent_df, location_df, how = "left", left_on=["location_short", "type"], 
                     right_on=["location", "type"], suffixes = ["", "_by_location"])

parent_df.loc[parent_df.school_by_location.notna(), "school"] = \
      parent_df.loc[parent_df.school_by_location.notna(), "school_by_location"]

As far as I can tell, it produces what you need.
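
The code above leaves the helper columns location_short and school_by_location in parent_df. A minimal sketch of the same join that works on a copy and drops the helpers afterwards, assuming the sample data from the question; the names left, merged and result are illustrative, and validate="many_to_one" is an optional safeguard that raises an error if a (location, type) pair appears more than once in location_df, which would otherwise duplicate rows:

```python
import pandas as pd

parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3", "school_4", "school_5"],
    "location": ["_pre_New Delhi_post", "_pre_Kolkata_post",
                 "_pre_Bengaluru_post", "_pre_Mumbai_post", "_pre_Chennai_post"],
    "type": ["Govt", "Private", "Private", "Govt", "Private"],
})
location_df = pd.DataFrame({
    "school": ["school_10", "school_20", "school_30"],
    "location": ["New Delhi", "Kolkata", "Bengaluru"],
    "type": ["Govt", "Private", "Private"],
})

# Derive the join key on a copy so that parent_df itself is left untouched.
left = parent_df.assign(
    location_short=parent_df["location"].str.split("_", expand=True)[2]
)

# Rename the right-hand columns up front so no suffixes are needed.
right = location_df.rename(
    columns={"location": "location_short", "school": "school_by_location"}
)

merged = left.merge(
    right,
    how="left",
    on=["location_short", "type"],
    validate="many_to_one",  # fail loudly on duplicate keys in location_df
)

# Prefer the looked-up name; keep the original where the join found nothing.
merged["school"] = merged["school_by_location"].fillna(merged["school"])
result = merged.drop(columns=["location_short", "school_by_location"])
print(result)
```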

- Roy2012

- Quang Hoang · Accepted Answer

You can use map/replace:

parent_df['school'] = parent_df.location.replace(location_df.set_index('location')['school'])

Output:

      school   location modifed_date
0  school_10  New Delhi   2020-04-06
1  school_20    Kolkata   2020-04-06
2  school_30  Bengaluru   2020-04-06
3  school_40     Mumbai   2020-04-06
4  school_50    Chennai   2020-04-06
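
Note that replace() leaves unmatched values unchanged, so a location with no entry in location_df would end up as the school name. A minimal sketch of the map() variant that falls back to the original school instead, assuming a hypothetical location_df that covers only three of the five cities:

```python
import pandas as pd

parent_df = pd.DataFrame({
    "school": ["school_1", "school_2", "school_3", "school_4", "school_5"],
    "location": ["New Delhi", "Kolkata", "Bengaluru", "Mumbai", "Chennai"],
})
# Hypothetical: Mumbai and Chennai have no alias.
location_df = pd.DataFrame({
    "school": ["school_10", "school_20", "school_30"],
    "location": ["New Delhi", "Kolkata", "Bengaluru"],
})

lookup = location_df.set_index("location")["school"]

# map() gives NaN for unknown locations, so fillna() can restore the original name.
parent_df["school"] = parent_df["location"].map(lookup).fillna(parent_df["school"])
print(parent_df["school"].tolist())
```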