Parallel execution of pandas DataFrame applymap

Question

python pandas dataframe parallel-processing python-multiprocessing

3

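I have the following function, which applies a set of regular expressions to every element of a DataFrame. The DataFrame I am processing is a 5 MB chunk.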

import re
from functools import partial

def apply_all_regexes(data, regexes):
    # apply every regex to every cell of the pandas DataFrame
    regex_applied = data.applymap(
        partial(apply_re_to_cell, regexes))
    return regex_applied

def apply_re_to_cell(regexes, cell):
    # collect the matches of every regex against the cell's string form
    cell = str(cell)
    regex_matches = []
    for regex in regexes:
        regex_matches.extend(re.findall(regex, cell))
    return regex_matches

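Because applymap runs serially, the processing time is roughly (number of elements) * (time to run all the regexes serially on one element). Is there a way to run this in parallel? I tried ProcessPoolExecutor, but it seemed to take longer than the serial version.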

- Sushim Mukul Dutta

2 Answers

0

A slightly more modern version:

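from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

# register the progress_* methods (including progress_applymap) on pandas objects
tqdm.pandas()

def parallel_applymap(df, func, worker_count):
    def _apply(shard):
        return shard.progress_applymap(func)

    # one row-wise shard per worker thread
    shards = np.array_split(df, worker_count)
    with ThreadPoolExecutor(max_workers=worker_count) as e:
        futures = e.map(_apply, shards)
    # leaving the with block waits for all shards; reassemble them in order
    return pd.concat(list(futures))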

- gebbissimo

- Mibi · Accepted Answer

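Have you tried splitting the one large DataFrame into as many small DataFrames as you have workers, applying the regex map to each of them in parallel, and then concatenating the small DataFrames back together?

I did something similar with a DataFrame of gene expression data. I would run it on a small scale first and check that you get the expected output.

Unfortunately, I don't have enough reputation to comment.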

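from multiprocessing import Pool, cpu_count

import numpy as np
import pandas as pd

num_cores = cpu_count()      # worker processes (assumed default, not given in the answer)
num_partitions = num_cores   # chunks to split into (assumed default, tune as needed)

def parallelize_dataframe(df, func):
    # split the DataFrame row-wise into num_partitions chunks
    df_split = np.array_split(df, num_partitions)
    pool = Pool(num_cores)
    for x in df_split:
        print(x.shape)  # show the size of each chunk
    # run func on each chunk in a worker process, then reassemble
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df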

This is the general-purpose function I used.
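For the regex case in the question, the split / map / concat pattern might look roughly like the sketch below. It is not the code from either poster: the names `find_all_in_cell`, `apply_patterns_to_chunk`, `parallel_regex_applymap` and the `n_workers` parameter are hypothetical, and it assumes the work is CPU-bound enough that separate processes pay off. Each worker receives a whole chunk of rows (not a single cell) and compiles the patterns once, which keeps the per-task overhead small.

```python
import re
from functools import partial
from multiprocessing import Pool

import numpy as np
import pandas as pd


def find_all_in_cell(patterns, cell):
    """Return all matches of every compiled pattern in one cell."""
    text = str(cell)
    matches = []
    for pattern in patterns:
        matches.extend(pattern.findall(text))
    return matches


def apply_patterns_to_chunk(regexes, chunk):
    """Worker: compile the regexes once, then map over every cell of the chunk."""
    patterns = [re.compile(r) for r in regexes]
    cell_func = partial(find_all_in_cell, patterns)
    # Column-wise Series.map is an element-wise equivalent of applymap.
    return chunk.apply(lambda col: col.map(cell_func))


def parallel_regex_applymap(df, regexes, n_workers=4):
    """Split df row-wise, process the chunks in worker processes, reassemble."""
    n_workers = min(n_workers, len(df))  # avoid empty chunks
    if n_workers <= 1:
        # Not worth starting a pool: run serially in this process.
        return apply_patterns_to_chunk(regexes, df)
    # Row boundaries of n_workers roughly equal slices.
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    chunks = [df.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
    with Pool(n_workers) as pool:
        results = pool.map(partial(apply_patterns_to_chunk, regexes), chunks)
    # pool.map preserves chunk order, so concat restores the original row order.
    return pd.concat(results)
```

When run as a script, the call to `parallel_regex_applymap` should sit under an `if __name__ == "__main__":` guard, which multiprocessing requires on platforms that start workers by spawning a fresh interpreter. For a chunk as small as 5 MB, the cost of starting processes and pickling the chunks can outweigh the gain, so it is worth timing this against the serial version first.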