Python/Pandas: fastest way to count substring occurrences in a DataFrame

I have a DataFrame with about 100,000 rows. The "sequence" column holds long strings as its values. I want to count occurrences of overlapping substrings in the "sequence" column (i.e. if the value is aaaaaa and the substring is aa, the frequency should be 5, not 3).
Below is reproducible code similar to the actual code.
import pandas as pd
import re
import itertools
import time
from random import choice

# generate 100,000 random strings of fixed size for demo purpose
alphabet = "abcdefghijk"
str_list = []

for _ in range(100000):
    str_list.append(''.join(choice(alphabet) for _ in range(100)))
# make pandas dataframe
df = pd.DataFrame(columns=['sequence'], data=str_list)

# get a list of substrings to count its frequency in the dataframe
# for the sake of demo, make substrings from "alphabet" with length of 3
# actual application can have up to substrings of length 5 (i.e. 11^5 substrings)
words = [''.join(p) for p in itertools.product(alphabet, repeat=3)]

# calculate frequency of words in the dataframe
for word in words:
    tic = time.time()
    df['frequency'] = df['sequence'].apply(lambda x: len(re.findall('(?={0})'.format(word), x)))
    print("{}s took for one iteration".format(time.time() - tic))

Note that the built-in Pandas function pd.Series.str.count counts substring occurrences without overlap, so I have to use a regular expression instead.
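For illustration, a minimal sketch of the difference (using a toy Series, not my actual data): a plain pattern counts non-overlapping matches, while a zero-width lookahead counts every starting position.

```python
import re

import pandas as pd

s = pd.Series(["aaaaaa"])

# Non-overlapping count: "aa" matches at positions 0, 2, 4 -> 3
print(s.str.count("aa").iloc[0])  # 3

# Overlapping count via a zero-width lookahead: positions 0..4 -> 5
print(len(re.findall("(?=aa)", "aaaaaa")))  # 5
```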
The problem is that on my machine the frequency calculation takes about 0.5 seconds per substring, and since there are 11^3 to 11^5 substrings, the worst case could take up to 80,000 seconds (roughly 22 hours).
It seems the lambda operation slows the computation down, since it runs as interpreted Python rather than Cython (which is undoubtedly faster), but I don't know any other way to do this.
Is there a way to speed this up?
P.S. I'm using Python 3.5.2
1 Answer
Let's take a step back first: with your approach, you call a lambda function for every row in the Series. On each call it has to compile a regex pattern and use it to search a string. You lose time both in the lambda calls and in the regex compilation.
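One small mitigation (a sketch on toy data, not the approach benchmarked below) is to precompile the pattern once per word with re.compile, so the per-row call reuses the compiled object instead of re-parsing the pattern string; note that Python's re module caches compiled patterns internally, so the gain from this alone is modest.

```python
import re

import pandas as pd

df = pd.DataFrame({"sequence": ["aabaa", "aaaa"]})

word = "aa"
pattern = re.compile(f"(?={word})")  # compiled once, reused for every row

# "aabaa" has overlapping "aa" at positions 0 and 3 -> 2
# "aaaa" has overlapping "aa" at positions 0, 1, 2 -> 3
freq = df["sequence"].apply(lambda x: len(pattern.findall(x)))
print(freq.tolist())  # [2, 3]
```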
Speeding up the lambda calls is achieved by not using a lambda at all! Use a vectorised method instead. You already found Series.str.count, but there is also Series.str.findall, which does the same thing as your lambda. Speed-up: roughly 1.15x. (Mean execution time per word drops from 0.167 s to 0.145 s.)
The real problem is the regular expressions. Compiling a regex is expensive, so we should minimise the number of compilations. One way is to batch the "words" you are looking for into a single regex. In the example below I set the batch size to 100, but you can of course tune it. Speed-up: roughly 33x. (Mean execution time per word drops from 0.167 s to 0.005 s.)
import itertools
import re
import string
import time
from random import choice

import pandas as pd

# Generate 100,000 random strings of fixed size for demo purpose
alphabet: list[str] = list(string.ascii_lowercase[:11])

# Limit the number of words to benchmark to 100 for demo purposes
limit = 100

# Make pandas dataframe
df = pd.DataFrame(
    columns=["sequence"],
    data=["".join(choice(alphabet) for _ in range(100)) for _ in range(100000)],
)

# Get a list of substrings to count its frequency in the dataframe for
# the sake of demo, make substrings from "alphabet" with length of 3
# actual application can have up to substrings of length 5 (i.e. 11^5
# substrings)
words: list[str] = ["".join(p) for p in itertools.product(alphabet, repeat=3)]

# ORIGINAL METHOD
# ======================================================================
df["frequency_original"] = 0
times_original: list[float] = []
start = time.time()
for word in words[:limit]:
    tic = time.time()
    df["frequency_original"] = df["frequency_original"] + df["sequence"].apply(
        lambda x: len(re.findall("(?={0})".format(word), x))
    )
    times_original.append(time.time() - tic)

# Calculate and show statistics
mean_time_original = sum(times_original) / len(times_original)
total_time_original = time.time() - start
print(
    f"Mean time for original method: {mean_time_original} (total time: {total_time_original})"
)

# USING VECTORISED METHOD
# ======================================================================
df["frequency_str_findall"] = 0
times_new: list[float] = []
start = time.time()
for word in words[:limit]:
    tic = time.time()
    df["frequency_str_findall"] = df["frequency_str_findall"] + df[
        "sequence"
    ].str.findall(f"(?={word})").apply(len)
    times_new.append(time.time() - tic)

# Calculate and show statistics
mean_time_new = sum(times_new) / len(times_new)
total_time_new = time.time() - start
print(
    f"Mean time for vectorized method: {mean_time_new} (total time: {total_time_new})"
)

# USING VECTORISED METHOD WITH BATCHING
# ======================================================================
times_new_batch: list[float] = []
df["frequency_str_findall_batch"] = 0
batch_size = 100
start = time.time()
for index in range(0, len(words[:limit]), batch_size):
    tic = time.time()
    pattern = "(?=" + "|".join(words[index : index + batch_size]) + ")"
    df["frequency_str_findall_batch"] = df["frequency_str_findall_batch"] + df[
        "sequence"
    ].str.findall(pattern).apply(len)
    times_new_batch.append(time.time() - tic)

# Calculate and show statistics
mean_time_new_batch = sum(times_new_batch) / len(times_new_batch) / batch_size
total_time_new_batch = time.time() - start
print(
    f"Mean time for batched method: {mean_time_new_batch} (total time: {total_time_new_batch})"
)

# Show that the methods are equivalent
assert df["frequency_original"].equals(df["frequency_str_findall"])
assert df["frequency_str_findall"].equals(df["frequency_str_findall_batch"])

Mean time for original method: 0.16653778791427612 (total time: 16.65402317047119)
Mean time for vectorized method: 0.14487099170684814 (total time: 14.487332344055176)
Mean time for batched method: 0.005008835792541504 (total time: 0.5008978843688965)
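As a side note (an untimed variation on toy data, not part of the benchmark above): Series.str.count also treats its pattern as a regular expression, so the same lookahead trick gives overlapping counts directly, without the extra .apply(len) step.

```python
import pandas as pd

df = pd.DataFrame({"sequence": ["aaaaaa", "aabaa"]})

# str.count interprets the pattern as a regex, so a zero-width
# lookahead yields overlapping counts directly
counts = df["sequence"].str.count("(?=aa)")
print(counts.tolist())  # [5, 2]
```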
