How to efficiently add rows to a Pandas DataFrame

Question

3

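I am trying to create a dummy file that I can later use for some machine-learning predictions. The input is about 2,000 'routes', and I want the dummy file to contain every year-month-day-hour combination over 7 days, i.e. 168 rows per route and about 350k rows in total.

The problem I am running into is that pandas becomes terribly slow once the number of added rows reaches a certain size.

I am using the following code: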

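import datetime

import pandas as pd

# Note: `utils` and `convert` are used below but are not defined in the question.

DAYS = [0, 1, 2, 3, 4, 5, 6]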
HODS = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]

ISODOW = {
    1: "monday",
    2: "tuesday",
    3: "wednesday",
    4: "thursday",
    5: "friday",
    6: "saturday",
    7: "sunday"
}

def createMyPredictionDummy(start=datetime.datetime.now(), sourceFile=(utils.mountBasePath + 'routeProperties.csv'), destFile=(utils.outputBasePath + 'ToBePredictedTTimes.csv')):
    '''Generate a dummy file that can be used for predictions'''
    data = ['route', 'someProperties']
    dataFile = data + ['yr', 'month', 'day', 'dow', 'hod']

    # New DataFrame with all required columns
    file = pd.DataFrame(columns=dataFile)

    # Old data frame that has only the target columns    
    df = pd.read_csv(sourceFile, converters=convert, delimiter=',')
    df = df[data]

    # Counter - To avoid constant lookup for length of the DF
    ix = 0

    routes = df['route'].drop_duplicates().tolist()
    # Iterate through all routes and create a row for every route-yr-month-day-hour combination for 7 day -->  about 350k rows
    for no, route in enumerate(routes):
        print('Current route is %s which is no. %g out of %g' % (str(route), no+1, len(routes)))
        routeDF = df.loc[df['route'] == route].iloc[0].tolist()
        for i in range(0, 7):
            tmpDate = start + datetime.timedelta(days=i)
            day = tmpDate.day
            month = tmpDate.month
            year = tmpDate.year
            dow = ISODOW[tmpDate.isoweekday()]
            for hod in HODS:
                file.loc[ix] = routeDF + [year, month, day, dow, hod] # This is becoming terribly slow
                ix += 1
    file.to_csv(destFile, index=False)
    print('Wrote file')

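I think the main problem is adding rows with .loc[]. Is there a more efficient way to add rows? If you have any other suggestions, I would be happy to hear them all!

Thanks and best regards,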

carbee

- cabeer

1

This might help you: https://dev59.com/P1gR5IYBdhLWcg3wAJJK#48287388 - John Karasinski

Do you have some test data? - hootnot

2

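Thanks for the link, @JohnKarasinski! It works very well. Could someone give me some insight into why? Since I cannot comment on other questions yet, I will add this here: if you are using Python 3, use io.StringIO() instead of io.BytesIO(). - cabeer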

2 Answers

0

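You create an empty DataFrame named file and then fill it by appending rows, which seems to be the problem. Instead, you could just do something like this: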

def createMyPredictionDummy(...):
    ...
    # make it yield a dict of attributes from the for loop
    for hod in HODS:
        yield data

# then use this to create the *file* dataframe outside that function
newDF = pd.DataFrame([r for r in createMyPredictionDummy()])
newDF.to_csv(destFile, index=False)
print('Wrote file')
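
A minimal runnable sketch of the generator idea above, assuming the per-route properties are already available as dicts. The function name iter_prediction_rows, the sample routes, and the fixed start date are hypothetical stand-ins and are not part of the original code.

```python
import datetime

import pandas as pd

ISODOW = {1: "monday", 2: "tuesday", 3: "wednesday", 4: "thursday",
          5: "friday", 6: "saturday", 7: "sunday"}


def iter_prediction_rows(route_rows, start, days=7):
    """Yield one dict per route/day/hour combination."""
    for route_row in route_rows:
        for i in range(days):
            tmp_date = start + datetime.timedelta(days=i)
            for hod in range(24):
                yield {
                    **route_row,
                    "yr": tmp_date.year,
                    "month": tmp_date.month,
                    "day": tmp_date.day,
                    "dow": ISODOW[tmp_date.isoweekday()],
                    "hod": hod,
                }


# Hypothetical sample input standing in for routeProperties.csv
sample_routes = [
    {"route": "A", "someProperties": 1},
    {"route": "B", "someProperties": 2},
]
start = datetime.datetime(2018, 1, 1)

# The DataFrame is built once, from all rows, instead of growing row by row
new_df = pd.DataFrame(list(iter_prediction_rows(sample_routes, start)))
```

Because the DataFrame is constructed in a single call, pandas allocates its storage once rather than on every added row.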

- hootnot


- Marco Spinaci · Accepted Answer

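(This is more of a long comment than an answer, sorry, but without sample data I cannot run much...)

Since it seems to me that you are adding rows one by one in order (i.e. the DataFrame is indexed by sequentially accessed integers) and you always know the order of the columns, you are probably better off building a list of lists and then converting it to a DataFrame. That is, define something like file_list = [] and replace the line file.loc[ix] = ... with: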

file_list.append(routeDF + [year, month, day, dow, hod])

Finally, you can then define:

file = pd.DataFrame(file_list, columns=dataFile)
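
Put together, the list-of-lists approach might look like the following sketch. The route_rows sample and the fixed start date are hypothetical stand-ins for the data read from routeProperties.csv, which is not shown in the question.

```python
import datetime

import pandas as pd

ISODOW = {1: "monday", 2: "tuesday", 3: "wednesday", 4: "thursday",
          5: "friday", 6: "saturday", 7: "sunday"}
data_file = ["route", "someProperties", "yr", "month", "day", "dow", "hod"]

# Hypothetical stand-in for the per-route rows (route, someProperties)
route_rows = [["A", 1], ["B", 2]]
start = datetime.datetime(2018, 1, 1)

file_list = []
for route_row in route_rows:
    for i in range(7):
        tmp_date = start + datetime.timedelta(days=i)
        dow = ISODOW[tmp_date.isoweekday()]
        for hod in range(24):
            # Appending to a plain Python list is cheap compared to file.loc[ix] = ...
            file_list.append(
                route_row + [tmp_date.year, tmp_date.month, tmp_date.day, dow, hod]
            )

# Convert to a DataFrame once, at the end
file = pd.DataFrame(file_list, columns=data_file)
```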

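If all your data is of a fixed type (e.g. int, depending on what is in your routeDF, and provided you do not convert dow until after the DataFrame is created), you may be even better off preallocating a numpy array and writing into it. However, I am fairly sure that appending elements to a list will not be the bottleneck of your code, so this is probably over-optimization.

Another alternative, which minimizes code changes, is simply to preallocate enough space by creating a DataFrame full of NaN instead of a DataFrame with no rows, i.e. change the definition of file to (after moving the drop_duplicates line up):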
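
As a rough illustration of the numpy variant, the sketch below assumes every value is an integer: the route identifiers and properties are hypothetical integer samples, and dow is kept as the ISO weekday number rather than mapped to a name.

```python
import datetime

import numpy as np
import pandas as pd

columns = ["route", "someProperties", "yr", "month", "day", "dow", "hod"]

# Hypothetical all-integer route rows: [route id, some property]
route_rows = [[101, 1], [102, 2]]
start = datetime.datetime(2018, 1, 1)

# Preallocate the full array: 7 days * 24 hours = 168 rows per route
out = np.empty((len(route_rows) * 168, len(columns)), dtype=np.int64)

ix = 0
for route_row in route_rows:
    for i in range(7):
        tmp_date = start + datetime.timedelta(days=i)
        for hod in range(24):
            # dow stays an ISO weekday number (1 = Monday) so the array is purely integer
            out[ix] = route_row + [
                tmp_date.year, tmp_date.month, tmp_date.day, tmp_date.isoweekday(), hod
            ]
            ix += 1

file = pd.DataFrame(out, columns=columns)
```

If weekday names are needed, the dow column can be mapped to strings after the DataFrame has been created.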

file = pd.DataFrame(columns=dataFile, index=range(len(routes)*168))

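I am fairly sure this is faster than your code, but it is probably still slower than the list approach above, because it does not know which data types to expect until the data is filled in (e.g. it might convert your integers to floats, which is not ideal). However, once you get rid of the continuous reallocation caused by growing the DataFrame at every step, this will probably no longer be your bottleneck (the double loop probably will be).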
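
A small sketch of the preallocation variant, again with hypothetical sample routes and a fixed start date. Because the columns may not end up with the intended dtypes after row-by-row filling, this example sets the integer columns explicitly with astype at the end; that final step is an addition for illustration and is not part of the answer.

```python
import datetime

import pandas as pd

ISODOW = {1: "monday", 2: "tuesday", 3: "wednesday", 4: "thursday",
          5: "friday", 6: "saturday", 7: "sunday"}
data_file = ["route", "someProperties", "yr", "month", "day", "dow", "hod"]

# Hypothetical stand-in for the rows read from routeProperties.csv
route_rows = [["A", 1], ["B", 2]]
start = datetime.datetime(2018, 1, 1)

# Preallocate every row up front; all cells start out as NaN
file = pd.DataFrame(columns=data_file, index=range(len(route_rows) * 168))

ix = 0
for route_row in route_rows:
    for i in range(7):
        tmp_date = start + datetime.timedelta(days=i)
        dow = ISODOW[tmp_date.isoweekday()]
        for hod in range(24):
            # The row label already exists, so this fills a row instead of growing the frame
            file.loc[ix] = route_row + [tmp_date.year, tmp_date.month, tmp_date.day, dow, hod]
            ix += 1

# Set the integer dtypes explicitly once all rows are filled
int_cols = ["yr", "month", "day", "hod"]
file = file.astype({col: "int64" for col in int_cols})
```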