Performance problem: Python pandas DataFrame.to_csv append gets progressively slower


Initial question:

I'm using Python pandas to loop over a few thousand pickle files, each containing a DataFrame with a varying number of rows (roughly between 600 and 1300) but the same number of columns (636, to be exact). I then transform each one (the exact same transformation for every DataFrame) and append it to a CSV file with the DataFrame.to_csv() method.

to_csv code excerpt:

if picklefile == '0000.p':
    dftemp.to_csv(finalnormCSVFile)
else:
    dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

What bothers me is that it starts out fast, but performance drops off exponentially. I kept a log of the processing times:

start: 2015-03-24 03:26:36.958058

2015-03-24 03:26:36.958058
count = 0
time: 0:00:00

2015-03-24 03:30:53.254755
count = 100
time: 0:04:16.296697

2015-03-24 03:39:16.149883
count = 200
time: 0:08:22.895128

2015-03-24 03:51:12.247342
count = 300
time: 0:11:56.097459

2015-03-24 04:06:45.099034
count = 400
time: 0:15:32.851692

2015-03-24 04:26:09.411652
count = 500
time: 0:19:24.312618

2015-03-24 04:49:14.519529
count = 600
time: 0:23:05.107877

2015-03-24 05:16:30.175175
count = 700
time: 0:27:15.655646

2015-03-24 05:47:04.792289
count = 800
time: 0:30:34.617114

2015-03-24 06:21:35.137891
count = 900
time: 0:34:30.345602

2015-03-24 06:59:53.313468
count = 1000
time: 0:38:18.175577

2015-03-24 07:39:29.805270
count = 1100
time: 0:39:36.491802

2015-03-24 08:20:30.852613
count = 1200
time: 0:41:01.047343

2015-03-24 09:04:14.613948
count = 1300
time: 0:43:43.761335

2015-03-24 09:51:45.502538
count = 1400
time: 0:47:30.888590

2015-03-24 11:09:48.366950
count = 1500
time: 1:18:02.864412

2015-03-24 13:02:33.152289
count = 1600
time: 1:52:44.785339

2015-03-24 15:30:58.534493
count = 1700
time: 2:28:25.382204

2015-03-24 18:09:40.391639
count = 1800
time: 2:38:41.857146

2015-03-24 21:03:19.204587
count = 1900
time: 2:53:38.812948

2015-03-25 00:00:05.855970
count = 2000
time: 2:56:46.651383

2015-03-25 03:53:05.020944
count = 2100
time: 3:52:59.164974

2015-03-25 05:02:16.534149
count = 2200
time: 1:09:11.513205

2015-03-25 06:07:32.446801
count = 2300
time: 1:05:15.912652

2015-03-25 07:13:45.075216
count = 2400
time: 1:06:12.628415

2015-03-25 08:20:17.927286
count = 2500
time: 1:06:32.852070

2015-03-25 09:27:20.676520
count = 2600
time: 1:07:02.749234

2015-03-25 10:35:01.657199
count = 2700
time: 1:07:40.980679

2015-03-25 11:43:20.788178
count = 2800
time: 1:08:19.130979

2015-03-25 12:53:57.734390
count = 2900
time: 1:10:36.946212

2015-03-25 14:07:20.936314
count = 3000
time: 1:13:23.201924

2015-03-25 15:22:47.076786
count = 3100
time: 1:15:26.140472

2015-03-25 19:51:10.776342
count = 3200
time: 4:28:23.699556

2015-03-26 03:06:47.372698
count = 3300
time: 7:15:36.596356

count = 3324
end of cycle: 2015-03-26 03:59:54.161842

end: 2015-03-26 03:59:54.161842
total duration: 2 days, 0:33:17.203784

Update #1:

I did what you suggested, @Alexander, and it is definitely related to the to_csv() method:

start: 2015-03-26 05:18:25.948410

2015-03-26 05:18:25.948410
count = 0
time: 0:00:00

2015-03-26 05:20:30.425041
count = 100
time: 0:02:04.476631

2015-03-26 05:22:27.680582
count = 200
time: 0:01:57.255541

2015-03-26 05:24:26.012598
count = 300
time: 0:01:58.332016

2015-03-26 05:26:16.542835
count = 400
time: 0:01:50.530237

2015-03-26 05:27:58.063196
count = 500
time: 0:01:41.520361

2015-03-26 05:29:45.769580
count = 600
time: 0:01:47.706384

2015-03-26 05:31:44.537213
count = 700
time: 0:01:58.767633

2015-03-26 05:33:41.591837
count = 800
time: 0:01:57.054624

2015-03-26 05:35:43.963843
count = 900
time: 0:02:02.372006

2015-03-26 05:37:46.171643
count = 1000
time: 0:02:02.207800

2015-03-26 05:38:36.493399
count = 1100
time: 0:00:50.321756

2015-03-26 05:39:42.123395
count = 1200
time: 0:01:05.629996

2015-03-26 05:41:13.122048
count = 1300
time: 0:01:30.998653

2015-03-26 05:42:41.885513
count = 1400
time: 0:01:28.763465

2015-03-26 05:44:20.937519
count = 1500
time: 0:01:39.052006

2015-03-26 05:46:16.012842
count = 1600
time: 0:01:55.075323

2015-03-26 05:48:14.727444
count = 1700
time: 0:01:58.714602

2015-03-26 05:50:15.792909
count = 1800
time: 0:02:01.065465

2015-03-26 05:51:48.228601
count = 1900
time: 0:01:32.435692

2015-03-26 05:52:22.755937
count = 2000
time: 0:00:34.527336

2015-03-26 05:52:58.289474
count = 2100
time: 0:00:35.533537

2015-03-26 05:53:39.406794
count = 2200
time: 0:00:41.117320

2015-03-26 05:54:11.348939
count = 2300
time: 0:00:31.942145

2015-03-26 05:54:43.057281
count = 2400
time: 0:00:31.708342

2015-03-26 05:55:19.483600
count = 2500
time: 0:00:36.426319

2015-03-26 05:55:52.216424
count = 2600
time: 0:00:32.732824

2015-03-26 05:56:27.409991
count = 2700
time: 0:00:35.193567

2015-03-26 05:57:00.810139
count = 2800
time: 0:00:33.400148

2015-03-26 05:58:17.109425
count = 2900
time: 0:01:16.299286

2015-03-26 05:59:31.021719
count = 3000
time: 0:01:13.912294

2015-03-26 06:00:49.200303
count = 3100
time: 0:01:18.178584

2015-03-26 06:02:07.732028
count = 3200
time: 0:01:18.531725

2015-03-26 06:03:28.518541
count = 3300
time: 0:01:20.786513

count = 3324
end of cycle: 2015-03-26 06:03:47.321182

end: 2015-03-26 06:03:47.321182
total duration: 0:45:21.372772

As requested, here is the source code:

import pickle
import pandas as pd
import numpy as np
from os import listdir
from os.path import isfile, join
from datetime import datetime

# Defining function to deep copy pandas data frame:
def very_deep_copy(self):
    return pd.DataFrame(self.values.copy(), self.index.copy(), self.columns.copy())

# Adding function to Dataframe module:    
pd.DataFrame.very_deep_copy = very_deep_copy

#Define Data Frame Header:
head = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ..., 'Attribute<autosave>', 'Attribute<bgcolor>'
    ]
exclude = [
    'ConcatIndex', 'Concatenated String Index', 'FileID', ... , 'Real URL Array'
    ]

path = "./dataset_final/"
pickleFiles = [ f for f in listdir(path) if isfile(join(path,f)) ]
finalnormCSVFile = 'finalNormalizedDataFrame2.csv'

count = 0
start_time = datetime.now()
t1 = start_time
print("start: " + str(start_time) + "\n")


for picklefile in pickleFiles: 
    if count%100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    for i in range(0, len(df)):
        df.loc[i, 'Concatenated String Index'] = str(df['ConcatIndex'][i]).zfill(10)
    df.index = df.ConcatIndex


    #DataFrame Normalization:
    dftemp = df.very_deep_copy()
    for string in head:
        if string in exclude:
            if string != 'ConcatIndex':
                dftemp.drop(string, axis=1, inplace=True)
        else:
            if 'Real ' in string:
                max = pd.DataFrame.max(df[string.strip('Real ')])
            elif 'child' in string:
                max = pd.DataFrame.max(df[string.strip('child')+'desc'])
            else:
                max = pd.DataFrame.max(df[string])

            if max != 0:
                dftemp[string] = dftemp[string]/max

    dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

    count += 1

print('count = ' + str(count))
cycle_end_time = datetime.now()
print("end of cycle: " + str(cycle_end_time) + "\n")

end_time = datetime.now()
print("end: " + str(end_time))
print('total duration: ' + str(end_time - start_time) + '\n')

Update #2:

As suggested, I ran %prun %run "./DataSetNormalization.py" over the first couple of hundred pickle files, and these are the results:


   136373640 function calls (136342619 primitive calls) in 1018.769 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      220  667.069    3.032  667.069    3.032 {method 'close' of '_io.TextIOWrapper' objects}
     1540   42.046    0.027   46.341    0.030 {pandas.lib.write_csv_rows}
      219   34.886    0.159   34.886    0.159 {built-in method collect}
     3520   16.782    0.005   16.782    0.005 {pandas.algos.take_2d_axis1_object_object}
    78323    9.948    0.000    9.948    0.000 {built-in method empty}
 25336892    9.645    0.000   12.635    0.000 {built-in method isinstance}
  1433941    9.344    0.000    9.363    0.000 generic.py:1845(__setattr__)
221051/220831    7.387    0.000  119.767    0.001 indexing.py:194(_setitem_with_indexer)
   723540    7.312    0.000    7.312    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   273414    7.137    0.000   20.642    0.000 internals.py:2656(set)
   604245    6.846    0.000    6.850    0.000 {method 'copy' of 'numpy.ndarray' objects}
     1760    6.566    0.004    6.566    0.004 {pandas.lib.isnullobj}
   276274    5.315    0.000    5.315    0.000 {method 'ravel' of 'numpy.ndarray' objects}
  1719244    5.264    0.000    5.266    0.000 {built-in method array}
  1102450    5.070    0.000   29.543    0.000 internals.py:1804(make_block)
  1045687    5.056    0.000   10.209    0.000 index.py:709(__getitem__)
        1    4.718    4.718 1018.727 1018.727 DataSetNormalization.py:6(<module>)
   602485    4.575    0.000   15.087    0.000 internals.py:2586(iget)
   441662    4.562    0.000   33.386    0.000 internals.py:2129(apply)
   272754    4.550    0.000    4.550    0.000 internals.py:1291(set)
   220883    4.073    0.000    4.073    0.000 {built-in method charmap_encode}
  4781222    3.805    0.000    4.349    0.000 {built-in method getattr}
    52143    3.673    0.000    3.673    0.000 {built-in method truediv}
  1920486    3.671    0.000    3.672    0.000 {method 'get_loc' of 'pandas.index.IndexEngine' objects}
  1096730    3.513    0.000    8.370    0.000 internals.py:3035(__init__)
   875899    3.508    0.000   14.458    0.000 series.py:134(__init__)
   334357    3.420    0.000    3.439    0.000 {pandas.lib.infer_dtype}
  2581268    3.419    0.000    4.774    0.000 {pandas.lib.values_from_object}
  1102450    3.036    0.000    6.110    0.000 internals.py:59(__init__)
   824856    2.888    0.000   45.749    0.000 generic.py:1047(_get_item_cache)
  2424185    2.657    0.000    3.870    0.000 numeric.py:1910(isscalar)
   273414    2.505    0.000    9.332    0.000 frame.py:2113(_sanitize_column)
  1646198    2.491    0.000    2.880    0.000 index.py:698(__contains__)
   879639    2.461    0.000    2.461    0.000 generic.py:87(__init__)
   552988    2.385    0.000    4.451    0.000 internals.py:3565(_get_blkno_placements)
   824856    2.349    0.000   51.282    0.000 frame.py:1655(__getitem__)
   220831    2.224    0.000   21.670    0.000 internals.py:460(setitem)
   326437    2.183    0.000   11.352    0.000 common.py:1862(_possibly_infer_to_datetimelike)
   602485    2.167    0.000   16.974    0.000 frame.py:1982(_box_item_values)
   602485    2.087    0.000   23.202    0.000 internals.py:2558(get)
   770739    2.036    0.000    6.471    0.000 internals.py:1238(__init__)
   276494    1.966    0.000    1.966    0.000 {pandas.lib.get_blkno_indexers}
10903876/10873076    1.935    0.000    1.972    0.000 {built-in method len}
   220831    1.924    0.000   76.647    0.000 indexing.py:372(setter)
      220    1.893    0.009    1.995    0.009 {built-in method load}
  1920486    1.855    0.000    8.198    0.000 index.py:1173(get_loc)
   112860    1.828    0.000    9.607    0.000 common.py:202(_isnull_ndarraylike)
   602485    1.707    0.000    8.903    0.000 series.py:238(from_array)
   875899    1.688    0.000    2.493    0.000 series.py:263(_set_axis)
     3300    1.661    0.001    1.661    0.001 {method 'tolist' of 'numpy.ndarray' objects}
  1102670    1.609    0.000    2.024    0.000 internals.py:108(mgr_locs)
  4211850    1.593    0.000    1.593    0.000 {built-in method issubclass}
  1335546    1.501    0.000    2.253    0.000 generic.py:297(_get_axis_name)
   273414    1.411    0.000   37.866    0.000 frame.py:1994(__setitem__)
   441662    1.356    0.000    7.884    0.000 indexing.py:982(_convert_to_indexer)
   220831    1.349    0.000  131.331    0.001 indexing.py:95(__setitem__)
   273414    1.329    0.000   23.170    0.000 generic.py:1138(_set_item)
   326437    1.276    0.000    6.203    0.000 fromnumeric.py:2259(prod)
   274734    1.271    0.000    2.113    0.000 shape_base.py:60(atleast_2d)
   273414    1.242    0.000   34.396    0.000 frame.py:2072(_set_item)
   602485    1.183    0.000    1.979    0.000 generic.py:1061(_set_as_cached)
   934422    1.175    0.000    1.894    0.000 {method 'view' of 'numpy.ndarray' objects}
     1540    1.144    0.001   58.217    0.038 format.py:1409(_save_chunk)
   220831    1.144    0.000    9.198    0.000 indexing.py:139(_convert_tuple)
   441662    1.137    0.000    3.036    0.000 indexing.py:154(_convert_scalar_indexer)
   220831    1.087    0.000    1.281    0.000 arrayprint.py:343(array2string)
  1332026    1.056    0.000    3.997    0.000 generic.py:310(_get_axis)
   602485    1.046    0.000    9.949    0.000 frame.py:1989(_box_col_values)
      220    1.029    0.005    1.644    0.007 internals.py:2429(_interleave)
   824856    1.025    0.000   46.777    0.000 frame.py:1680(_getitem_column)
  1491578    1.022    0.000    2.990    0.000 common.py:58(_check)
   782616    1.010    0.000    3.513    0.000 numeric.py:394(asarray)
   290354    0.988    0.000    1.386    0.000 internals.py:1950(shape)
   220831    0.958    0.000   15.392    0.000 generic.py:2101(copy)
   273414    0.940    0.000    1.796    0.000 indexing.py:1520(_convert_to_index_sliceable)
   220831    0.920    0.000    1.558    0.000 common.py:1110(_possibly_downcast_to_dtype)
   220611    0.914    0.000    0.914    0.000 {pandas.lib.is_bool_array}
   498646    0.906    0.000    0.906    0.000 {method 'clear' of 'dict' objects}
   715345    0.848    0.000   13.083    0.000 common.py:132(_isnull_new)
   452882    0.824    0.000    1.653    0.000 index.py:256(__array_finalize__)
   602485    0.801    0.000    0.801    0.000 internals.py:208(iget)
    52583    0.748    0.000    2.038    0.000 common.py:1223(_fill_zeros)
   606005    0.736    0.000    6.755    0.000 internals.py:95(make_block_same_class)
   708971    0.732    0.000    2.156    0.000 internals.py:3165(values)
  1760378    0.724    0.000    0.724    0.000 internals.py:2025(_get_items)
   109560    0.720    0.000    6.140    0.000 nanops.py:152(_get_values)
   220831    0.718    0.000   11.017    0.000 internals.py:2395(copy)
   924669    0.712    0.000    1.298    0.000 common.py:2248(_get_dtype_type)
  1515796    0.698    0.000    0.868    0.000 {built-in method hasattr}
   220831    0.670    0.000    4.299    0.000 internals.py:435(copy)
   875899    0.661    0.000    0.661    0.000 series.py:285(_set_subtyp)
   220831    0.648    0.000    0.649    0.000 {method 'get_value' of 'pandas.index.IndexEngine' objects}
   452882    0.640    0.000    0.640    0.000 index.py:218(_reset_identity)
   715345    0.634    0.000    1.886    0.000 {pandas.lib.isscalar}
     1980    0.626    0.000    1.172    0.001 internals.py:3497(_merge_blocks)
   220831    0.620    0.000    2.635    0.000 common.py:1933(_is_bool_indexer)
   272754    0.608    0.000    0.899    0.000 internals.py:1338(should_store)
   220831    0.599    0.000    3.463    0.000 series.py:482(__getitem__)
   498645    0.591    0.000    1.497    0.000 generic.py:1122(_clear_item_cache)
  1119390    0.584    0.000    1.171    0.000 index.py:3936(_ensure_index)
   220831    0.573    0.000    1.883    0.000 index.py:222(view)
   814797    0.555    0.000    0.905    0.000 internals.py:3086(_values)
    52583    0.543    0.000   15.545    0.000 ops.py:469(wrapper)
   220831    0.536    0.000    3.760    0.000 internals.py:371(_try_cast_result)
   228971    0.533    0.000    0.622    0.000 generic.py:1829(__getattr__)
   769651    0.528    0.000    0.528    0.000 {built-in method min}
   224351    0.509    0.000    2.030    0.000 generic.py:1099(_maybe_update_cacher)
   ...

I'll rerun it to confirm, but it does look like it's definitely related to pandas' to_csv() method, because most of the runtime is spent on I/O and the csv writer. Why does it have this effect? Any suggestions?

Update #3:

OK, I ran the full %prun test, and indeed almost 90% of the time is spent in {method 'close' of '_io.TextIOWrapper' objects}. So I guess that's where the problem is... What do you think?

My questions are:

  1. What is causing the performance degradation here?
  2. Does pandas.DataFrame.to_csv() in append mode load the whole file on every write?
  3. Is there a way to speed this up?

Without sample code that demonstrates the problem when writing the file, this is going to be hard to debug. - BrenBarn
Are you sure the time lag is due to the .to_csv write and not the DataFrame processing (you haven't provided any code)? Try skipping the .csv write and just printing the times, to see whether the same performance problem remains. - Alexander
@Alexander Added the full source code; I'll test your suggestion in a few minutes. - CMPSoares
Depending on your memory constraints, have you tried concatenating the DataFrames in memory, e.g. every 100 DataFrames, and then saving the batched DataFrame to the .csv file? For example, appending each DataFrame's result to a list and then concatenating the list into a single DataFrame before exporting. - Alexander
Yes Alexander, I tried that, but it turned out even slower. The final CSV file is close to 6 GB. - CMPSoares
2 Answers

In a case like this you should profile your code (to see which function calls are taking the longest), that way you can check empirically that it really is slow in to_csv rather than somewhere else...

From looking at your code: first, there is a lot of copying and a lot of looping here (not enough vectorization)... every time you see a loop, look for a way to remove it. Second, since you are using things like zfill, I wonder whether you actually want to_fwf (fixed-width format) rather than to_csv?

Some sanity testing: are some files significantly bigger than the others (which could push you into disk swap)? Are you sure the largest files are only 1200 rows? Have you checked this, e.g. with wc -l? In my opinion garbage collection is unlikely to be the problem... (as was suggested in the other answer).

Here are a couple of suggestions for improving your code that should improve the runtime.

The columns are fixed, so I would pull the column calculations out of the loop and vectorize the real, child and other normalizations. Use apply rather than iterating (for the zfill).

columns_to_drop = set(head) & set(exclude)  # maybe also - ['ConcatIndex']
remaining_cols = set(head) - set(exclude)
real_cols = [r for r in remaining_cols if 'Real ' in r]
real_cols_suffix = [r.strip('Real ') for r in real_cols]
remaining_cols = remaining_cols - set(real_cols)
child_cols = [r for r in remaining_cols if 'child' in r]
child_cols_desc = [r.strip('child') + 'desc' for r in child_cols]
remaining_cols = remaining_cols - set(child_cols)

for count, picklefile in enumerate(pickleFiles):
    if count % 100 == 0:
        t2 = datetime.now()
        print(str(t2))
        print('count = ' + str(count))
        print('time: ' + str(t2 - t1) + '\n')
        t1 = t2

    #DataFrame Manipulation:
    df = pd.read_pickle(path + picklefile)

    df['ConcatIndex'] = 100000*df.FileID + df.ID
    # use apply here rather than iterating
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex

    #DataFrame Normalization:
    dftemp = df.very_deep_copy()  # don't *think* you need this

    # drop all excludes
    dftemp.drop(list(columns_to_drop), axis=1, inplace=True)

    # normalize real cols
    m = dftemp[real_cols_suffix].max()
    m.index = real_cols
    dftemp[real_cols] = dftemp[real_cols] / m

    # normalize child cols
    m = dftemp[child_cols_desc].max()
    m.index = child_cols
    dftemp[child_cols] = dftemp[child_cols] / m

    # normalize remaining
    remaining = list(remaining_cols)
    dftemp[remaining] = dftemp[remaining] / dftemp[remaining].max()

    # if this case is important then discard the rows of m with .max() is 0
    #if max != 0:
    #    dftemp[string] = dftemp[string]/max

    # this is dropped earlier, if you need it, then subtract ['ConcatIndex'] from columns_to_drop
    # dftemp.drop('ConcatIndex', axis=1, inplace=True)

    #Saving DataFrame in CSV:
    if picklefile == '0000.p':
        dftemp.to_csv(finalnormCSVFile)
    else:
        dftemp.to_csv(finalnormCSVFile, mode='a', header=False)

As a matter of style, I would probably wrap each of these pieces into functions; that also means more things can be garbage collected, if that really is the problem...
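
For example, a rough sketch of what that refactor might look like (the function name is hypothetical; the column names are taken from the question's code, and the normalization body is elided):

import pandas as pd

def normalize_one(path, picklefile):
    """Read one pickle, apply the same transformation and return the frame to write."""
    df = pd.read_pickle(path + picklefile)
    df['ConcatIndex'] = 100000 * df.FileID + df.ID
    df['Concatenated String Index'] = df['ConcatIndex'].apply(lambda x: str(x).zfill(10))
    df.index = df.ConcatIndex
    # ... vectorized normalization as above ...
    return df

When the function returns, its locals (any intermediate frames) go out of scope and become collectable.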

Another, much faster, option is to use pytables (HDFStore), if you don't need the output to be csv (but I suspect you do)...
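
For reference, a minimal sketch of what the HDFStore variant could look like (the file name and the dummy frame are hypothetical; it requires the PyTables package, and object/string columns may need min_itemsize set when appending):

import pandas as pd

# Open the store once and append each transformed frame as table rows.
with pd.HDFStore('finalNormalizedDataFrame2.h5', mode='w') as store:
    for i in range(5):
        dftemp = pd.DataFrame({'x': range(3)})  # stand-in for the transformed frame
        store.append('normalized', dftemp)      # appends rows to an on-disk table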

By far the best thing to do is to profile your code, e.g. with %prun in IPython, see http://pynash.org/2013/03/06/timing-and-profiling.html. Then you can see that it really is to_csv, and exactly where (which line of your code and which lines of pandas code).
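
If you prefer to profile outside IPython, a minimal sketch with the standard library's cProfile (assuming the script is run from its own directory; the stats file name is arbitrary):

import cProfile
import pstats

# Run the whole script under the profiler and dump the stats to a file.
cProfile.run('exec(open("DataSetNormalization.py").read())', 'profile.out')

# Print the 20 entries with the largest internal (tottime) cost.
pstats.Stats('profile.out').sort_stats('tottime').print_stats(20)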


Aha, I missed that you are appending everything to a single csv file. Your prun shows that most of the time is spent in close, so let's keep the file open:
# outside of the for loop (so the file is opened and closed only once)
f = open(finalnormCSVFile, 'w')

...
for picklefile in ...

    if picklefile == '0000.p':
        dftemp.to_csv(f)
    else:
        dftemp.to_csv(f, mode='a', header=False)
...

f.close()

Every time you open the file in append mode before writing, it has to seek to the end before it can write. It may be that this is expensive (I don't see why it should be quite this bad, but keeping the file open avoids having to do it).
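
A tiny runnable sketch of the same keep-it-open pattern, with a dummy frame standing in for dftemp and a with-block so the file is closed even if the loop raises:

import pandas as pd

with open('finalNormalizedDataFrame2.csv', 'w') as f:
    for i in range(5):
        dftemp = pd.DataFrame({'x': range(3)})  # stand-in for the transformed frame
        dftemp.to_csv(f, header=(i == 0))       # write the header only on the first frame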

@CMPSoares Thanks. Have you checked the sizes of all the files? Is any of them huge? If the files are large it would be no surprise that this is a memory problem (but I guess it could be anyway!)... How much RAM do you have? - Andy Hayden
@CMPSoares Hmm, since so much of the time is spent in close, how about opening the file only once? Thinking about it, that means it only needs a single write. - Andy Hayden
OK, I have to hand it to you, well done. Geeklhem's answer helped with performance, but only cut about 5% off the execution time. Your last edit, however, reduced the execution time by almost 98.3% (from 2 days 30 minutes down to 50 minutes). Thank you so much for the help, I learned a lot! Happy Easter! - CMPSoares
@CMPSoares Awesome, glad it helped! That's an amazing difference, I feel like this should be included in the pandas performance tips (not sure where it would go). Wow. - Andy Hayden
@CMPSoares One last follow-up (so I can try to reproduce this / add a performance tip): what is the size of the resulting CSV file? - Andy Hayden

My guess is that this is due to the very_deep_copy you are doing. Have you checked memory usage over time? It may be that memory is not being released properly.
If that is the problem, you can do one of the following:
1) Avoid the copy entirely (better performance-wise).
2) Force garbage collection from time to time with gc.collect().
For a related question see "Python garbage collection", as well as this article on garbage collection in Python.
EDIT: The solution to eliminate the copy is to:
1) Store the normalizing constant for each column before normalizing.
2) Drop the unneeded columns after normalizing.
# Get the normalizing constant for each column.
max = {}

for string in head:
    if string not in exclude:
        if 'Real ' in string:
            max[string] = df[string.strip('Real ')].max()
        elif 'child' in string:
            max[string] = df[string.strip('child')+'desc'].max()
        else:
            max[string] = df[string].max()

# Actual normalization, each column is divided by
# its constant if possible. 
for key,value in max.items():
    if value != 0:
        df[key] /= value

# Drop the excluded columns 
df.drop(exclude, axis=1, inplace=True)
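
If you also try the gc.collect() route, a minimal sketch of calling it only occasionally rather than on every file (the per-file work is elided; the every-100-iterations idea is from the comments below):

import gc

for count in range(3324):
    # ... per-file processing would go here ...
    if count % 100 == 0:
        gc.collect()  # force a collection only occasionally; collect() itself can be slow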

Why do you think it's garbage collection? - Andy Hayden
Indeed, collect can be slow and you wouldn't want to run it on every iteration. Maybe you could try calling it inside the if count%100 == 0: block. - Geeklhem
Have you tried removing very_deep_copy entirely (e.g. as I suggested in my answer)? I think that is your best bet performance-wise. - Geeklhem
If I understand correctly, your normalization divides each column by the maximum of another column. You make the copy because you want the other columns' maxima before they are normalized. A simple workaround is to store all the normalizing constants (the maxima) first, and then modify the DataFrame. I have updated the code snippet proposed in my answer. - Geeklhem
Regarding the to_csv() issue, I have to admit I'm struggling to see what could cause it, since I use it a lot. At its core it uses the Python standard library's csv module (with chunking and a bit of Cython to speed things up). Maybe there is a problem with large files, especially on Windows? Have you tried saving the data to separate files instead of appending to the same one? - Geeklhem
