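I have a DataFrame that is mostly vectorized, but a few columns still need a groupby loop. The speed is acceptable for small datasets, but it becomes very slow for anything over 50k rows.
The basic idea: when the column "unique" has a value (np.isfinite), wait a set number of days (for example 4) and then set "complete" to True. Repeat. Positive results that fall inside the 4-period (day) wait should be ignored.
This is what I have now. It works, but it is very slow. I would really like to know how to vectorize it.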
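import numpy as np
import pandas as pd
from datetime import datetime, timedelta

# 30 daily timestamps for November 2019, repeated once per name
times = np.arange(datetime(2019, 11, 1), datetime(2019, 12, 1), timedelta(days=1)).astype(datetime)
times = np.concatenate([times, times])
names = np.array(['ALFA'] * 30 + ['BETA'] * 30)
unique = np.random.randn(60)
unique[unique < 0.7] = np.nan  # keep only the larger values as signals
df = pd.DataFrame({'unique': unique, 'complete': np.nan}, index=[names, times])
df['complete'] = df['complete'].astype(object)  # lets the column hold True alongside NaN
df.index = df.index.set_names(['Name', 'Date'])
df['num'] = df.groupby('Name').cumcount()  # row counter within each name

for n, group in df.groupby(level='Name'):
    posit = 0                      # 1 while a signal is being waited on
    entryNum = len(df.index) + 1   # sentinel, reset per name so no entry carries over
    for date, col in group.groupby(level='Date'):
        num = col['num'].iloc[0]
        if num - entryNum == 4:
            # the wait is over: mark this row and accept new signals again
            posit = 0
            df.loc[(n, date), 'complete'] = True
        if not posit and np.isfinite(col['unique'].iloc[0]):
            posit = 1
            entryNum = num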
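One way to cut most of the overhead, short of full vectorization, is to keep the same state machine but run it over plain NumPy arrays, one pass per name, instead of calling df.loc for every row. The sketch below assumes the rows are already sorted by date within each name and that one row equals one period. mark_complete and add_complete are names made up for this example, not pandas functions, and the small demo frame is invented input rather than the data from the question.

```python
import numpy as np
import pandas as pd


def mark_complete(values, wait=4):
    """Return a bool array that is True `wait` rows after each accepted signal."""
    finite = np.isfinite(values)
    complete = np.zeros(len(values), dtype=bool)
    entry = None  # row position of the signal currently being waited on
    for i in range(len(values)):
        if entry is not None and i - entry == wait:
            complete[i] = True
            entry = None          # wait is over, accept new signals again
        if entry is None and finite[i]:
            entry = i             # signals seen during a wait are ignored
    return complete


def add_complete(df, wait=4):
    """Apply mark_complete to the 'unique' column separately for each Name."""
    out = df.copy()
    values = out['unique'].to_numpy()
    result = np.zeros(len(out), dtype=bool)
    # GroupBy.indices maps each group key to the integer row positions it covers
    for pos in out.groupby(level='Name').indices.values():
        result[pos] = mark_complete(values[pos], wait)
    out['complete'] = result
    return out


# Tiny made-up input, only to show the call (not the question's data)
idx = pd.MultiIndex.from_product(
    [['ALFA', 'BETA'], pd.date_range('2019-11-01', periods=6)],
    names=['Name', 'Date'],
)
demo = pd.DataFrame(
    {'unique': [1.0, np.nan, 0.9, np.nan, np.nan, 1.2,
                np.nan, 0.8, np.nan, np.nan, np.nan, np.nan]},
    index=idx,
)
result = add_complete(demo)
```

This still loops in Python, but only once over each group's array, so the cost per row is a couple of comparisons rather than a groupby step and a label-based assignment. Unlike the loop in the question, it writes False instead of NaN for rows that are not marked.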
rafaelc's solution is great, but it gives different results in some cases:
Test dataset for the unique column:
unique = [0.808154, np.nan, np.nan, 0.976455, np.nan, 1.81917, np.nan, 0.732306, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 0.878656, np.nan, 1.087899, 1.57941, 1.211292, np.nan, 1.431411, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.323002, 1.339211, np.nan, np.nan, 1.322755, np.nan, 0.960014, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.833514, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 2.3884, np.nan, np.nan, 1.372292, np.nan, np.nan]
Output:
                   unique complete  countnonnull  solution
Name Date
ALFA 2019-11-01  0.808154      NaN           1.0     False
     2019-11-02       NaN      NaN           1.0     False
     2019-11-03       NaN      NaN           1.0     False
     2019-11-04  0.976455      NaN           2.0     False
     2019-11-05       NaN     True           1.0      True
     2019-11-06  1.819170      NaN           2.0     False
     2019-11-07       NaN      NaN           2.0     False
     2019-11-08  0.732306      NaN           2.0     False
     2019-11-09       NaN      NaN           2.0     False
     2019-11-10       NaN     True           1.0     False
     2019-11-11       NaN      NaN           1.0     False
     2019-11-12       NaN      NaN           0.0     False
     2019-11-13       NaN      NaN           0.0     False
     2019-11-14       NaN      NaN           0.0     False
     2019-11-15       NaN      NaN           0.0     False
     2019-11-16       NaN      NaN           0.0     False
     2019-11-17       NaN      NaN           0.0     False
     2019-11-18  0.878656      NaN           1.0     False
     2019-11-19       NaN      NaN           1.0     False
     2019-11-20  1.087899      NaN           2.0     False
     2019-11-21  1.579410      NaN           3.0     False
     2019-11-22  1.211292     True           3.0      True
     2019-11-23       NaN      NaN           3.0     False
     2019-11-24  1.431411      NaN           3.0     False
     2019-11-25       NaN      NaN           2.0     False
     2019-11-26       NaN     True           1.0     False
     2019-11-27       NaN      NaN           1.0     False
     2019-11-28       NaN      NaN           0.0     False
     2019-11-29       NaN      NaN           0.0     False
     2019-11-30       NaN      NaN           0.0     False
BETA 2019-11-01  1.323002      NaN           1.0     False
     2019-11-02  1.339211      NaN           2.0     False
     2019-11-03       NaN      NaN           2.0     False
     2019-11-04       NaN      NaN           2.0     False
     2019-11-05  1.322755     True           2.0      True
     2019-11-06       NaN      NaN           1.0     False
     2019-11-07  0.960014      NaN           2.0     False
     2019-11-08       NaN      NaN           2.0     False
     2019-11-09       NaN     True           1.0     False
     2019-11-10       NaN      NaN           1.0     False
     2019-11-11       NaN      NaN           0.0     False
     2019-11-12       NaN      NaN           0.0     False
     2019-11-13       NaN      NaN           0.0     False
     2019-11-14  1.833514      NaN           1.0     False
     2019-11-15       NaN      NaN           1.0     False
     2019-11-16       NaN      NaN           1.0     False
     2019-11-17       NaN      NaN           1.0     False
     2019-11-18       NaN     True           0.0      True
     2019-11-19       NaN      NaN           0.0     False
     2019-11-20       NaN      NaN           0.0     False
     2019-11-21       NaN      NaN           0.0     False
     2019-11-22       NaN      NaN           0.0     False
     2019-11-23       NaN      NaN           0.0     False
     2019-11-24       NaN      NaN           0.0     False
     2019-11-25  2.388400      NaN           1.0     False
     2019-11-26       NaN      NaN           1.0     False
     2019-11-27       NaN      NaN           1.0     False
     2019-11-28  1.372292      NaN           2.0     False
     2019-11-29       NaN     True           1.0      True
     2019-11-30       NaN      NaN           NaN     False
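To see exactly where the two columns disagree, the rows can be filtered rather than read by eye. In the table above the disagreements are ALFA 2019-11-10, ALFA 2019-11-26 and BETA 2019-11-09, where complete is True but solution is False. The sketch below rebuilds only three ALFA rows copied from that table, so it is an illustration of the comparison and not a rerun of either approach.

```python
import numpy as np
import pandas as pd

# Three rows copied from the output above (ALFA, 2019-11-05, -09 and -10)
idx = pd.MultiIndex.from_arrays(
    [['ALFA'] * 3, pd.to_datetime(['2019-11-05', '2019-11-09', '2019-11-10'])],
    names=['Name', 'Date'],
)
df = pd.DataFrame(
    {'complete': [True, np.nan, True], 'solution': [True, False, False]},
    index=idx,
)

# `complete` holds True or NaN, so compare against True to get a clean bool mask
expected = df['complete'].eq(True)
mismatch = df[expected != df['solution']]
print(mismatch)
```

Comparing against True avoids having to fill the NaN values first, which would otherwise need an explicit dtype conversion on the object column.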
N = 4
[...]) to "work around" the need for a loop, which means it may consume a lot of memory.. - rafaelc
[...] all the 1s in unique [...] to try your solution. - Quang Hoang