I am looking for a fast way to store and retrieve numpy arrays using pyarrow. I am very happy with retrieval: extracting the columns from a .arrow file holding 1,000,000,000 integers of dtype = np.uint16 takes less than a second.
import pyarrow as pa
import numpy as np

def write(arr, name):
    arrays = [pa.array(col) for col in arr]
    names = [str(i) for i in range(len(arrays))]
    batch = pa.RecordBatch.from_arrays(arrays, names=names)
    with pa.OSFile(name, 'wb') as sink:
        with pa.RecordBatchStreamWriter(sink, batch.schema) as writer:
            writer.write_batch(batch)

def read(name):
    source = pa.memory_map(name, 'r')
    table = pa.ipc.RecordBatchStreamReader(source).read_all()
    for i in range(table.num_columns):
        yield table.column(str(i)).to_numpy()

arr = np.random.randint(65535, size=(250, 4000000), dtype=np.uint16)
%%timeit -r 1 -n 1
write(arr, 'test.arrow')
>>> 25.6 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
%%timeit -r 1 -n 1
for n in read('test.arrow'): n
>>> 901 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Can writing data to the .arrow format be made more efficient? For comparison, I also tested np.save:
%%timeit -r 1 -n 1
np.save('test.npy', arr)
>>> 18.5 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
That looks somewhat faster. Can Apache Arrow be tuned further so that writing data to the .arrow format is more efficient?
Does np.random.randint() return a generator or some similarly lazy structure? Are you timing the random-number generation as well as the write operation? (When I write parquet files with pandas it is much faster than this, even on an HDD.) - MatBailie
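For what it's worth, np.random.randint returns a fully materialized ndarray, not a lazy generator, so arr exists before the timed cells run. Still, the comment's suggestion is easy to check by timing the two phases explicitly; a minimal sketch (the smaller shape and file name here are placeholders, not the benchmarked sizes):

```python
import time
import numpy as np

t0 = time.perf_counter()
arr = np.random.randint(65535, size=(250, 40000), dtype=np.uint16)
t1 = time.perf_counter()
np.save('test_timing.npy', arr)
t2 = time.perf_counter()

# Report generation and write times separately.
print(f'generate: {t1 - t0:.3f} s, write: {t2 - t1:.3f} s')
```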