I'm looking for a way to read data from multiple partitioned directories in S3 using Python. The layout looks like this:
data_folder/serial_number=1/cur_date=20-12-2012/abcdsd0324324.snappy.parquet
data_folder/serial_number=2/cur_date=27-12-2012/asdsdfsd0324324.snappy.parquet
pyarrow's ParquetDataset module has support for reading partitioned data, so I tried the following code:
>>> import pandas as pd
>>> import pyarrow.parquet as pq
>>> import s3fs
>>> a = "s3://my_bucker/path/to/data_folder/"
>>> dataset = pq.ParquetDataset(a)
It threw the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
self.metadata_path) = _make_manifest(path_or_paths, self.fs)
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 601, in _make_manifest
.format(path))
OSError: Passed non-file path: s3://my_bucker/path/to/data_folder/
Based on pyarrow's documentation, I tried using s3fs as the file system, i.e.:
>>> dataset = pq.ParquetDataset(a,filesystem=s3fs)
which throws the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 502, in __init__
self.metadata_path) = _make_manifest(path_or_paths, self.fs)
File "/home/my_username/anaconda3/lib/python3.6/site-packages/pyarrow/parquet.py", line 583, in _make_manifest
if is_string(path_or_paths) and fs.isdir(path_or_paths):
AttributeError: module 's3fs' has no attribute 'isdir'
Since I'm restricted to an ECS cluster, spark/pyspark is not an option.
Is there a way to easily read the parquet files from these partitioned directories in Python? I don't think listing all the directories and then reading each one, as suggested in this link, is good practice. I need to convert the data into a pandas dataframe for further processing, so I would prefer an option based on fastparquet or pyarrow, though I'm open to other Python options as well.