Datatypes are not preserved when a pandas DataFrame is partitioned and saved as a Parquet file using pyarrow

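When I partition a pandas DataFrame and save it as a Parquet dataset using pyarrow, the datatypes are not preserved: the partition column comes back with a different dtype after loading.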

Case 1: Saving a partitioned dataset - datatypes are NOT preserved

# Save a pandas DataFrame locally as a partitioned Parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
# preserve_index is an option of Table.from_pandas, so it is set here
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path, partition_cols=partition_cols)


# Load the partitioned Parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)

Output:

Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
name      object
age     category
dtype: object

Case 2: Non-partitioned dataset - datatypes are preserved
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
# preserve_index is an option of Table.from_pandas, so it is set here
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path)


# Load the non-partitioned Parquet dataset from local disk
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)

Output:

Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
age      int64
name    object
dtype: object
2 Answers

You can try this:
import pyarrow as pa
import pyarrow.parquet as pq

# Convert DataFrame to Apache Arrow Table
table = pa.Table.from_pandas(df)

# Write the table to a single Parquet file (Snappy compression by default; pass compression='brotli' for Brotli)
pq.write_table(table, 'file_name.parquet')
