Datatypes are not preserved when a pandas DataFrame is partitioned and saved as a Parquet file using pyarrow

Question

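When a pandas DataFrame is partitioned and saved as a Parquet dataset with pyarrow, the column datatypes are not preserved: the partition column comes back as category instead of int64.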

Case 1: Saving a partitioned dataset - datatypes are NOT preserved

# Saving a pandas DataFrame locally as a partitioned Parquet dataset using pyarrow
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test'
partition_cols = ['age']
print('Datatypes before saving the dataset')
print(df.dtypes)
# preserve_index=False keeps the pandas index out of the Arrow table
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path, partition_cols=partition_cols)


# Loading the partitioned Parquet dataset from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)

Output:

Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
name      object
age     category
dtype: object

Case 2: Non-partitioned dataset - datatypes are preserved

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print('Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow')
df = pd.DataFrame({'age': [77, 32, 234], 'name': ['agan', 'bbobby', 'test']})
path = 'test_without_partition'
print('Datatypes before saving the dataset')
print(df.dtypes)
# preserve_index=False keeps the pandas index out of the Arrow table
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_to_dataset(table, path)


# Loading the non-partitioned Parquet dataset from local storage
df = pq.ParquetDataset(path, filesystem=None).read_pandas().to_pandas()
print('\nDatatypes after loading the dataset')
print(df.dtypes)

Output:

Saving a Pandas Dataframe to Local as a parquet file without partitioning using pyarrow
Datatypes before saving the dataset
age      int64
name    object
dtype: object

Datatypes after loading the dataset
age      int64
name    object
dtype: object

- Naga Budigam

2 Answers

You can try this:

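import pyarrow as pa
import pyarrow.parquet as pq

# Convert the DataFrame (df from the question) to an Apache Arrow Table
table = pa.Table.from_pandas(df)

# Write a single, non-partitioned Parquet file
# (pass compression='brotli' to use Brotli instead of the default codec)
pq.write_table(table, 'file_name.parquet')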

- Avinash

- Naga Budigam · Accepted Answer

There is no obvious way to do this. Please refer to the JIRA issue below.

https://issues.apache.org/jira/browse/ARROW-6114
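
Until that is addressed, one possible workaround is to record the dtypes and column order before writing and cast back after loading. The sketch below is pandas-only and rests on assumptions: `loaded` is a hand-built stand-in for the frame returned by `ParquetDataset(...).read_pandas().to_pandas()` in Case 1, and `original_dtypes` and `original_columns` are hypothetical names for metadata you would have to keep yourself.

```python
import pandas as pd

# Stand-in for the frame loaded from the partitioned dataset in Case 1:
# the partition column arrives last and as a categorical.
loaded = pd.DataFrame({
    "name": ["agan", "bbobby", "test"],
    "age": pd.Categorical([77, 32, 234]),
})

# Assumed to have been recorded before the dataset was written
original_dtypes = {"age": "int64"}
original_columns = ["age", "name"]

# Cast the partition column back and restore the original column order
restored = loaded.astype(original_dtypes)[original_columns]
print(restored.dtypes)
```

This only repairs the frame after the fact; the dataset on disk still carries the partition values in directory names rather than as a typed column.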