How to efficiently write a Pandas DataFrame to Google BigQuery?

Question

python pandas google-bigquery google-cloud-storage google-cloud-python

39

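I am trying to upload a pandas.DataFrame to Google BigQuery using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes, while uploading directly to Google Cloud Storage takes less than a minute. I plan to upload a batch of DataFrames (~32), each of similar size, so I want to know which approach is faster. This is the script I am using: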

dataframe.to_gbq('my_dataset.my_table', 
                 'my_project_id',
                 chunksize=None, # I have tried with several chunk sizes, it runs faster when it's one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )

dataframe.to_csv(str(month) + '_file.csv') # the file size is 37.3 MB, this takes almost 2 seconds
# manually upload the file into GCS GUI
print(dataframe.shape)
# (363364, 21)

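My question is: which approach is faster?

1. Uploading the DataFrame with the pandas.DataFrame.to_gbq() function
2. Saving the DataFrame as a CSV file and then uploading it as a file to BigQuery using the Python API
3. Saving the DataFrame as a CSV file, uploading the file to Google Cloud Storage using this procedure, and then reading it from BigQuery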

Update:

Alternative 1 seems faster than alternative 2 (using pd.DataFrame.to_csv() and load_data_from_file(), which took 17.9 seconds longer on average over 3 runs each):

from google.cloud import bigquery


def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    # Build the reference explicitly; Client.dataset() is deprecated in
    # newer google-cloud-bigquery releases.
    dataset_ref = bigquery.DatasetReference(bigquery_client.project, dataset_id)
    table_ref = dataset_ref.table(table_id)

    # This example uses CSV, but you can use other formats.
    # See https://cloud.google.com/bigquery/loading-data
    job_config = bigquery.LoadJobConfig()
    # 'text/csv' is not a valid source format; use the SourceFormat constant.
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.autodetect = True

    with open(source_file_name, 'rb') as source_file:
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))
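
For anyone wanting to reproduce the 3-run average mentioned in the update, here is a minimal timing sketch. The helper name average_seconds and its repeats parameter are illustrative and not part of the original post; it only measures wall-clock time around whatever zero-argument callable is passed in.

```python
import time
from statistics import mean


def average_seconds(upload, repeats=3):
    """Call `upload()` `repeats` times and return the mean wall-clock seconds.

    `upload` is any zero-argument callable (hypothetical name), for example
    a lambda wrapping one of the upload alternatives.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        upload()
        durations.append(time.perf_counter() - start)
    return mean(durations)
```

A call could look like average_seconds(lambda: load_data_from_file('my_dataset', 'my_table', 'file.csv')), with the same wrapper applied to the to_gbq() call so both alternatives are measured the same way.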

- Pablo

1

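I suggest you try the pydatalab package (your third approach). We achieved big speed improvements when downloading from BigQuery with that package compared with the pandas native function. - Nico Albers

Those times seem high. Which version of pandas-gbq are you using? Version 0.3.0 should be noticeably faster at uploading. - Maximilian

@NicoAlbers I would be surprised if there were a material difference between the libraries - I have found pandas-gbq to be similar or slightly faster. Do you have any examples? - Maximilian

I recently started a thread on performance between Python and BQ: https://github.com/pydata/pandas-gbq/issues/133 - Maximilian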

1

I just realized that my comparison was against an older version. I will rerun it as soon as I find time. - Nico Albers

3 Answers

13

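I also ran into performance problems with to_gbq(). I then tried the native Google client, which was much faster (about 4x), and about 20x faster if you skip the step that waits for the result.

Note that the best practice is to wait for the result and check it, but in my case later steps validate the result anyway.

I am using pandas_gbq version 0.15 (the latest at the time of writing). Try this: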

from google.cloud import bigquery
import pandas

df = pandas.DataFrame(
    {
        'my_string': ['a', 'b', 'c'],
        'my_int64': [1, 2, 3],
        'my_float64': [4.0, 5.0, 6.0],
        'my_timestamp': [
            pandas.Timestamp("1998-09-04T16:03:14"),
            pandas.Timestamp("2010-09-13T12:03:45"),
            pandas.Timestamp("2015-10-02T16:00:00")
        ],
    }
)

client = bigquery.Client()
table_id = 'my_dataset.new_table'

# Since string columns use the "object" dtype, pass in a (partial) schema
# to ensure the correct BigQuery data type.
job_config = bigquery.LoadJobConfig(schema=[
    bigquery.SchemaField("my_string", "STRING"),
])

job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)

# Wait for the load job to complete. (I omit this step)
# job.result()
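
If you do want to follow the wait-and-check practice mentioned above, a minimal sketch could look like this. The helper name wait_and_check is my own, and it assumes job is the load job returned by client.load_table_from_dataframe(); it relies only on the job's result(), errors and output_rows members.

```python
def wait_and_check(job):
    """Block until the load job finishes and return the number of rows loaded.

    `job` is assumed to be the load job returned by
    client.load_table_from_dataframe(); this helper is illustrative only.
    """
    job.result()  # Blocks until the job is done; raises if the job failed.
    if job.errors:
        # Treat any reported error as fatal (a strict choice made for this sketch).
        raise RuntimeError("Load job reported errors: {}".format(job.errors))
    return job.output_rows
```

Calling wait_and_check(job) right after the load gives up the speed gained by not waiting, so it is a trade-off between throughput and finding out early that a load failed.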

- Anonymous

Cool! Thanks. - igorkf

0

You can use pandas.DataFrame.to_gbq()

Here is the documentation

- Marc

- enle lin · Accepted Answer

I compared alternatives 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write
# Rows are tuples rather than sets: sets are unordered and not row-like.
my_data = [(1, 2, 3)]
for i in range(0, 100000):
    my_data.append((1, 2, 3))
not_so_simple_dataframe = pd.DataFrame(data=my_data, columns=['a', 'b', 'c'])

#Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable', 
                 Context.default().project_id,
                 chunksize=10000, 
                 if_exists='append',
                 verbose=False
                 )
end = time.time()
print("time alternative 1 " + str(end - start))

#Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

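# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Define the BigQuery table. The snippet as posted never assigns `table`;
# this definition is an assumption based on the variable names above.
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)
table.create(schema=table_schema, overwrite=True)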

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))

Here are the results for n = {10000, 100000, 1000000}:

n       alternative_1  alternative_3
10000   30.72s         8.14s
100000  162.43s        70.64s
1000000 1473.57s       688.59s

Based on these results, alternative 3 is faster than alternative 1, by roughly 2.1x to 3.8x across the three sizes.
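
Outside Datalab, alternative 3 from the question (CSV to Google Cloud Storage, then a BigQuery load job) could be sketched with the google-cloud-storage and google-cloud-bigquery client libraries. This is not the code that was benchmarked above. The function names gcs_uri and load_csv_via_gcs and all their parameters are hypothetical, and the sketch assumes the CSV has a header row and that the bucket already exists.

```python
def gcs_uri(bucket_name, blob_name):
    """Build the gs:// URI that BigQuery load jobs accept."""
    return "gs://{}/{}".format(bucket_name, blob_name)


def load_csv_via_gcs(local_csv, bucket_name, blob_name, table_id):
    """Upload a local CSV to Cloud Storage, then load it into BigQuery."""
    # Imported here so gcs_uri() stays usable without the client libraries.
    from google.cloud import bigquery, storage

    # Step 1: copy the file into the (already existing) bucket.
    blob = storage.Client().bucket(bucket_name).blob(blob_name)
    blob.upload_from_filename(local_csv)

    # Step 2: ask BigQuery to load the object straight from Cloud Storage.
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # Assumption: the CSV has a header row.
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    job = bigquery.Client().load_table_from_uri(
        gcs_uri(bucket_name, blob_name), table_id, job_config=job_config
    )
    job.result()  # Wait for the load job to finish.
    return job.output_rows
```

When producing the CSV with dataframe.to_csv(), passing index=False avoids writing the DataFrame index as an extra unnamed column.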