How to speed up the pandas to_sql() function?

Question

python sql performance pandas import

9

I have a 1,000,000 x 50 pandas DataFrame that I am currently writing to a SQL table with the following code:

df.to_sql('my_table', con, index=False)

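This takes a very long time. I have seen various approaches online for speeding this up, but none of them seem to work for MSSQL.

If I try the method from:

Bulk Insert A Pandas DataFrame Using SQLAlchemy

then I get a no attribute copy_from error.
If I try the multithreading method from: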

http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html

then I get a QueuePool limit of size 5 overflow 10 reach, connection timed out error.

Is there any easy way to speed up writing to an MSSQL table? Whether through BULK COPY or some other method, but entirely from within Python code?

- user1566200

Are you writing to an existing table or creating it? - MaxU - stand with Ukraine

I would use this or a similar method - BCP should be very fast. - MaxU - stand with Ukraine

3 Answers

6

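I used ctds to do a bulk insert, which is much faster with SQL Server. In the example below, df is the pandas DataFrame. The column order in the DataFrame is identical to the schema in mydb.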

import ctds

conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', (df.to_records(index=False).tolist()))

- Babu Arunachalam

I use this every day and it is very, very fast! - Caio Belfort
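For a frame as large as the one in the question, df.to_records(index=False).tolist() builds every row in memory before the insert starts. The following is a minimal sketch of one way to pass the rows to bulk_insert in smaller lists instead. The helper name chunked and the batch size of 10,000 are hypothetical and are not part of ctds; transaction handling is left out, as in the answer above.

```python
from itertools import islice


def chunked(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


# Hypothetical usage with the ctds connection and DataFrame from the answer:
#
#   rows = df.itertuples(index=False, name=None)  # plain tuples, one per row
#   for batch in chunked(rows, 10_000):
#       conn.bulk_insert('table', batch)
```

Each batch is an ordinary list of tuples, the same shape the original call passes, so only the amount of data held in memory at one time changes.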

3

I had the same problem, so I used SQLAlchemy with fast_executemany.

from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

# Enable pyodbc's fast_executemany whenever the engine issues an executemany call.
@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch from executemany to fast_executemany
        cursor.commit()

Make sure the function above is defined after the engine variable is created and before the cursor executes.

conn = engine.connect()
df.to_sql('table', con=conn, if_exists='append', index=False)  # for reference, see the pandas to_sql documentation

- rohit singh

Adding the decorator causes the following problem: ('HY090', '[HY090] [Microsoft][ODBC Driver Manager] Invalid string or buffer length (0) (SQLBindParameter)'). - Led
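An alternative to the event listener, for readers on SQLAlchemy 1.3 or later with the mssql+pyodbc dialect, is to pass fast_executemany=True directly to create_engine. The sketch below is an illustration under that assumption and is not the answer author's code; the helper names mssql_url and make_fast_engine and the connection string are hypothetical. It is not claimed to resolve the HY090 error reported above.

```python
from urllib.parse import quote_plus


def mssql_url(odbc_connection_string):
    """Wrap a raw ODBC connection string in a SQLAlchemy mssql+pyodbc URL."""
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc_connection_string)


def make_fast_engine(odbc_connection_string):
    """Create an engine with pyodbc's fast_executemany switched on."""
    # Imported here so that mssql_url can be used without SQLAlchemy installed.
    from sqlalchemy import create_engine

    # fast_executemany is an option of the mssql+pyodbc dialect
    # (assumption: SQLAlchemy 1.3 or later and pyodbc are installed).
    return create_engine(mssql_url(odbc_connection_string), fast_executemany=True)


# Hypothetical usage; the driver, server, and database names are placeholders:
#
#   engine = make_fast_engine(
#       "DRIVER={ODBC Driver 17 for SQL Server};SERVER=server;DATABASE=mydb;"
#       "UID=user;PWD=password"
#   )
#   df.to_sql('table', con=engine, if_exists='append', index=False)
```

With this form there is no listener to register, so the ordering requirement between the engine and the decorated function does not arise.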


- Yuval Cohen Hadad · Accepted Answer

As of pandas 0.24 you can set method to 'multi' and chunksize to 1000, which is the SQL Server limit on rows in a single INSERT statement.

chunksize=1000, method='multi'

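See https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method. This parameter was added in version 0.24.0.

The method parameter controls the SQL insertion clause used. Possible values are:

None: uses the standard SQL INSERT clause (one per row).
'multi': passes multiple values in a single INSERT clause. It uses a special SQL syntax that is not supported by all backends. This usually gives better performance for analytic databases such as Presto and Redshift, but performs worse on traditional SQL backends if the table contains many columns. For more information, check the SQLAlchemy documentation.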
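One caveat when applying chunksize=1000 to a wide table: besides the 1000-row limit per INSERT, SQL Server accepts at most 2100 parameters per statement, and with method='multi' each cell is sent as one bound parameter. For the 50-column frame in the question, 1000 rows would need 50,000 parameters. The sketch below picks a chunk size that respects both limits. The function name safe_chunksize and the safety margin of 10 parameters are assumptions for illustration, not part of pandas.

```python
def safe_chunksize(n_columns, max_rows=1000, max_params=2100, margin=10):
    """Largest chunksize for method='multi' that stays under SQL Server's limits.

    max_rows   -- rows allowed in one INSERT ... VALUES statement
    max_params -- parameters allowed in one statement
    margin     -- hypothetical headroom kept free below max_params
    """
    if n_columns < 1:
        raise ValueError("n_columns must be at least 1")
    rows_by_params = (max_params - margin) // n_columns
    # Never return less than 1, even for extremely wide tables.
    return max(1, min(max_rows, rows_by_params))


# Hypothetical usage with the DataFrame and connection from the question:
#
#   df.to_sql('my_table', con, index=False,
#             chunksize=safe_chunksize(len(df.columns)), method='multi')
```

For 50 columns this gives a chunk size of 41, which is far smaller than 1000, so it is worth timing method='multi' against the other answers before settling on it.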