Loading a slow-to-read SQL Server table into a pandas DataFrame

Question


7

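pandas becomes very slow when loading more than 10 million records from a SQL Server database using pyodbc, mainly through pandas.read_sql(query, pyodbc_conn). The code below takes 40-45 minutes to load 10-15 million records from the SQL table Table1.

Is there a better, faster way to read a SQL table into a pandas DataFrame?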

import pyodbc
import pandas

# Connection details (placeholders)
server = '<server_ip>'
database = '<db_name>'
username = '<db_user>'
password = '<password>'
port = '1443'
conn = pyodbc.connect(
    'DRIVER={SQL Server};SERVER=' + server + ';PORT=' + port
    + ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password
)
cursor = conn.cursor()

data = pandas.read_sql("select * from Table1", conn)  # takes about 40-45 minutes to complete

- Anjana Shivangi

1

Try reading in chunks. - BENY

Does rows = cursor.execute("select * from Table1").fetchall() take a similar amount of time? - Gord Thompson

@W-B Chunks did not solve the time problem. It still takes a long time to read. - Anjana Shivangi

@GordThompson Thanks. I tried reading with execute() and fetchall() on the pyodbc cursor object, but converting the result to a pandas DataFrame takes a long time. See the link. - Anjana Shivangi
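
The comments above try to narrow down whether the time goes into fetching rows or into building the DataFrame. A minimal sketch of timing the two steps separately follows. It uses an in-memory sqlite3 database as a stand-in for the pyodbc connection so that it runs anywhere; the table contents and column names are hypothetical. With pyodbc, fetchall() returns pyodbc.Row objects, which may need converting to tuples before being passed to pandas.

```python
import sqlite3
import time

import pandas as pd

# sqlite3 stands in for the pyodbc connection (assumption, for a runnable sketch);
# the table contents and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?)",
    ((i, f"row{i}") for i in range(100_000)),
)

# Step 1: time the raw fetch (on a real server this covers driver and network cost).
t0 = time.perf_counter()
cursor = conn.execute("SELECT * FROM Table1")
columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
fetch_seconds = time.perf_counter() - t0

# Step 2: time the DataFrame construction on its own.
t0 = time.perf_counter()
df = pd.DataFrame.from_records(rows, columns=columns)
build_seconds = time.perf_counter() - t0

print(f"fetch: {fetch_seconds:.3f}s, build: {build_seconds:.3f}s, rows: {len(df)}")
```

If the fetch dominates, the bottleneck is likely the driver, the network, or the server; if the build dominates, the conversion into pandas is the part worth optimizing.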

1 Answer


- Sai Praneeth · Accepted Answer

I had the same problem with even more rows, about 50 million. I ended up writing a SQL query and storing the results in an .h5 file.

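import pandas as pd

# con is an open database connection (for example, the pyodbc connection from the question)
# with chunksize set, read_sql returns an iterator of DataFrames instead of one large frame
sql_reader = pd.read_sql("select * from table_a", con, chunksize=10**5)

hdf_fn = '/path/to/result.h5'
hdf_key = 'my_huge_df'
store = pd.HDFStore(hdf_fn)
cols_to_index = ['<column_to_index>']  # list of columns that we want to index in the HDF5 file

# append each chunk to the HDF5 table; index=False defers indexing until all data is written
for chunk in sql_reader:
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()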

Stored this way, the data can be read back faster than with pandas.read_csv.
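
For completeness, here is a hedged sketch of reading such a file back. It assumes PyTables is installed (pandas needs it for HDF5 support) and uses a small made-up DataFrame with hypothetical columns id and value in place of the SQL chunks. Because id is declared as a data column, it can be used in a where filter so that only matching rows are read from disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in data; in the answer above the chunks come from pd.read_sql.
df = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})

hdf_fn = os.path.join(tempfile.mkdtemp(), "result.h5")
hdf_key = "my_huge_df"

with pd.HDFStore(hdf_fn) as store:
    # Append in two chunks, as the loop above does, marking "id" as a queryable data column.
    store.append(hdf_key, df.iloc[:500], data_columns=["id"], index=False)
    store.append(hdf_key, df.iloc[500:], data_columns=["id"], index=False)
    # Build the index once, after all chunks are written.
    store.create_table_index(hdf_key, columns=["id"], optlevel=9, kind="full")

# Read the whole table back ...
full = pd.read_hdf(hdf_fn, hdf_key)

# ... or only the rows matching a condition on an indexed data column.
subset = pd.read_hdf(hdf_fn, hdf_key, where="id >= 990")

print(len(full), len(subset))
```

Filtering with where is only possible on columns that were passed as data_columns when the table was written, which is why the answer sets cols_to_index up front.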