How to see the progress bar of read_csv

Question

13

I am trying to read a 100 GB CSV file, and I would like to see a progress bar while the file is being read.

file = pd.read_csv("../code/csv/file.csv")

Something like: =====> 30%
Is there a way to see a progress bar while read_csv, or any other file read, is running?

- user11173832

3

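It depends on how you are reading the file. If you are iterating, tqdm or progressbar2 can handle it, but for a single atomic operation it is usually hard to get a progress bar (because you cannot actually get inside the operation to see how far along it is). I think tqdm has some workarounds for HTTP requests, but I don't believe pandas has that kind of functionality. - Green Cloak Guy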

3

I would suggest just using "chunk" (the chunksize parameter). - BENY

1

Possible duplicate of "How to solve memory issues while reading a large CSV file with pandas". - Billal Begueradj

2 Answers

2

The typer module gives nicely formatted output. I tested this in a Jupyter Notebook on a large delimited text file with 618k rows.


from pathlib import Path

import pandas as pd
import typer
from tqdm.auto import tqdm

txt = Path("<path-to-massive-delimited-txt-file>").resolve()

# count the data rows quickly (subtract 1 for the header line)
with open(txt, "r") as f:
    length = sum(1 for _ in f) - 1

# define a chunksize
chunksize = 5000

# collect the chunks in a list (DataFrame.append was removed in pandas 2.0)
chunks = []

# fancy logging with typer
typer.secho(f"Reading file: {txt}", fg="red", bold=True)
typer.secho(f"total rows: {length}", fg="green", bold=True)

# tqdm context
with tqdm(total=length, desc="rows read: ") as bar:
    # iterate over the chunks; low_memory=False avoids mixed-type inference within a chunk
    for i, chunk in enumerate(pd.read_csv(txt, chunksize=chunksize, low_memory=False)):

        # print the chunk number
        print(i)

        # keep the chunk for the final concatenation
        chunks.append(chunk)

        # advance the bar by the rows actually read (the last chunk may be shorter)
        bar.update(len(chunk))

        # 6 chunks are enough to test
        if i == 5:
            break

# build the dataframe once, which is cheaper than appending inside the loop
df = pd.concat(chunks)

# finally inform with a friendly message
typer.secho("end of reading chunks...", fg=typer.colors.BRIGHT_RED)
typer.secho(f"Dataframe length:{len(df)}", fg="green", bold=True)

[Image: Jupyter Notebook output]

- OzInClouds

- Ofer Rahat · Accepted Answer

The idea is to read a few lines from the large file to estimate the size of a line, and then iterate over the file in chunks.

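import os
import sys

import pandas as pd
from tqdm import tqdm


# BASE_PATH is assumed to be defined elsewhere (directory prefix of the input file)
INPUT_FILENAME = f"{BASE_PATH}betas_R_SWAN_offset_100.csv.gz"
LINES_TO_READ_FOR_ESTIMATION = 20
CHUNK_SIZE_PER_ITERATION = 10**5


# read a small sample and measure its size as CSV text
temp = pd.read_csv(INPUT_FILENAME,
                   nrows=LINES_TO_READ_FOR_ESTIMATION)
N = len(temp.to_csv(index=False))
df = [temp[:0]]
# estimated number of chunks = file size / size per line / lines per chunk
# (for a compressed file, getsize returns the compressed size, so this is only a rough estimate)
t = int(os.path.getsize(INPUT_FILENAME)/N*LINES_TO_READ_FOR_ESTIMATION/CHUNK_SIZE_PER_ITERATION) + 1


with tqdm(total = t, file = sys.stdout) as pbar:
    for i,chunk in enumerate(pd.read_csv(INPUT_FILENAME, chunksize=CHUNK_SIZE_PER_ITERATION, low_memory=False)):
        df.append(chunk)
        pbar.set_description('Importing: %d' % (1 + i))
        pbar.update(1)

# DataFrame.append was removed in pandas 2.0; concatenate the collected chunks instead
data = pd.concat(df)
del df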
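
A possible variation, for an uncompressed CSV: instead of estimating the number of chunks from a sample, the bar can follow the byte position of the open file, so the total is just the file size and no extra pass over the data is needed. This is only a sketch under assumptions: the helper name read_csv_with_progress is hypothetical, the file must not be compressed (the handle then yields compressed bytes), and the position reflects buffered read-ahead, so the bar moves in steps rather than row by row.

```python
import os

import pandas as pd
from tqdm import tqdm


def read_csv_with_progress(path, chunksize=100_000, **read_csv_kwargs):
    """Read an uncompressed CSV in chunks, showing progress in bytes.

    Hypothetical helper: the name and signature are illustrative only.
    """
    total_bytes = os.path.getsize(path)
    chunks = []
    with open(path, "rb") as handle, tqdm(
        total=total_bytes, unit="B", unit_scale=True, desc="Reading"
    ) as bar:
        reader = pd.read_csv(handle, chunksize=chunksize, **read_csv_kwargs)
        for chunk in reader:
            chunks.append(chunk)
            # handle.tell() is how far the file has been consumed so far;
            # advance the bar by the difference from its current count
            bar.update(handle.tell() - bar.n)
        # make sure the bar ends at 100% once the file is exhausted
        bar.update(total_bytes - bar.n)
    # concatenate once at the end instead of growing a DataFrame in the loop
    return pd.concat(chunks, ignore_index=True)
```

Usage would look like data = read_csv_with_progress("file.csv", chunksize=10**5). Extra keyword arguments are passed straight through to pd.read_csv.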