Append a new DataFrame to an existing Excel sheet using Python Pandas

68

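I currently have this code, and it runs perfectly.

It loops through the Excel files in a folder, removes the first two rows, saves each one as an individual Excel file, and also saves all the files from the loop as a single appended file.

Currently, the appended file overwrites the existing file every time I run the code.

I need to append the new data to the bottom of the already existing Excel sheet ('master_data.xlsx').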

import os
import pandas as pd

dfList = []
path = 'C:\\Test\\TestRawFile' 
newpath = 'C:\\Path\\To\\New\\Folder'

for fn in os.listdir(path): 
  # Absolute file path
  file = os.path.join(path, fn)
  if os.path.isfile(file): 
    # Import the excel file and call it xlsx_file 
    xlsx_file = pd.ExcelFile(file) 
    # View the excel files sheet names 
    xlsx_file.sheet_names 
    # Load the xlsx files Data sheet as a dataframe 
    df = xlsx_file.parse('Sheet1',header= None) 
    df_NoHeader = df[2:] 
    data = df_NoHeader 
    # Save individual dataframe
    data.to_excel(os.path.join(newpath, fn))

    dfList.append(data) 

appended_data = pd.concat(dfList)
appended_data.to_excel(os.path.join(newpath, 'master_data.xlsx'))

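I thought this would be a simple task, but apparently not. I think I need to bring in the master_data.xlsx file as a dataframe, then match the index up with the newly appended data, and save it back out. Or maybe there is an easier way. Any help is appreciated.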
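The read, concatenate, and rewrite idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `append_by_rewrite` is hypothetical, the master file is assumed to hold headerless data (as produced by `header=None` parsing in the code above), and openpyxl is assumed to be installed.

```python
import os
import pandas as pd


def append_by_rewrite(master_path, new_df):
    """Hypothetical helper: rewrite the master file with new_df added at the bottom."""
    if os.path.isfile(master_path):
        # Load the rows already stored in the master file (no header row assumed)
        existing = pd.read_excel(master_path, header=None)
        # ignore_index=True renumbers the rows so the indexes do not clash
        combined = pd.concat([existing, new_df], ignore_index=True)
    else:
        # First run: nothing to merge with yet
        combined = new_df.reset_index(drop=True)
    # Overwrite the file with the combined data, without header or index columns
    combined.to_excel(master_path, header=False, index=False)
    return combined
```

Note that this rereads and rewrites the whole workbook on every call, which may be slow for large files.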


1
Is this what you want? - MaxU - stand with Ukraine
3
Not exactly, I am not trying to save new sheets, just trying to append to the existing sheet. - brandog
11 Answers

121

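UPDATE [2022-01-08]: starting from version 1.4.0, Pandas supports appending to an existing Excel sheet while preserving the old contents, "out of the box"!

Good job, Pandas team!

The following is excerpted from the ExcelWriter documentation: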

if_sheet_exists : {'error', 'new', 'replace', 'overlay'}, default 'error'
    How to behave when trying to write to a sheet that already
    exists (append mode only).
    ...
    * overlay: Write contents to the existing sheet without removing the old contents.

    .. versionadded:: 1.3.0
    .. versionchanged:: 1.4.0
       Added ``overlay`` option

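For Pandas versions < 1.4.0, use the helper function below to append a Pandas DataFrame to an existing Excel file.

If the Excel file doesn't exist, it will be created.


UPDATE [2021-09-12]: fixed for Pandas 1.3.0+

The following function has been tested with: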

  • Pandas 1.3.2
  • OpenPyxl 3.0.7

from pathlib import Path
from copy import copy
from typing import Union, Optional
import numpy as np
import pandas as pd
import openpyxl
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter


def copy_excel_cell_range(
        src_ws: openpyxl.worksheet.worksheet.Worksheet,
        min_row: int = None,
        max_row: int = None,
        min_col: int = None,
        max_col: int = None,
        tgt_ws: openpyxl.worksheet.worksheet.Worksheet = None,
        tgt_min_row: int = 1,
        tgt_min_col: int = 1,
        with_style: bool = True
) -> openpyxl.worksheet.worksheet.Worksheet:
    """
    copies all cells from the source worksheet [src_ws] starting from [min_row] row
    and [min_col] column up to [max_row] row and [max_col] column
    to target worksheet [tgt_ws] starting from [tgt_min_row] row
    and [tgt_min_col] column.

    @param src_ws:  source worksheet
    @param min_row: smallest row index in the source worksheet (1-based index)
    @param max_row: largest row index in the source worksheet (1-based index)
    @param min_col: smallest column index in the source worksheet (1-based index)
    @param max_col: largest column index in the source worksheet (1-based index)
    @param tgt_ws:  target worksheet.
                    If None, then the copy will be done to the same (source) worksheet.
    @param tgt_min_row: target row index (1-based index)
    @param tgt_min_col: target column index (1-based index)
    @param with_style:  whether to copy cell style. Default: True

    @return: target worksheet object
    """
    if tgt_ws is None:
        tgt_ws = src_ws

    # https://dev59.com/xWAg5IYBdhLWcg3wq8aU#34838233
    for row in src_ws.iter_rows(min_row=min_row, max_row=max_row,
                                min_col=min_col, max_col=max_col):
        for cell in row:
            tgt_cell = tgt_ws.cell(
                row=cell.row + tgt_min_row - 1,
                column=cell.col_idx + tgt_min_col - 1,
                value=cell.value
            )
            if with_style and cell.has_style:
                # tgt_cell._style = copy(cell._style)
                tgt_cell.font = copy(cell.font)
                tgt_cell.border = copy(cell.border)
                tgt_cell.fill = copy(cell.fill)
                tgt_cell.number_format = copy(cell.number_format)
                tgt_cell.protection = copy(cell.protection)
                tgt_cell.alignment = copy(cell.alignment)
    return tgt_ws


def append_df_to_excel(
        filename: Union[str, Path],
        df: pd.DataFrame,
        sheet_name: str = 'Sheet1',
        startrow: Optional[int] = None,
        max_col_width: int = 30,
        autofilter: bool = False,
        fmt_int: str = "#,##0",
        fmt_float: str = "#,##0.00",
        fmt_date: str = "yyyy-mm-dd",
        fmt_datetime: str = "yyyy-mm-dd hh:mm",
        truncate_sheet: bool = False,
        storage_options: Optional[dict] = None,
        **to_excel_kwargs
) -> None:
    """
    Append a DataFrame [df] to existing Excel file [filename]
    into [sheet_name] Sheet.
    If [filename] doesn't exist, then this function will create it.

    @param filename: File path or existing ExcelWriter
                     (Example: '/path/to/file.xlsx')
    @param df: DataFrame to save to workbook
    @param sheet_name: Name of sheet which will contain DataFrame.
                       (default: 'Sheet1')
    @param startrow: upper left cell row to dump data frame.
                     Per default (startrow=None) calculate the last row
                     in the existing DF and write to the next row...
    @param max_col_width: maximum column width in Excel. Default: 30
    @param autofilter: boolean - whether add Excel autofilter or not. Default: False
    @param fmt_int: Excel format for integer numbers
    @param fmt_float: Excel format for float numbers
    @param fmt_date: Excel format for dates
    @param fmt_datetime: Excel format for datetime's
    @param truncate_sheet: truncate (remove and recreate) [sheet_name]
                           before writing DataFrame to Excel file
    @param storage_options: dict, optional
        Extra options that make sense for a particular storage connection, e.g. host, port,
        username, password, etc., if using a URL that will be parsed by fsspec, e.g.,
        starting “s3://”, “gcs://”.
    @param to_excel_kwargs: arguments which will be passed to `DataFrame.to_excel()`
                            [can be a dictionary]
    @return: None

    Usage examples:

    >>> append_df_to_excel('/tmp/test.xlsx', df, autofilter=True,
                           freeze_panes=(1,0))

    >>> append_df_to_excel('/tmp/test.xlsx', df, header=None, index=False)

    >>> append_df_to_excel('/tmp/test.xlsx', df, sheet_name='Sheet2',
                           index=False)

    >>> append_df_to_excel('/tmp/test.xlsx', df, sheet_name='Sheet2',
                           index=False, startrow=25)

    >>> append_df_to_excel('/tmp/test.xlsx', df, index=False,
                           fmt_datetime="dd.mm.yyyy hh:mm")

    (c) [MaxU](https://stackoverflow.com/users/5741205/maxu?tab=profile)
    """
    def set_column_format(ws, column_letter, fmt):
        for cell in ws[column_letter]:
            cell.number_format = fmt
    filename = Path(filename)
    file_exists = filename.is_file()
    # process parameters
    # calculate first column number
    # if the DF will be written using `index=True`, then `first_col = 2`, else `first_col = 1`
    first_col = int(to_excel_kwargs.get("index", True)) + 1
    # ignore [engine] parameter if it was passed
    if 'engine' in to_excel_kwargs:
        to_excel_kwargs.pop('engine')
    # save content of existing sheets
    if file_exists:
        wb = load_workbook(filename)
        sheet_names = wb.sheetnames
        sheet_exists = sheet_name in sheet_names
        sheets = {ws.title: ws for ws in wb.worksheets}

    with pd.ExcelWriter(
        filename.with_suffix(".xlsx"),
        engine="openpyxl",
        mode="a" if file_exists else "w",
        if_sheet_exists="new" if file_exists else None,
        date_format=fmt_date,
        datetime_format=fmt_datetime,
        storage_options=storage_options
    ) as writer:
        if file_exists:
            # try to open an existing workbook
            writer.book = wb
            # get the last row in the existing Excel sheet
            # if it was not specified explicitly
            if startrow is None and sheet_name in writer.book.sheetnames:
                startrow = writer.book[sheet_name].max_row
            # truncate sheet
            if truncate_sheet and sheet_name in writer.book.sheetnames:
                # index of [sheet_name] sheet
                idx = writer.book.sheetnames.index(sheet_name)
                # remove [sheet_name]
                writer.book.remove(writer.book.worksheets[idx])
                # create an empty sheet [sheet_name] using old index
                writer.book.create_sheet(sheet_name, idx)
            # copy existing sheets
            writer.sheets = sheets
        else:
            # file doesn't exist, we are creating a new one
            startrow = 0

        # write out the DataFrame to an ExcelWriter
        df.to_excel(writer, sheet_name=sheet_name, **to_excel_kwargs)
        worksheet = writer.sheets[sheet_name]

        if autofilter:
            worksheet.auto_filter.ref = worksheet.dimensions

        for xl_col_no, dtyp in enumerate(df.dtypes, first_col):
            col_no = xl_col_no - first_col
            width = max(df.iloc[:, col_no].astype(str).str.len().max(),
                        len(df.columns[col_no]) + 6)
            width = min(max_col_width, width)
            column_letter = get_column_letter(xl_col_no)
            worksheet.column_dimensions[column_letter].width = width
            if np.issubdtype(dtyp, np.integer):
                set_column_format(worksheet, column_letter, fmt_int)
            if np.issubdtype(dtyp, np.floating):
                set_column_format(worksheet, column_letter, fmt_float)

    if file_exists and sheet_exists:
        # move (append) rows from new worksheet to the `sheet_name` worksheet
        wb = load_workbook(filename)
        # retrieve generated worksheet name
        new_sheet_name = set(wb.sheetnames) - set(sheet_names)
        if new_sheet_name:
            new_sheet_name = list(new_sheet_name)[0]
        # copy rows written by `df.to_excel(...)` to the existing [sheet_name] worksheet
        copy_excel_cell_range(
            src_ws=wb[new_sheet_name],
            tgt_ws=wb[sheet_name],
            tgt_min_row=startrow + 1,
            with_style=True
        )
        # remove new (generated by Pandas) worksheet
        del wb[new_sheet_name]
        wb.save(filename)
        wb.close()

Old version (tested with Pandas 1.2.3 and Openpyxl 3.0.5):

import os
import pandas as pd
from openpyxl import load_workbook


def append_df_to_excel(filename, df, sheet_name='Sheet1', startrow=None,
                       truncate_sheet=False, 
                       **to_excel_kwargs):
    """
    Append a DataFrame [df] to existing Excel file [filename]
    into [sheet_name] Sheet.
    If [filename] doesn't exist, then this function will create it.

    @param filename: File path or existing ExcelWriter
                     (Example: '/path/to/file.xlsx')
    @param df: DataFrame to save to workbook
    @param sheet_name: Name of sheet which will contain DataFrame.
                       (default: 'Sheet1')
    @param startrow: upper left cell row to dump data frame.
                     Per default (startrow=None) calculate the last row
                     in the existing DF and write to the next row...
    @param truncate_sheet: truncate (remove and recreate) [sheet_name]
                           before writing DataFrame to Excel file
    @param to_excel_kwargs: arguments which will be passed to `DataFrame.to_excel()`
                            [can be a dictionary]
    @return: None

    Usage examples:

    >>> append_df_to_excel('d:/temp/test.xlsx', df)

    >>> append_df_to_excel('d:/temp/test.xlsx', df, header=None, index=False)

    >>> append_df_to_excel('d:/temp/test.xlsx', df, sheet_name='Sheet2',
                           index=False)

    >>> append_df_to_excel('d:/temp/test.xlsx', df, sheet_name='Sheet2', 
                           index=False, startrow=25)

    (c) [MaxU](https://stackoverflow.com/users/5741205/maxu?tab=profile)
    """
    # Excel file doesn't exist - saving and exiting
    if not os.path.isfile(filename):
        df.to_excel(
            filename,
            sheet_name=sheet_name, 
            startrow=startrow if startrow is not None else 0, 
            **to_excel_kwargs)
        return
    
    # ignore [engine] parameter if it was passed
    if 'engine' in to_excel_kwargs:
        to_excel_kwargs.pop('engine')

    writer = pd.ExcelWriter(filename, engine='openpyxl', mode='a')

    # try to open an existing workbook
    writer.book = load_workbook(filename)
    
    # get the last row in the existing Excel sheet
    # if it was not specified explicitly
    if startrow is None and sheet_name in writer.book.sheetnames:
        startrow = writer.book[sheet_name].max_row

    # truncate sheet
    if truncate_sheet and sheet_name in writer.book.sheetnames:
        # index of [sheet_name] sheet
        idx = writer.book.sheetnames.index(sheet_name)
        # remove [sheet_name]
        writer.book.remove(writer.book.worksheets[idx])
        # create an empty sheet [sheet_name] using old index
        writer.book.create_sheet(sheet_name, idx)
    
    # copy existing sheets
    writer.sheets = {ws.title:ws for ws in writer.book.worksheets}

    if startrow is None:
        startrow = 0

    # write out the new sheet
    df.to_excel(writer, sheet_name, startrow=startrow, **to_excel_kwargs)

    # save the workbook
    writer.save()

Usage examples:
filename = r'C:\OCC.xlsx'

append_df_to_excel(filename, df)

append_df_to_excel(filename, df, header=None, index=False)

append_df_to_excel(filename, df, sheet_name='Sheet2', index=False)

append_df_to_excel(filename, df, sheet_name='Sheet2', index=False, startrow=25)


If you don't want to duplicate the column names, you can specify header=None...

UPDATE: you may also want to check this old solution


2
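It just overwrites the file at the specified index instead of appending to the end of the existing file. - keramat
2
Starting with pandas 1.2.0, this code causes a problem (it worked fine up to 1.1.5): it raises a BadZipFile exception, because when you instantiate pd.ExcelWriter it creates an empty 0-byte file and overwrites the existing one. You must specify mode='a'. See https://stackoverflow.com/a/66585065/4046632 and https://dev59.com/_r_qa4cB1Zd3GeqPOLzB. - buran
1
@Priya, glad it helped. :) There were changes in Pandas 1.2.0 that caused the old implementation of my function to raise a BadZipFile exception, so I had to adapt it to the newer Pandas versions. - MaxU - stand with Ukraine
1
@Priya, the parameter truncate_sheet=True can be used to clear the old contents of the sheet before writing the DataFrame to it. - MaxU - stand with Ukraine
1
This solution works with pandas 1.2.3, but with pandas 1.3.0 it produces the following error: ValueError: Sheet 'Sheet1' already exists and if_sheet_exists is set to 'error'. Looking at the source code, the if_sheet_exists parameter was added in 1.3.0. The behavior of append mode seems to have changed to appending a new sheet rather than appending data to the same sheet. Does anyone have a workaround? - Chris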

25
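If you aren't strictly looking for an Excel file, you can write the output as a CSV file and then copy the CSV into a new Excel file. Note: according to this answer, this only works when you have fewer than 1000 columns, since CSV has a limit on the number of columns you can write.

df.to_csv('filepath', mode='a', index = False, header=None)

mode='a' means append. This is a roundabout way, but it works!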
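
The two steps of this workaround (append to a CSV, then copy it into a fresh Excel file) could be sketched as below. The helper names `append_rows_to_csv` and `csv_to_excel` and the `column_names` parameter are assumptions made for illustration, and the conversion step assumes openpyxl is installed.

```python
import pandas as pd


def append_rows_to_csv(csv_path, df):
    """Hypothetical helper: add the rows of df to the end of the CSV, without a header."""
    df.to_csv(csv_path, mode="a", index=False, header=False)


def csv_to_excel(csv_path, xlsx_path, column_names):
    """Hypothetical helper: rebuild an Excel file from the accumulated CSV."""
    # The CSV carries no header row, so the column names are supplied here
    data = pd.read_csv(csv_path, header=None, names=column_names)
    data.to_excel(xlsx_path, index=False)
    return data
```

The Excel file is regenerated from scratch each time, so any formatting applied to it by hand would be lost.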

6

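Building on MaxU's and others' code and comments, but simplified to fix only the bug in pandas ExcelWriter that causes to_excel, in append mode, to create a new sheet rather than append to the existing one.

As others have noted, to_excel uses the ExcelWriter.sheets property, and this property is not populated by ExcelWriter.

The fix is a one-liner; otherwise the code is the standard pandas approach as documented in to_excel.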


    # xl_path is destination xlsx spreadsheet
    with pd.ExcelWriter(xl_path, 'openpyxl', mode='a') as writer:
        # fix line
        writer.sheets = dict((ws.title, ws) for ws in writer.book.worksheets)
        df.to_excel(writer, sheet_name)


Remove the line: writer.sheets = dict((ws.title, ws) for ws in writer.book.worksheets). The existing worksheet will be overwritten. - stansy
Throws an error at writer.sheets = dict((ws.title, ws) for ws in writer.book.worksheets): AttributeError: can't set attribute - Aqua 4

6

This worked for me

import os
import openpyxl
import pandas as pd
from openpyxl.utils.dataframe import dataframe_to_rows

file = r"myfile.xlsx"

# scalar values need to be wrapped in lists (or an index passed) to build a one-row DataFrame
df = pd.DataFrame({'A': [1], 'B': [2]})

# create excel file
if os.path.isfile(file):  # if file already exists append to existing file
    workbook = openpyxl.load_workbook(file)  # load workbook if already exists
    sheet = workbook['my_sheet_name']  # declare the active sheet 

    # append the dataframe results to the current excel file
    for row in dataframe_to_rows(df, header = False, index = False):
        sheet.append(row)
    workbook.save(file)  # save workbook
    workbook.close()  # close workbook
else:  # create the excel file if doesn't already exist
    with pd.ExcelWriter(path = file, engine = 'openpyxl') as writer:
        df.to_excel(writer, index = False, sheet_name = 'my_sheet_name')

1
I tried so many things, and miraculously this one worked! Thanks. - Subaru Spirit

3
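If you use ExcelWriter on the sheet every time, it will overwrite the previous sheet, and all that will be visible is the last data sheet you appended to the workbook. Instead, you can maintain a counter that starts at 1, for which you need to initialize the Excel sheet and add the initial data to it using the existing approach: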
writer = pd.ExcelWriter(output_file, engine='openpyxl')

df = pd.read_excel(output_file, sheet_name='TestSheet1')

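Or you can use the following approach, which is the one I use. It loads the workbook the next time you want to use it; otherwise, trying to load it in the first case would throw a file-not-found exception.

Usage: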

from bs4 import BeautifulSoup
import requests
import pandas as pd
from openpyxl import load_workbook

urls = ["http://millenniumcricketleague.com/Home/ShowTeam.aspx?tid=22",
        "http://millenniumcricketleague.com/Home/ShowTeam.aspx?tid=40"]
path = r"F:\meta_1.xlsx"  # raw string so the backslash is not read as an escape
writer = pd.ExcelWriter(path,engine='openpyxl')
counter = 1
for url in urls:
    table_data = []
    final = []
    html_content = requests.get(url).text
    soup = BeautifulSoup(html_content, "lxml")
    x = soup.find_all('table')
    for table in x[1:]:
        for tr in table.find_all("tr"):
            newrow = []
            for td in tr.find_all("td"):
                newrow.append(td.text.replace('\n', ' ').strip())
            table_data.append(newrow)
    df = pd.DataFrame(table_data)
    sheetname = 'Sheet%s' % counter
    if(counter!=1):
        writer.book = load_workbook(path)
    df.to_excel(writer, sheet_name=sheetname)
    counter = counter + 1
    writer.save()


There is no need to close the ExcelWriter; it closes automatically. If you close it explicitly, a warning will be shown.

3
import pandas as pd
import openpyxl

workbook = openpyxl.load_workbook("test.xlsx")
writer = pd.ExcelWriter('test.xlsx', engine='openpyxl')
writer.book = workbook
writer.sheets = dict((ws.title, ws) for ws in workbook.worksheets)
data_df.to_excel(writer, 'Existing_sheetname')
writer.save()
writer.close()

6
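Code dumps do not make for good answers. You should explain how and why this solves their problem. I recommend reading "How do I write a good answer?". This can help future users learn and eventually apply that knowledge to their own code. You are also likely to get positive feedback/upvotes when the code is explained. - John Conde
1
Agreed, some explanation here would be super helpful! - Kris
1
This is a misleading answer. This code only changes the first row in the Excel file... Please try your code and make sure it works before sharing it. - ati ince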

0
from openpyxl import load_workbook
wb = load_workbook(filepath)
ws = wb["Sheet1"]
df = dataframe.values.tolist()
for i in range(len(df)):
    ws.append(df[i])
wb.save(filepath)

Hi, it would be great if you could help us understand how your code solves the OP's problem! - Simas Joneliunas

0

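This question has been around for a while. The answers are OK, but I believe this will solve most people's problems.

Simply use glob to access the files in a specific directory, loop through them, create a dataframe of each file, append it to the previous one, then export to a folder. I also included commented-out code for handling CSV files.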

import os
import pandas as pd
import glob

# put in path to folder with files you want to append
# *.xlsx or *.csv will get all files of that type
path = "C:/Users/Name/Folder/*.xlsx"
#path = "C:/Users/Name/Folder/*.csv"

# initialize a empty df
appended_data = pd.DataFrame()

#loop through each file in the path
for file in glob.glob(path):
    print(file)

    # create a df of that file path
    df = pd.read_excel(file, sheet_name = 0)
    #df = pd.read_csv(file, sep=',')

    # append it (DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent)
    appended_data = pd.concat([appended_data, df])

appended_data

# export the appended data to a folder of your choice
exportPath = 'C:/My/EXPORT/PATH/appended_dataExport.csv'
appended_data.to_csv(os.path.join(exportPath),index=False)

0
Adding to what @david said: if you don't care about the index and can use a .csv file, this function helps append any dataframe to an existing csv file.
def append_df(self, path_file, df):
    with open(path_file, 'a+') as f:
        df.to_csv(f, header=f.tell() == 0, encoding='utf-8', index=False)

Notes:

a+ creates the file if it doesn't exist

f.tell() == 0 adds the header if it's the first row
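
A standalone variant of the same idea, with a usage example, might look like the sketch below. The function name `append_df_to_csv` is an illustrative assumption, and `newline=""` is added here (it is not in the answer's code) so that pandas controls the line endings, which avoids blank lines between rows on Windows.

```python
import pandas as pd


def append_df_to_csv(path_file, df):
    """Hypothetical standalone helper: append df to a CSV, writing the header only once."""
    # "a+" creates the file if it does not exist yet
    with open(path_file, "a+", newline="", encoding="utf-8") as f:
        # In append mode the position starts at the end of the file,
        # so tell() is expected to be 0 only when the file is empty
        df.to_csv(f, header=f.tell() == 0, index=False)
```

Calling it twice with frames that share the same columns should leave a single header row followed by the rows of both frames.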


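Your approach needs to read all of the existing Excel file and collect the data into a df, then add the new rows to the df, and finally write it back to Excel. If you work with very large Excel files, this may not be an ideal approach :( - ati ince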

0
Below is a very simple example showing how to append data to an existing Excel file.
# number of data rows, plus 1 for the header row = first free row (0-based)
last_row = pd.read_excel("output.xlsx").index.stop + 1
df1 = pd.DataFrame([["Aashutosh0012", "https://aashutosh.fly.dev"]], columns=["Header A", "Header B"])
with pd.ExcelWriter("output.xlsx", mode="a", engine="openpyxl", if_sheet_exists="overlay") as writer:
    df1.to_excel(writer, startrow=last_row, index=False, header=False)

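I tried many of the examples listed above, but none of them worked: unless you pass the startrow parameter to to_excel, the existing rows get overwritten.
Below is a function that creates an Excel file if it doesn't exist, and otherwise appends the data to the end of the existing Excel file.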
import os
import pandas as pd

def save_to_excel(data: list, excel_file="output.xlsx"):
    headers = ["Header A", "Header B"]
    data_df = pd.DataFrame(data, columns=headers)    
    try:
        # Check if the file exists or not
        file_exists = os.path.isfile(excel_file)        
        # Try to append data to an existing Excel file or create a new one
        if file_exists:
            # get last row postion in existing excel file
            last_row = pd.read_excel(excel_file).index.stop + 1 
            with pd.ExcelWriter(excel_file, mode="a", if_sheet_exists="overlay") as writer:
                data_df.to_excel(writer, startrow = last_row, index=False, header=False)
                print(f'Data appended to file {excel_file}.')
        else:
            with pd.ExcelWriter(excel_file, mode="w") as writer:
                data_df.to_excel(writer, index=False)
                print(f'New file {excel_file} created and data saved.')                    
    except PermissionError:
        print(f'Error: Permission denied while trying to access {excel_file}.')
    except Exception as e:
        print(f'An error occurred: {e}')
