openpyxl如何从现有数据表中读取表格的示例?

9
在openpyxl文档中,有一个将表格放入工作簿的示例,但没有关于如何找回工作簿中表格的示例。我有一个XLS文件,其中包含命名的表格,我想打开文件,找到所有的表格并解析它们。我找不到任何关于如何做到这一点的文档。有人可以帮忙吗?
同时,我已经解决了这个问题,并编写了以下类来使用openpyxl。
class NamedArray(object):

    ''' Excel Named range object

        Reproduces the named range feature of Microsoft Excel
        Assumes a definition in the form <Worksheet PinList!$A$6:$A$52 provided by openpyxl
        Written for use with, and initialised by the get_names function
        After initialisation named array can be used in the same way as for VBA in excel
        Written for openpyxl version 2.4.1, may not work with earlier versions 
    '''

    C_CAPS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'   

    def __init__(self, wb, named_range_raw):
        ''' Initialise a NameArray object from the named_range_raw information in the given workbook

        '''
        self.sheet, cellrange_str = str(named_range_raw).split('!')
        self.sheet = self.sheet.replace("'",'') # remove the single quotes if they exist
        self.loc = wb[self.sheet]

        if ':' in cellrange_str:
            self.has_range = True
            self.has_value = False
            lo, hi = cellrange_str.split(':')
            self.ad_lo = lo.replace('$','')
            self.ad_hi = hi.replace('$','')
        else:
            self.has_range = False
            self.has_value = True
            self.ad_lo = cellrange_str.replace('$','')
            self.ad_hi = self.ad_lo

        self.row = self.get_row(self.ad_lo) 
        self.max_row = self.get_row(self.ad_hi)
        self.rows = self.max_row - self.row + 1
        self.min_col = self.col_to_n(self.ad_lo)
        self.max_col = self.col_to_n(self.ad_hi)
        self.cols    = self.max_col - self.min_col + 1


    def size_of(self):
        ''' Returns two dimensional size of named space
        '''
        return self.cols, self.rows 

    def value(self, row=1, col=1):
       ''' Returns the value at row, col
       '''
       assert row <= self.rows , 'invalid row number given'
       assert col <= self.cols , 'invalid column number given'
       return self.loc.cell(self.n_to_col(self.min_col + col-1)+str(self.row + row-1)).value    


    def __str__(self):
        ''' printed description of named space
        '''
        locs = 's ' + self.ad_lo + ':' + self.ad_hi if self.is_range else ' ' + self.ad_lo 
        return('named range'+ str(self.size_of()) + ' in sheet ' + self.sheet + ' @ location' + locs)  


    def __contains__(self, val):
        rval = False
        for row in range(1,self.rows+1):
            for col in range(1,self.cols+1):
                if self.value(row,col) == val:
                    rval = True
        return rval


    def vlookup(self, key, col):
        ''' excel style vlookup function
        '''
        assert col <= self.cols , 'invalid column number given'
        rval = None
        for row in range(1,self.rows+1):
            if self.value(row,1) == key:
                rval = self.value(row, col)
                break
        return rval


    def hlookup(self, key, row):
        ''' excel style hlookup function
        '''
        assert row <= self.rows , 'invalid row number given'
        rval = None
        for col in range(1,self.cols+1):
            if self.value(1,col) == key:
                rval = self.value(row, col)
                break
        return rval

    @classmethod
    def get_row(cls, ad):
        ''' get row number from cell string
        Cell string is assumed to be in excel format i.e "ABC123" where row is 123
        '''
        row = 0
        for l in ad:
            if l in "1234567890":
                row = row*10 + int(l)
        return row

    @classmethod
    def col_to_n(cls, ad):
        ''' find column number from xl address
            Cell string is assumed to be in excel format i.e "ABC123" where column is abc
            column number is integer represenation i.e.(A-A)*26*26 + (B-A)*26 + (C-A)
        '''
        n = 0
        for l in ad:
            if l in cls.C_CAPS:
                n = n*26 + cls.C_CAPS.find(l)+1
        return n

    @classmethod
    def n_to_col(cls, n):
        ''' make xl column address from column number
        '''
        ad = ''
        while n > 0:
            ad = cls.C_CAPS[n%26-1] + ad  
            n = n // 26
        return ad



def get_names(workbook, filt='', debug=False):
    ''' Create a structure containing all of the names in the given workbook

        filt is an optional parameter and used to create a subset of names starting with filt
        useful for IO_ring_spreadsheet as all names start with 'n_'
        if present, filt characters are stipped off the front of the name
    '''
    named_ranges = workbook.defined_names.definedName
    name_list = {}

    for named_range in named_ranges:
        name = named_range.name
        if named_range.attr_text.startswith('#REF'):
            print('WARNING: named range "', name, '" is undefined')
        elif filt == '' or name.startswith(filt):
            name_list[name[len(filt):]] = NamedArray(workbook, named_range.attr_text)

    if debug:
        with open("H:\\names.txt",'w') as log:
            for item in name_list:
                print (item, '=', name_list[item])
                log.write(item.ljust(30) + ' = ' + str(name_list[item])+'\n')

    return name_list
5个回答

12

我同意文档并没有真正帮助,公共API似乎也只有add_table()方法。 但后来我发现了一个openpyxl的Issue 844,请求更好的接口,并且它显示工作表有一个_tables属性。

这足以获取文件中所有表格的列表,以及一些基本属性:

from openpyxl import load_workbook
wb = load_workbook(filename = 'test.xlsx')
for ws in wb.worksheets:
    print("Worksheet %s include %d tables:" % (ws.title, len(ws._tables)))
    for tbl in ws._tables:
        print(" : " + tbl.displayName)
        print("   -  name = " + tbl.name)
        print("   -  type = " + (tbl.tableType if isinstance(tbl.tableType, str) else 'n/a')
        print("   - range = " + tbl.ref)
        print("   - #cols = %d" % len(tbl.tableColumns))
        for col in tbl.tableColumns:
            print("     : " + col.name)
请注意,对于`tableType`,if/else 结构是必需的,因为它可以返回 `NoneType`(用于标准表),而该类型无法转换为 `str`。

1
谢谢这个。看起来_tables应该是私有数据(因此有'_'),但我会尝试一下。最终我编写了自己的解析类,使用wb.defined_names.definedName。 - Stephen Ellwood
嘿@StephenEllwood,你介意将你的类作为“编辑”分享到你当前的问题中吗?非常感谢,@MichalKaut! - Sean McCarthy
1
@SeanMcCarthy,如您所请求,我已经发布了我的代码。请随意改进它。 - Stephen Ellwood
2020年已经到来。现在似乎出了一些问题。i = iter(ws._tables); r = next(i); type(r);的结果是一个字符串。看起来这些方法/属性不可用。 - Jamie Marshall
请注意,这个问题已经在最新版本的openpyxl(3.0.4)中得到解决。 - Jan

11

在@MichalKaut的回答基础上,我创建了一个简单的函数,它返回一个字典,其中包含给定工作簿中的所有表格。它还将每个表格的数据放入Pandas DataFrame中。

from openpyxl import load_workbook
import pandas as pd

def get_all_tables(filename):
    """ Get all tables from a given workbook. Returns a dictionary of tables. 
        Requires a filename, which includes the file path and filename. """
    
    # Load the workbook, from the filename, setting read_only to False
    wb = load_workbook(filename=file, read_only=False, keep_vba=False, data_only=True, keep_links=False)

    # Initialize the dictionary of tables
    tables_dict = {}

    # Go through each worksheet in the workbook
    for ws_name in wb.sheetnames:
        print("")
        print(f"worksheet name: {ws_name}")
        ws = wb[ws_name]
        print(f"tables in worksheet: {len(ws.tables)}")

        # Get each table in the worksheet
        for tbl in ws.tables.values():
            print(f"table name: {tbl.name}")
            # First, add some info about the table to the dictionary
            tables_dict[tbl.name] = {
                    'table_name': tbl.name,
                    'worksheet': ws_name,
                    'num_cols': len(tbl.tableColumns),
                    'table_range': tbl.ref}

            # Grab the 'data' from the table
            data = ws[tbl.ref]

            # Now convert the table 'data' to a Pandas DataFrame
            # First get a list of all rows, including the first header row
            rows_list = []
            for row in data:
                # Get a list of all columns in each row
                cols = []
                for col in row:
                    cols.append(col.value)
                rows_list.append(cols)

            # Create a pandas dataframe from the rows_list. 
            # The first row is the column names
            df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])

            # Add the dataframe to the dictionary of tables
            tables_dict[tbl.name]['dataframe'] = df

    return tables_dict
            
# File location:
file = r"C:\Users\sean\spreadsheets\full_of_tables.xlsx"

# Run the function to return a dictionary of all tables in the Excel workbook
tables_dict = get_all_tables(filename=file)

1
请注意,表格列表(ws._tables)只有在以read_only=False模式打开文件时才可访问。 - Marc Meketon
2
这非常有用。感谢@Sean McCarthy和@Michal Kaut。请注意,只有在以read_only=False模式打开文件时才能访问表格列表(ws._tables)。此外,请注意@Sean McCarthy同时使用ws._tablesws.tables。最好在两个地方都使用ws.tables - Marc Meketon
@MarcMeketon 谢谢。我已经采纳了你的意见。 :) - Sean McCarthy

2

答案已经更新。

现在的ws对象包含了表格访问器,它充当字典的作用。更新后的答案是:

tmp = [ws.tables for ws in wb.worksheets]
tbls = [{v.name:v} for t in tmp for v in t.values()]

在openpyxl 3.1.1中,Table对象没有values()元素。 - Georg W.
@GeorgW. - 我不认为这里的t是一个表对象。 如果我记得正确,它是一种不同的类型。 你试过上面的代码了吗? - Jamie Marshall
1
是的,我尝试过了,它是一个“Table”对象。这就是为什么这段代码不能在v3.1.1中工作的原因。 - Georg W.

-2

我不认为这是可能的。它似乎与图像类似地工作;如果您读取并保存带有表格的文件,它将被剥离。


1
原始问题涉及从xlsx或xlsm文件中读取表格。没有必要将它们写回。我想使用这些表格,因为它们实际上是一个命名范围,所以我知道表格中有多少行。另一种选择是查找空单元格。 - Stephen Ellwood

-2

我不确定您所说的解析是什么意思,但自版本2.4.4以来,对工作表表格的读取支持已经成为可能(链接)。如果您对细节有疑问,我建议您在openpyxl邮件列表上提出问题,因为那里更适合进行此类讨论。


1
一个例子会很有用。一个示例链接稍微有点用,但是一个变更日志链接只是告诉我们可能是可能的,这是我们已经怀疑的事情。 - iggie

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接