Python code to convert a table into a dictionary

Question

3

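Problem: a command produces a table that is hard to process in a script.

Solution: convert the table into Python dictionaries so it can be used more efficiently. These tables can contain 1-20 different virtual drives, and attributes such as "Name" may not be set.

Example table: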

Virtual Drives = 4

VD LIST :
=======

----------------------------------------------------------
DG/VD TYPE   State Access Consist Cache sCC     Size Name
----------------------------------------------------------
0/0   RAID1  Optl  RW     No      RWTD  -   1.818 TB one
1/1   RAID1  Optl  RW     No      RWTD  -   1.818 TB two
2/2   RAID1  Optl  RW     No      RWTD  -   1.818 TB three
3/3   RAID1  Optl  RW     No      RWTD  -   1.818 TB four
4/4   RAID10 Reblg RW     No      RWTD  -   4.681 TB 
----------------------------------------------------------

Example dictionaries:

{"DG/VD":"0/0", "TYPE":"RAID1", "State":"Optl", "Access":"RW", "Consist":"No", "Cache":"RWTD", "sCC":"-", "Size":"1.818 TB", "Name":"one"}
{"DG/VD":"4/4", "TYPE":"RAID10", "State":"Reblg", "Access":"RW", "Consist":"No", "Cache":"RWTD", "sCC":"-", "Size":"4.681 TB", "Name":""}

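There will be four dictionaries, one for each virtual drive. How should this be solved?

I have some ideas. First, search for the table header and split it on whitespace to define one list. Second, find the virtual drives by matching "number/number" and split on whitespace to define a second list. However, "Size" needs special handling, because the space between the number and "TB" has to be ignored.

Then zip the two lists together to produce a dictionary. Is there a better way to process this text?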
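The idea described above can be sketched roughly as follows. This is only a minimal illustration: the sample strings are copied from the example table, `row_to_dict` is a hypothetical helper name, and it assumes that Size is always printed as a number followed by a unit and that Name contains no spaces.

```python
# Header and two rows copied from the example table in the question.
header_line = "DG/VD TYPE   State Access Consist Cache sCC     Size Name"
row_lines = [
    "0/0   RAID1  Optl  RW     No      RWTD  -   1.818 TB one",
    "4/4   RAID10 Reblg RW     No      RWTD  -   4.681 TB ",
]


def row_to_dict(header, line):
    """Zip one table row with the header names (hypothetical helper)."""
    tokens = line.split()
    # Size is printed as "<number> <unit>", so merge tokens 7 and 8 into one value.
    tokens[7:9] = [" ".join(tokens[7:9])]
    # Name may be empty; pad so that every header still gets a value.
    tokens += [""] * (len(header) - len(tokens))
    return dict(zip(header, tokens))


header = header_line.split()
drives = [row_to_dict(header, line) for line in row_lines]
print(drives[1]["Size"], repr(drives[1]["Name"]))
```

Merging the two Size tokens after splitting keeps the rest of the row on a plain whitespace split, so the column spacing does not matter.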

# Create a list of all the headers in the virtual disk table
get_table_header = r" /c0/vall show | awk '/^DG\/VD/'"
table_header_values = console_info(utility + get_table_header).split()

['DG/VD', 'TYPE', 'State', 'Access', 'Consist', 'Cache', 'sCC', 'Size', 'Name']

# Create a list of all virtual drives
get_virtual_drives = r" /c0/vall show | awk '/^[0-9]\/[0-9]/'"
virtual_drive_values = console_info(utility + get_virtual_drives).split()

["0/0", "RAID1", "Optl", "RW", "No", "RWTD", "-", "1.818", "TB", "0"]

- except UserError

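It's a pity the column headers aren't aligned in a uniform way, otherwise splitting on them would be easier. If only "Size" weren't, for some reason, aligned to the right of its column...

If you want these attributes tied to each drive, you need a list of drives in which each element is a dictionary.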
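The structure suggested in that comment could look roughly like this. It is a sketch only: the two records are the example dictionaries from the question, and the names `by_id` and `not_optimal` are made up for illustration.

```python
# One dictionary per virtual drive, collected in a list (records from the question).
drives = [
    {"DG/VD": "0/0", "TYPE": "RAID1", "State": "Optl", "Access": "RW", "Consist": "No",
     "Cache": "RWTD", "sCC": "-", "Size": "1.818 TB", "Name": "one"},
    {"DG/VD": "4/4", "TYPE": "RAID10", "State": "Reblg", "Access": "RW", "Consist": "No",
     "Cache": "RWTD", "sCC": "-", "Size": "4.681 TB", "Name": ""},
]

# Optional lookup table keyed by the DG/VD identifier.
by_id = {drive["DG/VD"]: drive for drive in drives}

# Example query: which drives are not in the "Optl" state?
not_optimal = [drive["DG/VD"] for drive in drives if drive["State"] != "Optl"]
print(not_optimal)
```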

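Get it working first, then improve it.

Is Size always present?

Yes, all nine headers will always be present. All of the values will be in the table too, except Name, which can show up as an empty field.

1

OK, the edited code should handle all cases.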

6 Answers

3

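One approach is to read the table into a pandas DataFrame and use its to_dict method to convert it into a dictionary.

This requires copying the original table to a file, removing the dashed lines, and then either adding an extra column to the header after Size, perhaps called Units, to hold the "TB" data, or adding no extra column and instead removing or replacing the space between each Size value and "TB" (perhaps with a dash).

The file can then be loaded into a pandas DataFrame (df) with pd.read_csv as a '\s+'-delimited csv file and converted to a dictionary with df.to_dict. Transposing first with df.T (the same as df.transpose()) and then calling to_dict keys the result by row; passing orient='index' to to_dict, as in the code below, gives the same result. Transposing a 2D table simply swaps its rows and columns, i.e. columns become rows and rows become columns.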
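The preprocessing step described above can also be done in code rather than by hand. The following is a minimal sketch using only the standard library; `add_units_column` is a hypothetical helper, and it assumes the input starts at the first dashed rule and that the header ends in "Size Name".

```python
raw = """\
----------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC     Size Name
----------------------------------------------------------
0/0   RAID1 Optl  RW     No      RWTD  -   1.818 TB one
1/1   RAID1 Optl  RW     No      RWTD  -   1.818 TB two
----------------------------------------------------------
"""


def add_units_column(text):
    """Drop dashed rules and add a 'Units' header after 'Size' (hypothetical helper)."""
    kept = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith("-")]
    header, rows = kept[0], kept[1:]
    # The rows already contain the unit as a separate token; only the header needs a name for it.
    header = header.replace("Size Name", "Size Units Name")
    return "\n".join([header] + rows) + "\n"


cleaned = add_units_column(raw)
print(cleaned)
```

The resulting text could be written to table.txt, or wrapped in io.StringIO and handed to pd.read_csv directly, since read_csv accepts file-like objects. Rows with an empty Name would still come out one token short, so this sketch only covers tables in which every Name is set.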

Starting with a table.txt that contains the following:

DG/VD TYPE  State Access Consist Cache sCC     Size Units Name
0/0   RAID1 Optl  RW     No      RWTD  -   1.818 TB one
1/1   RAID1 Optl  RW     No      RWTD  -   1.818 TB two
2/2   RAID1 Optl  RW     No      RWTD  -   1.818 TB three
3/3   RAID1 Optl  RW     No      RWTD  -   1.818 TB four

The following code converts it into a dictionary named table_dict:

import pandas as pd
table_dict = pd.read_csv('table.txt', sep=r'\s+').to_dict(orient='index')

import pprint
pprint.pprint(table_dict)

{0: {'Access': 'RW',
     'Cache': 'RWTD',
     'Consist': 'No',
     'DG/VD': '0/0',
     'Name': 'one',
     'Size': 1.818,
     'State': 'Optl',
     'TYPE': 'RAID1',
     'Units': 'TB',
     'sCC': '-'},
 1: {'Access': 'RW',
     'Cache': 'RWTD',
     'Consist': 'No',
     'DG/VD': '1/1',
     'Name': 'two',
     'Size': 1.818,
     'State': 'Optl',
     'TYPE': 'RAID1',
     'Units': 'TB',
     'sCC': '-'},
 2: {'Access': 'RW',
     'Cache': 'RWTD',
     'Consist': 'No',
     'DG/VD': '2/2',
     'Name': 'three',
     'Size': 1.818,
     'State': 'Optl',
     'TYPE': 'RAID1',
     'Units': 'TB',
     'sCC': '-'},
 3: {'Access': 'RW',
     'Cache': 'RWTD',
     'Consist': 'No',
     'DG/VD': '3/3',
     'Name': 'four',
     'Size': 1.818,
     'State': 'Optl',
     'TYPE': 'RAID1',
     'Units': 'TB',
     'sCC': '-'}}

pandas DataFrames have additional methods for converting to other formats, including JSON, HTML, SQL and more.

- user4322779

2

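You can use the struct module to parse the data in the rows of the table, as shown below. The results are stored in an OrderedDict to preserve the field order in the generated dictionaries, although that isn't required. The "Name" field doesn't have to be present.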

from __future__ import print_function
from collections import OrderedDict
import json  # for pretty-printing results
import struct
from textwrap import dedent

table = dedent("""
    Virtual Drives = 4

    VD LIST :
    =======

    ----------------------------------------------------------
    DG/VD TYPE  State Access Consist Cache sCC     Size Name
    ----------------------------------------------------------
    0/0   RAID1 Optl  RW     No      RWTD  -   1.818 TB one
    1/1   RAID1 Optl  RW     No      RWTD  -   1.818 TB two
    2/2   RAID1 Optl  RW     No      RWTD  -   1.818 TB three
    3/3   RAID1 Optl  RW     No      RWTD  -   1.818 TB four
    ----------------------------------------------------------
""")

# negative widths represent ignored padding fields
fieldwidths = 3, -3, 5, -1, 4, -2, 2, -5, 3, -5, 4, -2, 3, -1, 8, -1, 5
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                        for fw in fieldwidths)
fieldstruct = struct.Struct(fmtstring)
parse = fieldstruct.unpack_from

itable = iter(table.splitlines())
for line in itable:
    if line.startswith('-----'):
        break

fieldnames = next(itable).split()

for line in itable:
    if line.startswith('-----'):
        break

for line in itable:
    if line.startswith('-----'):
        break
    if len(line) < fieldstruct.size:
        line += ' ' * (fieldstruct.size - len(line))
    # struct operates on bytes in Python 3, so encode the line and decode each field
    fields = tuple(field.decode().strip() for field in parse(line.encode()))
    rec = OrderedDict(zip(fieldnames, fields))
    print(json.dumps(rec))

Output:

{"DG/VD": "0/0", "TYPE": "RAID1", "State": "Optl", "Access": "RW", 
 "Consist": "No", "Cache": "RWTD", "sCC": "-", "Size": "1.818 TB", 
 "Name": "one"}
{"DG/VD": "1/1", "TYPE": "RAID1", "State": "Optl", "Access": "RW", 
 "Consist": "No", "Cache": "RWTD", "sCC": "-", "Size": "1.818 TB", 
 "Name": "two"}
{"DG/VD": "2/2", "TYPE": "RAID1", "State": "Optl", "Access": "RW", 
 "Consist": "No", "Cache": "RWTD", "sCC": "-", "Size": "1.818 TB", 
 "Name": "three"}
{"DG/VD": "3/3", "TYPE": "RAID1", "State": "Optl", "Access": "RW", 
 "Consist": "No", "Cache": "RWTD", "sCC": "-", "Size": "1.818 TB", 
 "Name": "four"}
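To make the role of the field widths more concrete, the sketch below prints the struct format string they generate and unpacks a single row. It reuses the widths and the row from the code above; the only additions are the explicit encode and decode steps, which are needed because struct works on bytes in Python 3.

```python
import struct

# Same widths as in the code above; negative values are padding bytes to skip.
fieldwidths = 3, -3, 5, -1, 4, -2, 2, -5, 3, -5, 4, -2, 3, -1, 8, -1, 5
fmtstring = " ".join("{}{}".format(abs(fw), "x" if fw < 0 else "s") for fw in fieldwidths)
print(fmtstring)  # each "Ns" is a kept field, each "Nx" is skipped padding

fieldstruct = struct.Struct(fmtstring)
line = "0/0   RAID1 Optl  RW     No      RWTD  -   1.818 TB one"

# Pad short lines to the full struct size, then convert to bytes for unpacking.
padded = line.ljust(fieldstruct.size).encode("ascii")
fields = [f.decode("ascii").strip() for f in fieldstruct.unpack_from(padded)]
print(fields)
```

Because the widths are hard-coded, this only works when every row uses exactly the same column positions.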

- martineau

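Interesting use of struct.

1

@Padraic: Thanks. It's just a minor adaptation of my answer (https://dev59.com/aW445IYBdhLWcg3wWI2L#4915359) to the question [_Efficient way of parsing fixed width files in Python_](https://dev59.com/aW445IYBdhLWcg3wWI2L), changed to handle the variable line lengths.

I think it should be ..., "sCC": "-", "Size": "1.818 TB", "Name": "one". This output shows different values for sCC and Size, and has no Name.

@Brent: Oops, I mistakenly merged the "sCC" and "Size" fields together. Thanks for pointing out the problem.

Although this would work, I revised the proposed solution to indicate that the table's values are variable and do not have a fixed width.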

0

You could try something like this:

import re

lines = re.split("\n", data)

for line in lines[8:-1]:
    fields = re.split("  +",line)
    print(fields)

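where data contains your table. The logic is to split the table into individual lines, then split each line into fields using two or more spaces as the separator (note the re.split("  +", line)). The trick is to start from line 8 and stop before the last line.

Once each line has been split into a list of fields, building the dictionary is straightforward.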
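Note that in the example table some columns are separated by a single space (for instance RAID10 and Reblg, or TB and the name), so a two-space split alone will not yield one field per column. As an alternative sketch, and not the split used above, one regular expression with a group per column can match a whole row. The pattern and the set of size units are assumptions made for illustration.

```python
import re

header = "DG/VD TYPE   State Access Consist Cache sCC     Size Name".split()

# One group per column; Size is "<number> <unit>" and Name may be empty.
ROW = re.compile(
    r"^(\d+/\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+"
    r"(\d+(?:\.\d+)?\s+[KMGTP]?B)\s*(.*?)\s*$"  # unit list is an assumption
)

lines = [
    "0/0   RAID1  Optl  RW     No      RWTD  -   1.818 TB one",
    "4/4   RAID10 Reblg RW     No      RWTD  -   4.681 TB ",
]

records = []
for line in lines:
    match = ROW.match(line)
    if match:  # silently skip anything that is not a drive row
        records.append(dict(zip(header, match.groups())))

print(records)
```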

- Gianluca

-1

import re

table = '''Virtual Drives = 4

VD LIST :
=======

----------------------------------------------------------
DG/VD TYPE  State Access Consist Cache sCC     Size Name
----------------------------------------------------------
0/0   RAID1 Optl  RW     No      RWTD  -   1.818 TB one
1/1   RAID1 Optl  RW     No      RWTD  -   1.818 TB two
2/2   RAID1 Optl  RW     No      RWTD  -   1.818 TB three
3/3   RAID1 Optl  RW     No      RWTD  -   1.818 TB four
----------------------------------------------------------'''
table = table.split('\n')

result = []
header = None
divider_pattern = re.compile('^[-]{20,}$')
for i, row in enumerate(table):
    row = row.strip()
    if divider_pattern.match(row) and not header:
        header = table[i + 1].split()
        continue
    if header and not divider_pattern.match(row):
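        row = row.split()
        if row != header:
            # Size is printed as "<number> <unit>"; merge those two tokens into one field
            row[7:9] = [' '.join(row[7:9])]
            result.append(dict(zip(header, row)))

print(result)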

- Cody Bouche

-1

There is a Python library that can do this: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html

 import pandas as pd

 data = pd.read_fwf('filename.txt')

But you have to preprocess the table into this format:

DG/VD TYPE  State Access Consist Cache sCC     Size Units Name
0/0   RAID1 Optl  RW     No      RWTD  -   1.818 TB one
1/1   RAID1 Optl  RW     No      RWTD  -   1.818 TB two
2/2   RAID1 Optl  RW     No      RWTD  -   1.818 TB three
3/3   RAID1 Optl  RW     No      RWTD  -   1.818 TB four

- Davoud Taghawi-Nejad

- Padraic Cunningham · Accepted Answer

from itertools import dropwhile, takewhile
with open("test.txt") as f:
    dp = dropwhile(lambda x: not x.startswith("-"), f)
    next(dp)  # skip ----
    names = next(dp).split()  # get headers names
    next(f)  # skip -----
    out = []
    for line in takewhile(lambda x: not x.startswith("-"), f):
        a, b = line.rsplit(None, 1)
        out.append(dict(zip(names, a.split(None, 7) + [b])))

Output:

from pprint import  pprint as pp

pp(out)
[{'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '0/0',
  'Name': 'one',
  'Size': '1.818 TB',
  'State': 'Optl',
  'TYPE': 'RAID1',
  'sCC': '-'},
 {'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '1/1',
  'Name': 'two',
  'Size': '1.818 TB',
  'State': 'Optl',
  'TYPE': 'RAID1',
  'sCC': '-'},
 {'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '2/2',
  'Name': 'three',
  'Size': '1.818 TB',
  'State': 'Optl',
  'TYPE': 'RAID1',
  'sCC': '-'},
 {'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '3/3',
  'Name': 'four',
  'Size': '1.818 TB',
  'State': 'Optl',
  'TYPE': 'RAID1',
  'sCC': '-'}]

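If you want to preserve the order, use an OrderedDict:

from collections import OrderedDict

rows = (line.rsplit(None, 1) for line in takewhile(lambda x: not x.startswith("-"), f))
out = [OrderedDict(zip(names, a.split(None, 7) + [b])) for a, b in rows]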

Based on your edit, this handles the case where the Name value is missing:

from itertools import dropwhile, takewhile

with open("test.txt") as f:
    dp = dropwhile(lambda x: not x.startswith("-"), f)
    next(dp)  # skip ----
    names = next(dp).split()  # get headers names
    next(f)  # skip -----
    out = []
    for line in takewhile(lambda x: not x.startswith("-"), f):
        a, b = line.rsplit(" ", 1)
        out.append(dict(zip(names,  a.rstrip().split(None, 7) + [b.rstrip()])))

Output:

[{'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '0/0',
  'Name': 'one',
  'Size': '1.818 TB',
  'State': 'Optl',
  'TYPE': 'RAID1',
  'sCC': '-'},
 {'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '1/1',
  'Name': 'two',
  'Size': '1.818 TB',
  'State': 'Optl',
  'TYPE': 'RAID1',
  'sCC': '-'},
 {'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '2/2',
  'Name': 'three',
  'Size': '1.818 TB',
  'State': 'Optl',
  'TYPE': 'RAID1',
  'sCC': '-'},
 {'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '3/3',
  'Name': 'four',
  'Size': '1.818 TB',
  'State': 'Optl',
  'TYPE': 'RAID1',
  'sCC': '-'},
 {'Access': 'RW',
  'Cache': 'RWTD',
  'Consist': 'No',
  'DG/VD': '4/4',
  'Name': '',
  'Size': '4.681 TB',
  'State': 'Reblg',
  'TYPE': 'RAID10',
  'sCC': '-'}]

This also handles lines with several spaces between TB and the Name column value, e.g. 1.818 TB     one
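Since the command output in the question arrives as a string rather than a file, the same logic can be wrapped in a function that takes any iterable of lines. This is a sketch under that assumption: `parse_vd_table` is a hypothetical name and the sample text is a shortened copy of the table from the question. Like the code above, it relies on a row with an empty Name ending in a space after "TB".

```python
from itertools import dropwhile, takewhile

text = "\n".join([
    "Virtual Drives = 4",
    "",
    "VD LIST :",
    "=======",
    "",
    "----------------------------------------------------------",
    "DG/VD TYPE   State Access Consist Cache sCC     Size Name",
    "----------------------------------------------------------",
    "0/0   RAID1  Optl  RW     No      RWTD  -   1.818 TB one",
    "4/4   RAID10 Reblg RW     No      RWTD  -   4.681 TB ",  # empty Name, trailing space
    "----------------------------------------------------------",
])


def parse_vd_table(lines):
    """Return one dict per virtual drive from an iterable of lines (hypothetical helper)."""
    dp = dropwhile(lambda x: not x.startswith("-"), iter(lines))
    next(dp)                  # skip the rule above the header
    names = next(dp).split()  # header names
    next(dp)                  # skip the rule below the header
    out = []
    for line in takewhile(lambda x: not x.startswith("-"), dp):
        a, b = line.rsplit(" ", 1)  # b is the Name, or empty when it is not set
        out.append(dict(zip(names, a.rstrip().split(None, 7) + [b.rstrip()])))
    return out


drives = parse_vd_table(text.splitlines())
print(drives)
```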