How to efficiently parse fixed width files?

91
I am trying to find an efficient way of parsing files that hold fixed-width lines. For example, the first 20 characters represent one column, characters 21:30 another one, and so on.
Assuming each line holds 100 characters, what would be an efficient way to parse a line into its components?
I could use string slicing on each line, but it gets a little ugly if the line is long. Are there any other fast methods?
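For reference, the plain slicing mentioned above looks something like this (the column boundaries are just the ones from the example above; the data is dummy):

line = 'x' * 100            # one 100-character record (dummy data)
col1 = line[0:20]           # first 20 characters
col2 = line[20:30]          # characters 21-30
# ...and so on for the remaining columns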
11 Answers

80

It's also pretty easy and reasonably fast to use the struct module of the Python standard library, since it's written in C. The code below shows how to use it. It also allows columns of characters to be skipped by specifying negative values for the number of characters in a field.

import struct

fieldwidths = (2, -10, 24)
fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's') for fw in fieldwidths)

# Convert Unicode input to bytes and the result back to Unicode string.
unpack = struct.Struct(fmtstring).unpack_from  # Alias.
parse = lambda line: tuple(s.decode() for s in unpack(line.encode()))

print('fmtstring: {!r}, record size: {} chars'.format(fmtstring, struct.calcsize(fmtstring)))

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fields = parse(line)
print('fields: {}'.format(fields))

Output:

fmtstring: '2s 10x 24s', record size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

Here's a way to do it with string slices, as you were considering but were worried might get too ugly. It is somewhat complicated, and speed-wise it's about the same as the version based on the struct module, although I have an idea about how it could be sped up (which might make the extra complexity worthwhile). See the update below on that topic.

from itertools import zip_longest
from itertools import accumulate

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths) # bool values for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                                for fw in fieldwidths)
    return parse

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)  # negative widths represent ignored padding fields
parse = make_parser(fieldwidths)
fields = parse(line)
print('format: {!r}, rec size: {} chars'.format(parse.fmtstring, parse.size))
print('fields: {}'.format(fields))

Output:

format: '2s 10x 24s', rec size: 36 chars
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

Update

As I suspected, there is a way to make the string-slices version of the code faster. In Python 2.7 it makes it about the same speed as the struct-based version, but in Python 3.x it makes it 233% faster (as well as faster than the un-optimized version of itself, which is about the same speed as the struct version).

What the version shown above does is define a lambda function that's essentially a comprehension which generates the limits of a bunch of slices at runtime:

parse = lambda line: tuple(line[i:j] for pad, i, j in flds if not pad)

Depending on the values of i and j in the for loop, this is equivalent to a statement like the following:

parse = lambda line: (line[0:2], line[12:36], line[36:51], ...)

However, since the slice boundaries are all constants, the latter executes more than twice as fast as the former.

Fortunately, it's relatively easy to convert and "compile" the former into the latter using the built-in eval() function:

def make_parser(fieldwidths):
    cuts = tuple(cut for cut in accumulate(abs(fw) for fw in fieldwidths))
    pads = tuple(fw < 0 for fw in fieldwidths) # bool flags for padding fields
    flds = tuple(zip_longest(pads, (0,)+cuts, cuts))[:-1]  # ignore final one
    slcs = ', '.join('line[{}:{}]'.format(i, j) for pad, i, j in flds if not pad)
    parse = eval('lambda line: ({})\n'.format(slcs))  # Create and compile source code.
    # Optional informational function attributes.
    parse.size = sum(abs(fw) for fw in fieldwidths)
    parse.fmtstring = ' '.join('{}{}'.format(abs(fw), 'x' if fw < 0 else 's')
                                                for fw in fieldwidths)
    return parse
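To check the relative speed on your own machine, a rough comparison along these lines could be used (a minimal sketch of mine, not part of the original measurements; it simply times whichever make_parser variant is currently defined):

import timeit

line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
parse = make_parser((2, -10, 24))
# Time a million calls and compare the number printed by each variant.
print(timeit.timeit(lambda: parse(line), number=1000000))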

3
What about Unicode or UTF-8 encoded strings? struct.unpack seems to handle binary data only. I can't get this to work. - Reiner Gerecke
3
@Reiner Gerecke: The struct module is designed to work with binary data. Files with fixed-width fields are legacy jobs which very likely pre-date UTF-8 (in mindset, if not in chronology). Bytes read from files are binary data. There is no Unicode in files. You need to decode the bytes to get Unicode. - John Machin
1
@Reiner Gerecke: To clarify: in those legacy file formats, each field is a fixed number of bytes, not a fixed number of characters. Although unlikely to be encoded in UTF-8, they can be encoded in an encoding where each character has a variable number of bytes, e.g. gbk, big5, euc-jp, shift-jis, etc. If you want to work in Unicode, you can't decode the whole record at once; you need to decode each field individually. - John Machin
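To illustrate that point (this example is mine, not from the comment): with a variable-byte-width encoding such as Shift-JIS, the record is kept as bytes, unpacked by byte counts, and only then is each field decoded:

import struct

# Hypothetical layout: a 6-byte name field followed by a 10-byte city field.
name, city = '太郎', '東京'
record = name.encode('shift-jis').ljust(6) + city.encode('shift-jis').ljust(10)

unpack = struct.Struct('6s10s').unpack_from
fields = tuple(raw.decode('shift-jis').rstrip() for raw in unpack(record))
print(fields)   # ('太郎', '東京')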
2
This falls apart completely when you try to apply it to Unicode values (as in Python 3) with text beyond the ASCII character set, where "fixed width" means "a fixed number of characters", not bytes. - Martijn Pieters
1
@martineau: Thanks for the update! Yes, with slicing this code should work per code point rather than per byte. :-) - Martijn Pieters

75

I'm not really sure if this is efficient, but it should be readable (as opposed to doing the slicing manually). I defined a function slices that takes a string plus the column lengths, and returns the substrings. I made it a generator, so for really long lines it doesn't build a temporary list of substrings.

def slices(s, *args):
    position = 0
    for length in args:
        yield s[position:position + length]
        position += length

Example

In [32]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2))
Out[32]: ['ab']

In [33]: list(slices('abcdefghijklmnopqrstuvwxyz0123456789', 2, 10, 50))
Out[33]: ['ab', 'cdefghijkl', 'mnopqrstuvwxyz0123456789']

In [51]: d,c,h = slices('dogcathouse', 3, 3, 5)
In [52]: d,c,h
Out[52]: ('dog', 'cat', 'house')

But I think the advantage of a generator is lost if you need all the columns at once. Where it can be a benefit is when you want to process the columns one by one, say in a loop.
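For instance, a small sketch of that column-by-column style (the widths are just the ones from the examples above):

line = 'abcdefghijklmnopqrstuvwxyz0123456789'
for column in slices(line, 2, 10, 24):
    print(column)   # handle one column at a time; no temporary list is built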


2
As far as I can tell, this method is slower than struct, but it is more readable and easier to handle. I ran some tests using your slices function, the struct module and the re module, and it turns out that for big files struct is the fastest, re comes second (1.5x slower) and slices third (2x slower). There is a small overhead to using struct, though, so your slices function may be faster on smaller files. - YeO

34

There are two options that are simpler and nicer than the solutions already mentioned:

The first is to use pandas:

import pandas as pd

path = 'filename.txt'

#inferred - as suggested in the comments by James Paul Mason
data = pd.read_fwf(path, colspecs='infer')

# Or using Pandas with a column specification
col_specification = [(0, 20), (21, 30), (31, 50), (51, 100)]
data = pd.read_fwf(path, colspecs=col_specification)
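If the file has no header row, column names can be supplied as well, since read_fwf accepts the usual read_csv-style keywords such as names and header (the names below are just placeholders):

# Hypothetical column names for the four colspecs above.
data = pd.read_fwf(path, colspecs=col_specification, header=None,
                   names=['col_a', 'col_b', 'col_c', 'col_d'])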

The second option uses numpy.loadtxt:

import numpy as np

# Using NumPy and letting it figure it out automagically
data_also = np.loadtxt(path)
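Note that np.loadtxt splits on whitespace by default; if the columns really are fixed width, np.genfromtxt can be given the field widths directly through delimiter (a sketch, with widths based on the question's layout):

# A sequence of ints as delimiter means fixed field widths.
data_fw = np.genfromtxt(path, delimiter=(20, 10, 20, 50), dtype=None, encoding='utf-8')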

It really depends on how you want to use your data.


Is this answer competitive with the accepted answer in terms of speed? - asachet
1
I haven't tested it, but it should be a lot faster than the accepted answer. - Tom M
2
Pandas can auto-detect the columns if you set colspecs='infer'. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_fwf.html - James Paul Mason
Thanks, that option didn't seem to exist when I answered; I'll update the answer. - Tom M
The question doesn't specify an efficiency metric, and this answer is surely the most efficient in human terms, i.e. readability and code complexity. - undefined

15

The code below gives a sketch of what you might want to do if you have some serious fixed-column-width file handling to do.

"严格" = 多个记录类型在多个文件类型中,每个记录最多1000字节,布局定义者和"对抗"生产者/消费者是一个带有态度的政府部门,布局更改导致未使用的列,一个文件中最多有一百万条记录...

Features: Precompiled struct format. Ignores unwanted columns. Converts input strings to the required data types (the sketch omits error handling). Converts records to object instances (or dicts, or named tuples if you prefer).

Code (Python 2):

import struct, datetime, io, pprint

# functions for converting input fields to usable data
cnv_text = str.rstrip
cnv_int = int
cnv_date_dmy = lambda s: datetime.datetime.strptime(s, "%d%m%Y") # ddmmyyyy
# etc

# field specs (field name, start pos (1-relative), len, converter func)
fieldspecs = [
    ('surname', 11, 20, cnv_text),
    ('given_names', 31, 20, cnv_text),
    ('birth_date', 51, 8, cnv_date_dmy),
    ('start_date', 71, 8, cnv_date_dmy),
    ]

fieldspecs.sort(key=lambda x: x[1]) # just in case

# build the format for struct.unpack
unpack_len = 0
unpack_fmt = ""
for fieldspec in fieldspecs:
    start = fieldspec[1] - 1
    end = start + fieldspec[2]
    if start > unpack_len:
        unpack_fmt += str(start - unpack_len) + "x"
    unpack_fmt += str(end - start) + "s"
    unpack_len = end
field_indices = range(len(fieldspecs))
print unpack_len, unpack_fmt
unpacker = struct.Struct(unpack_fmt).unpack_from

class Record(object):
    pass
    # or use named tuples

raw_data = """\
....v....1....v....2....v....3....v....4....v....5....v....6....v....7....v....8
          Featherstonehaugh   Algernon Marmaduke  31121969            01012005XX
"""

f = io.BytesIO(raw_data)
headings = f.next()
for line in f:
    # The guts of this loop would of course be hidden away in a function/method
    # and could be made less ugly
    raw_fields = unpacker(line)
    r = Record()
    for x in field_indices:
        setattr(r, fieldspecs[x][0], fieldspecs[x][3](raw_fields[x]))
    pprint.pprint(r.__dict__)
    print "Customer name:", r.given_names, r.surname

Output:

78 10x20s20s8s12x8s
{'birth_date': datetime.datetime(1969, 12, 31, 0, 0),
 'given_names': 'Algernon Marmaduke',
 'start_date': datetime.datetime(2005, 1, 1, 0, 0),
 'surname': 'Featherstonehaugh'}
Customer name: Algernon Marmaduke Featherstonehaugh
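As the comment in the Record class hints, a namedtuple could be used instead of an attribute bag; a minimal sketch of that variation, reusing fieldspecs, unpacker and field_indices from above (the type name PersonRecord is made up):

from collections import namedtuple

# Hypothetical namedtuple type built from the field names in fieldspecs.
PersonRecord = namedtuple('PersonRecord', [spec[0] for spec in fieldspecs])

# Inside the "for line in f:" loop, instead of building a Record instance:
raw_fields = unpacker(line)
r = PersonRecord._make(fieldspecs[x][3](raw_fields[x]) for x in field_indices)
print "Customer name:", r.given_names, r.surname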

How would I update this code to parse records larger than 1000 bytes? I'm running into this error: struct.error: unpack_from requires a buffer of at least 1157 bytes - chris__allen

4
>>> s = '1234567890'
>>> w = [0, 2, 5, 7, 10]
>>> [s[w[i-1]:w[i]] for i in range(1, len(w))]
['12', '345', '67', '890']

4
Here is how I solved it with a dictionary that contains the start and end position of each field. Specifying start and end points helps me manage changes in the length of a column.
# fixed length
#      '---------- ------- ----------- -----------'
line = '20.06.2019 myname  active      mydevice   '
SLICES = {'date_start': 0,
          'date_end': 10,
          'name_start': 11,
          'name_end': 18,
          'status_start': 19,
          'status_end': 30,
          'device_start': 31,
          'device_end': 42}

def get_values_as_dict(line, SLICES):
    values = {}
    key_list = {key.split("_")[0] for key in SLICES.keys()}
    for key in key_list:
        values[key] = line[SLICES[key + "_start"]:SLICES[key + "_end"]].strip()
    return values

>>> print (get_values_as_dict(line,SLICES))
{'status': 'active', 'name': 'myname', 'date': '20.06.2019', 'device': 'mydevice'}

2
Here is what NumPy uses under the hood (much simplified, but still: this code can be found in the LineSplitter class within the _iotools module):

import numpy as np

DELIMITER = (20, 10, 10, 20, 10, 10, 20)

idx = np.cumsum([0] + list(DELIMITER))
slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]

def parse(line):
    return [line[s] for s in slices]

It does not handle negative delimiters for ignoring a column, so it is not as versatile as struct, but it is faster.
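If skipping columns is needed, the same idea could be extended by filtering out negative widths, roughly like this (my own variation, not part of NumPy):

import numpy as np

FIELDWIDTHS = (2, -10, 24)   # negative width = ignored padding, same convention as above

idx = np.cumsum([0] + [abs(w) for w in FIELDWIDTHS])
slices = [slice(i, j) for w, i, j in zip(FIELDWIDTHS, idx[:-1], idx[1:]) if w > 0]

def parse(line):
    return [line[s] for s in slices]

print(parse('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'))   # ['AB', 'MNOPQRSTUVWXYZ0123456789']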


2
Here is a simple Python 3 module based on John Machin's answer - adjust as needed :)
"""
fixedwidth

Parse and iterate through a fixedwidth text file, returning record objects.

Adapted from https://dev59.com/aW445IYBdhLWcg3wWI2L#4916375


USAGE

    import fixedwidth, pprint

    # define the fixed width fields we want
    # fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define the fieldtype conversion functions
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': int,
    }

    # iterate over record objects in the file
    with open(path, 'rb') as f:
        for record in fixedwidth.reader(f, fieldspecs, fieldtype_fns):
            pprint.pprint(record.__dict__)

    # output:
    {'FILEID': 'SF1ST', 'LOGRECNO': 2, 'POP100': 1, 'STUSAB': 'TX', 'SUMLEV': '040'}
    {'FILEID': 'SF1ST', 'LOGRECNO': 3, 'POP100': 2, 'STUSAB': 'TX', 'SUMLEV': '040'}    
    ...

"""

import struct, io


# fieldspec columns
iName, iDescription, iStart, iWidth, iType = range(5)


def get_struct_unpacker(fieldspecs):
    """
    Build the format string for struct.unpack to use, based on the fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    Returns a string like "6s2s3s7x7s4x9s".
    """
    unpack_len = 0
    unpack_fmt = ""
    for fieldspec in fieldspecs:
        start = fieldspec[iStart] - 1
        end = start + fieldspec[iWidth]
        if start > unpack_len:
            unpack_fmt += str(start - unpack_len) + "x"
        unpack_fmt += str(end - start) + "s"
        unpack_len = end
    struct_unpacker = struct.Struct(unpack_fmt).unpack_from
    return struct_unpacker


class Record(object):
    pass
    # or use named tuples


def reader(f, fieldspecs, fieldtype_fns):
    """
    Wrap a fixedwidth file and return records according to the given fieldspecs.
    fieldspecs is a list of [name, description, start, width, type] arrays.
    fieldtype_fns is a dictionary of functions used to transform the raw string values, 
    one for each type.
    """

    # make sure fieldspecs are sorted properly
    fieldspecs.sort(key=lambda fieldspec: fieldspec[iStart])

    struct_unpacker = get_struct_unpacker(fieldspecs)

    field_indices = range(len(fieldspecs))

    for line in f:
        raw_fields = struct_unpacker(line) # split line into field values
        record = Record()
        for i in field_indices:
            fieldspec = fieldspecs[i]
            fieldname = fieldspec[iName]
            s = raw_fields[i].decode() # convert raw bytes to a string
            fn = fieldtype_fns[fieldspec[iType]] # get conversion function
            value = fn(s) # convert string to value (eg to an int)
            setattr(record, fieldname, value)
        yield record


if __name__=='__main__':

    # test module

    import pprint, io

    # define the fields we want
    # fieldspecs are [name, description, start, width, type]
    fieldspecs = [
        ["FILEID", "File Identification", 1, 6, "A/N"],
        ["STUSAB", "State/U.S. Abbreviation (USPS)", 7, 2, "A"],
        ["SUMLEV", "Summary Level", 9, 3, "A/N"],
        ["LOGRECNO", "Logical Record Number", 19, 7, "N"],
        ["POP100", "Population Count (100%)", 30, 9, "N"],
    ]

    # define a conversion function for integers
    def to_int(s):
        """
        Convert a numeric string to an integer.
        Allows a leading ! as an indicator of missing or uncertain data.
        Returns None if no data.
        """
        try:
            return int(s)
        except:
            try:
                return int(s[1:]) # ignore a leading !
            except:
                return None # assume has a leading ! and no value

    # define the conversion fns
    fieldtype_fns = {
        'A': str.rstrip,
        'A/N': str.rstrip,
        'N': to_int,
        # 'N': int,
        # 'D': lambda s: datetime.datetime.strptime(s, "%d%m%Y"), # ddmmyyyy
        # etc
    }

    # define a fixedwidth sample
    sample = """\
SF1ST TX04089000  00000023748        1 
SF1ST TX04090000  00000033748!       2
SF1ST TX04091000  00000043748!        
"""
    sample_data = sample.encode() # convert string to bytes
    file_like = io.BytesIO(sample_data) # create a file-like wrapper around bytes

    # iterate over record objects in the file
    for record in reader(file_like, fieldspecs, fieldtype_fns):
        # print(record)
        pprint.pprint(record.__dict__)

1
Because my old job regularly handled a million lines of fixed-width data, I looked into this question when I started using Python.
There are two kinds of fixed width:
1. ASCII fixed width (ASCII characters are 1 unit wide, double-byte characters are 2 units wide)
2. Unicode fixed width (ASCII characters and double-byte characters are both 1 unit wide)
If the source string consists entirely of ASCII characters, then ASCII fixed width = Unicode fixed width.
Fortunately, string and bytes are different things in py3, which avoids a lot of the confusion when dealing with double-byte encodings (e.g. gbk, big5, euc-jp, shift-jis, etc.). For "ASCII fixed width" processing, the string is usually converted to bytes and then split.
Without importing third-party modules, for a total of 1,000,000 lines, a line length of 800 bytes, and fixed-width args such as (10, 25, 4, ...), I split the lines in roughly 5 ways and reached the following conclusions:
  1. struct is the fastest (1x)
  2. A bare loop with no pre-processing of the fixed-width args is the slowest (5x+)
  3. slice(bytes) is faster than slice(string)
  4. With a bytes source string, the test results were: struct (1x), operator.itemgetter (1.7x), precompiled slice objects + list comprehension (2.8x), re.pattern objects (2.9x)

When dealing with large files, we often use with open(file, "rb") as f:.
The method above traverses one of these files in about 2.4 seconds.
I think a proper handler that processes 1,000,000 rows of data, splitting each row into 20 fields, should take less than 2.4 seconds.

I found that only struct and itemgetter meet that requirement.

PS: For readable display, I converted the unicode str to bytes. If you are in a double-byte environment, you don't need to do this.

from itertools import accumulate
from operator import itemgetter

def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    # Negative parameter field index
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    # Get slice args and Ignore fields of negative length
    ig_Args = tuple(item for i, item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    # Generate `operator.itemgetter` object
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj

lineb = b'abcdefghijklmnopqrstuvwxyz\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4\xb6\xee\xb7\xa2\xb8\xf6\xba\xcd0123456789'
line = lineb.decode("GBK")

# Unicode Fixed Width
fieldwidthsU = (13, -13, 4, -4, 5,-5) # Negative width fields is ignored
# ASCII Fixed Width
fieldwidths = (13, -13, 8, -8, 5,-5) # Negative width fields is ignored
# Unicode FixedWidth processing
parse = oprt_parser(fieldwidthsU)
fields = parse(line)
print('Unicode FixedWidth','fields: {}'.format(tuple(map(lambda s: s.encode("GBK"), fields))))
# ASCII FixedWidth processing
parse = oprt_parser(fieldwidths)
fields = parse(lineb)
print('ASCII FixedWidth','fields: {}'.format(fields))
line = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\n'
fieldwidths = (2, -10, 24)
parse = oprt_parser(fieldwidths)
fields = parse(line)
print(f"fields: {fields}")

Output:

Unicode FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
ASCII FixedWidth fields: (b'abcdefghijklm', b'\xb0\xa1\xb2\xbb\xb4\xd3\xb5\xc4', b'01234')
fields: ('AB', 'MNOPQRSTUVWXYZ0123456789')

oprt_parser is 4x make_parser (list comprehension + slice)


During this research I found that when the CPU is faster, the efficiency of the re method seems to grow faster.
Since I don't have more and better computers to test with, here is my test code; if anyone is interested, you can run it on a faster computer.
Run environment:
- OS: win10
- Python version: 3.7.2
- CPU: AMD Athlon X3 450
- HD: Seagate 1T
import timeit
import time
import re
from itertools import accumulate
from operator import itemgetter

def eff2(stmt,onlyNum= False,showResult=False):
    '''test function'''
    if onlyNum:
        rl = timeit.repeat(stmt=stmt,repeat=roundI,number=timesI,globals=globals())
        avg = sum(rl) / len(rl)
        return f"{avg * (10 ** 6)/timesI:0.4f}"
    else:
        rl = timeit.repeat(stmt=stmt,repeat=10,number=1000,globals=globals())
        avg = sum(rl) / len(rl)
        print(f"【{stmt}】")
        print(f"\tquick avg = {avg * (10 ** 6)/1000:0.4f} s/million")
        if showResult:
            print(f"\t  Result = {eval(stmt)}\n\t  timelist = {rl}\n")
        else:
            print("")

def upDouble(argList,argRate):
    return [c*argRate for c in argList]

tbStr = "000000001111000002222真2233333333000000004444444QAZ55555555000000006666666ABC这些事中文字abcdefghijk"
tbBytes = tbStr.encode("GBK")
a20 = (4,4,2,2,2,3,2,2, 2 ,2,8,8,7,3,8,8,7,3, 12 ,11)
a20U = (4,4,2,2,2,3,2,2, 1 ,2,8,8,7,3,8,8,7,3, 6 ,11)
Slng = 800
rateS = Slng // 100

tStr = "".join(upDouble(tbStr , rateS))
tBytes = tStr.encode("GBK")
spltArgs = upDouble( a20 , rateS)
spltArgsU = upDouble( a20U , rateS)

testList = []
timesI = 100000
roundI = 5
print(f"test round = {roundI} timesI = {timesI} sourceLng = {len(tStr)} argFieldCount = {len(spltArgs)}")


print(f"pure str \n{''.ljust(60,'-')}")
# ==========================================
def str_parser(sArgs):
    def prsr(oStr):
        r = []
        r_ap = r.append
        stt=0
        for lng in sArgs:
            end = stt + lng 
            r_ap(oStr[stt:end])
            stt = end 
        return tuple(r)
    return prsr

Str_P = str_parser(spltArgsU)
# eff2("Str_P(tStr)")
testList.append("Str_P(tStr)")

print(f"pure bytes \n{''.ljust(60,'-')}")
# ==========================================
def byte_parser(sArgs):
    def prsr(oBytes):
        r, stt = [], 0
        r_ap = r.append
        for lng in sArgs:
            end = stt + lng
            r_ap(oBytes[stt:end])
            stt = end
        return r
    return prsr
Byte_P = byte_parser(spltArgs)
# eff2("Byte_P(tBytes)")
testList.append("Byte_P(tBytes)")

# re,bytes
print(f"re compile object \n{''.ljust(60,'-')}")
# ==========================================


def rebc_parser(sArgs,otype="b"):
    re_Args = "".join([f"(.{{{n}}})" for n in sArgs])
    if otype == "b":
        rebc_Args = re.compile(re_Args.encode("GBK"))
    else:
        rebc_Args = re.compile(re_Args)
    def prsr(oBS):
        return rebc_Args.match(oBS).groups()
    return prsr
Rebc_P = rebc_parser(spltArgs)
# eff2("Rebc_P(tBytes)")
testList.append("Rebc_P(tBytes)")

Rebc_Ps = rebc_parser(spltArgsU,"s")
# eff2("Rebc_Ps(tStr)")
testList.append("Rebc_Ps(tStr)")


print(f"struct \n{''.ljust(60,'-')}")
# ==========================================

import struct
def struct_parser(sArgs):
    struct_Args = " ".join(map(lambda x: str(x) + "s", sArgs))
    def prsr(oBytes):
        return struct.unpack(struct_Args, oBytes)
    return prsr
Struct_P = struct_parser(spltArgs)
# eff2("Struct_P(tBytes)")
testList.append("Struct_P(tBytes)")

print(f"List Comprehensions + slice \n{''.ljust(60,'-')}")
# ==========================================
import itertools
def slice_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    slice_Args = tuple(zip((0,)+tl,tl))
    def prsr(oBytes):
        return [oBytes[s:e] for s, e in slice_Args]
    return prsr
Slice_P = slice_parser(spltArgs)
# eff2("Slice_P(tBytes)")
testList.append("Slice_P(tBytes)")

def sliceObj_parser(sArgs):
    tl = tuple(itertools.accumulate(sArgs))
    tl2 = tuple(zip((0,)+tl,tl))
    sliceObj_Args = tuple(slice(s,e) for s,e in tl2)
    def prsr(oBytes):
        return [oBytes[so] for so in sliceObj_Args]
    return prsr
SliceObj_P = sliceObj_parser(spltArgs)
# eff2("SliceObj_P(tBytes)")
testList.append("SliceObj_P(tBytes)")

SliceObj_Ps = sliceObj_parser(spltArgsU)
# eff2("SliceObj_Ps(tStr)")
testList.append("SliceObj_Ps(tStr)")


print(f"operator.itemgetter + slice object \n{''.ljust(60,'-')}")
# ==========================================

def oprt_parser(sArgs):
    sum_arg = tuple(accumulate(abs(i) for i in sArgs))
    cuts = tuple(i for i,num in enumerate(sArgs) if num < 0)
    ig_Args = tuple(item for i,item in enumerate(zip((0,)+sum_arg,sum_arg)) if i not in cuts)
    oprtObj =itemgetter(*[slice(s,e) for s,e in ig_Args])
    return oprtObj

Oprt_P = oprt_parser(spltArgs)
# eff2("Oprt_P(tBytes)")
testList.append("Oprt_P(tBytes)")

Oprt_Ps = oprt_parser(spltArgsU)
# eff2("Oprt_Ps(tStr)")
testList.append("Oprt_Ps(tStr)")

print("|".join([s.split("(")[0].center(11," ") for s in testList]))
print("|".join(["".center(11,"-") for s in testList]))
print("|".join([eff2(s,True).rjust(11," ") for s in testList]))

Output:

Test round = 5 timesI = 100000 sourceLng = 744 argFieldCount = 20
...
...
   Str_P   |  Byte_P   |  Rebc_P   |  Rebc_Ps  | Struct_P  |  Slice_P  |SliceObj_P |SliceObj_Ps|  Oprt_P   |  Oprt_Ps
-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------
     9.6315|     7.5952|     4.4187|     5.6867|     1.5123|     5.2915|     4.2673|     5.7121|     2.4713|     3.9051

@MartijnPieters A more efficient function - notback

0

String slicing doesn't have to be ugly as long as you keep it organized. Consider storing your field widths in a dictionary, and then using the associated names to create an object:

from collections import OrderedDict

class Entry:
    def __init__(self, line):

        name2width = OrderedDict()
        name2width['foo'] = 2
        name2width['bar'] = 3
        name2width['baz'] = 2

        pos = 0
        for name, width in name2width.items():

            val = line[pos : pos + width]
            if len(val) != width:
                raise ValueError("not enough characters: \'{}\'".format(line))

            setattr(self, name, val)
            pos += width

file = "ab789yz\ncd987wx\nef555uv"

entry = []

for line in file.split('\n'):
    entry.append(Entry(line))

print(entry[1].bar) # output: 987
