Iterate over individual bytes in Python 3

Question

73

When iterating over a bytes object in Python 3, you get the individual bytes as ints:

>>> [b for b in b'123']
[49, 50, 51]

How can I get bytes objects of length 1 instead?

The following works, but it is not very obvious to the reader and most likely performs badly:

b = b'x'[0:1]

>>> [bytes([b]) for b in b'123']
[b'1', b'2', b'3']

- flying sheep

I wonder whether an array object would suit your purpose better and avoid unnecessary conversions. - Mayur Patel

1

It behaves the same, or what do you mean? >>> [b for b in bytearray(b"123")] ⇒ [49, 50, 51] - flying sheep

1

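I don't think Python has a distinct "character" type. If you look at the documentation for the array module, you will see that a "character" in Python is a 1-byte integer, so the results you are seeing are consistent. I suggested the array (without fully understanding your application) because it would avoid the unnecessary type conversion and object construction that can happen with lists. I suspect even strings result in additional work, but I'm not sure. As others have noted, you can use indexes to pull out the items you need. - Mayur Patel

When you say "array", do you mean "bytearray"? - flying sheep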

2

Does anyone know why Python 3 returns integers? Personally, I prefer the Python 2 behavior. - guettli

1

Because that is what a byte string is by definition: a sequence of numbers from 0 to 255 that can represent any kind of data. - flying sheep

7 Answers

30

int.to_bytes

int objects have a to_bytes method that converts an int to its corresponding byte representation:

>>> import sys
>>> [i.to_bytes(1, sys.byteorder) for i in b'123']
[b'1', b'2', b'3']

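As with some of the other answers, it is not clear that this is more readable than the original solution: I think the length and byteorder arguments make it noisier.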
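For a single byte there is nothing to reorder, so a fixed literal such as 'big' should give the same result as sys.byteorder. A minimal sketch of that idea follows; the helper name split_with_to_bytes is made up for illustration and is not part of the answer.

```python
def split_with_to_bytes(data):
    """Return a list of length-1 bytes objects, one per byte in `data`."""
    # With length=1 there is only one byte to order, so 'big' and
    # 'little' produce identical results.
    return [i.to_bytes(1, 'big') for i in data]


print(split_with_to_bytes(b'12\xff'))  # [b'1', b'2', b'\xff']
```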

struct.unpack

Another option is struct.unpack, though it may also be hard to read unless you are familiar with the struct module:

>>> import struct
>>> struct.unpack('3c', b'123')
(b'1', b'2', b'3')

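(As jfs observed in the comments, the format string for struct.unpack can be built dynamically; in this case we know that the number of individual bytes in the result must equal the number of bytes in the original bytestring, so struct.unpack(str(len(bytestring)) + 'c', bytestring) is possible.)

Performance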
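The dynamically built format string described above could be wrapped in a small helper. This is only a sketch; the name split_with_struct is hypothetical.

```python
import struct


def split_with_struct(data):
    """Split `data` into a tuple of length-1 bytes objects."""
    # 'c' is the struct code for a single char (a bytes object of length 1);
    # the repeat count in front of it must equal len(data).
    fmt = '%dc' % len(data)
    return struct.unpack(fmt, data)


print(split_with_struct(b'123'))  # (b'1', b'2', b'3')
```

Note that the result is a tuple, not a list, and that the whole input is unpacked at once rather than lazily.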

>>> import random, timeit
>>> bs = bytes(random.randint(0, 255) for i in range(100))

>>> # OP's solution
>>> timeit.timeit(setup="from __main__ import bs",
                  stmt="[bytes([b]) for b in bs]")
46.49886950897053

>>> # Accepted answer from jfs
>>> timeit.timeit(setup="from __main__ import bs",
                  stmt="[bs[i:i+1] for i in range(len(bs))]")
20.91463226894848

>>> # Leon's answer
>>> timeit.timeit(setup="from __main__ import bs", 
                  stmt="list(map(bytes, zip(bs)))")
27.476876026019454

>>> # guettli's answer
>>> timeit.timeit(setup="from __main__ import iter_bytes, bs",        
                  stmt="list(iter_bytes(bs))")
24.107485140906647

>>> # user38's answer (with Leon's suggested fix)
>>> timeit.timeit(setup="from __main__ import bs", 
                  stmt="[chr(i).encode('latin-1') for i in bs]")
45.937552741961554

>>> # Using int.to_bytes
>>> timeit.timeit(setup="from __main__ import bs;from sys import byteorder", 
                  stmt="[x.to_bytes(1, byteorder) for x in bs]")
32.197659170022234

>>> # Using struct.unpack, converting the resulting tuple to list
>>> # to be fair to other methods
>>> timeit.timeit(setup="from __main__ import bs;from struct import unpack", 
                  stmt="list(unpack('100c', bs))")
1.902243083808571

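struct.unpack appears to be at least an order of magnitude faster than the other methods, presumably because it operates at the byte level. int.to_bytes, on the other hand, performs worse than most of the "obvious" approaches.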
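To reproduce a comparison like this on your own machine, a small harness along the following lines may help. It is a sketch under assumptions: the variants dictionary and its labels are made up for illustration, only three of the approaches are included, and the repeat count is far lower than timeit's default, so absolute numbers will not match the ones above.

```python
import struct
import timeit

data = bytes(range(256))  # every possible byte value exactly once

variants = {
    'slice': lambda bs: [bs[i:i+1] for i in range(len(bs))],
    'bytes([b])': lambda bs: [bytes([b]) for b in bs],
    'struct': lambda bs: list(struct.unpack('%dc' % len(bs), bs)),
}

expected = variants['slice'](data)
for name, func in variants.items():
    # Check correctness first: a fast but wrong variant is not worth timing.
    assert func(data) == expected, name
    seconds = timeit.timeit(lambda: func(data), number=1000)
    print('%-12s %.4f s' % (name, seconds))
```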

- snakecharmerb

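Great answer. It definitely deserves the bounty. - Leon

@Leon FWIW, I think your answer is the most Pythonic; I guess where the bounty goes depends on whether the funder wants readability or performance :) (or the appearance of more and better answers). - snakecharmerb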

12

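I thought it might be useful to compare the run times of the different approaches, so I made a benchmark (using my library simple_benchmark):

For large bytes objects, the NumPy solution is by far the fastest.

But if a list is required as the result, both the NumPy solution (with tolist()) and the struct solution are much faster than the other alternatives.

I did not include guettli's answer because it is almost identical to jfs' solution, just using a generator function instead of a comprehension.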

import numpy as np
import struct
import sys

from simple_benchmark import BenchmarkBuilder
b = BenchmarkBuilder()

@b.add_function()
def jfs(bytes_obj):
    return [bytes_obj[i:i+1] for i in range(len(bytes_obj))]

@b.add_function()
def snakecharmerb_tobytes(bytes_obj):
    return [i.to_bytes(1, sys.byteorder) for i in bytes_obj]

@b.add_function()
def snakecharmerb_struct(bytes_obj):
    return struct.unpack(str(len(bytes_obj)) + 'c', bytes_obj)

@b.add_function()
def Leon(bytes_obj):
    return list(map(bytes, zip(bytes_obj)))

@b.add_function()
def rusu_ro1_format(bytes_obj):
    return [b'%c' % i for i in bytes_obj]

@b.add_function()
def rusu_ro1_numpy(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1')

@b.add_function()
def rusu_ro1_numpy_tolist(bytes_obj):
    return np.frombuffer(bytes_obj, dtype='S1').tolist()

@b.add_function()
def User38(bytes_obj):
    return [chr(i).encode() for i in bytes_obj]

@b.add_arguments('byte object length')
def argument_provider():
    for exp in range(2, 18):
        size = 2**exp
        yield size, b'a' * size

r = b.run()
r.plot()

- MSeifert

1

Nice graph. In my current context, performance does not matter at all. It should work, and the code should be easy to read and understand. - guettli

1

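Note: rusu_ro1_numpy does not actually "iterate over individual bytes". The benchmark shows it does not even copy the bytes, since the time is constant. Why do we need a numpy array at all? bytes_obj is already an iterable (of ints). If an iterable (of bytes) is an acceptable solution, then your benchmark shows that snakecharmerb_struct is the fastest (it copies the bytes, but it does not "iterate"). Among the solutions that do iterate over individual bytes, the benchmark shows that the bytes_obj[i:i+1] variant is the fastest. - jfs

@jfs Yes, that's right. The NumPy and struct solutions only represent the iterable as bytes rather than iterating over it. However, these solutions received several upvotes, so excluding them might be unfair, but perhaps I should discuss the differences in more detail. Maybe I will find time to revise the answer in the next few days. Thanks. - MSeifert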

11

Since Python 3.5 you can use % formatting with bytes and bytearray:

[b'%c' % i for i in b'123']

Output:

[b'1', b'2', b'3']
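As a quick check that the %c approach also handles non-ASCII byte values, here is a small sketch; the sample input is made up for illustration.

```python
# %c with an int inserts the single byte with that value (Python 3.5+).
data = b'1\x00\x80\xff'
singles = [b'%c' % i for i in data]
print(singles)  # [b'1', b'\x00', b'\x80', b'\xff']
```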

The solution above is 2-3 times faster than your initial approach. If you want something faster still, I suggest numpy.frombuffer:

import numpy as np
np.frombuffer(b'123', dtype='S1')

Output:

array([b'1', b'2', b'3'], 
      dtype='|S1')

The second solution is about 10% faster than struct.unpack (I used the same performance test as @snakecharmerb, on 100 random bytes).
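If NumPy is not available, a similar view without copying can be sketched with the standard library's memoryview. This is a different technique from the one in this answer, shown only for comparison, and it has not been benchmarked here.

```python
data = b'123'

# Reinterpret the buffer with the struct format 'c' (single char);
# the view shares memory with `data` instead of copying it.
view = memoryview(data).cast('c')

print(view[0])     # b'1'
print(list(view))  # [b'1', b'2', b'3']
```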

- kederrac

7

I use this helper function:

def iter_bytes(my_bytes):
    for i in range(len(my_bytes)):
        yield my_bytes[i:i+1]

Works in both Python 2 and Python 3.
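A short usage sketch for Python 3 (the helper is repeated so the snippet runs on its own; the sample inputs are made up). Because the helper slices, the chunks keep the type of the input, so a bytearray yields bytearray chunks.

```python
def iter_bytes(my_bytes):
    for i in range(len(my_bytes)):
        yield my_bytes[i:i+1]


# The generator is lazy: chunks are produced one at a time.
for chunk in iter_bytes(b'\x00\xffA'):
    print(chunk)

# Slicing a bytearray returns a bytearray, not bytes.
print(list(iter_bytes(bytearray(b'ab'))))
```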

- guettli

7

A trio of map(), bytes() and zip() does the trick:

>>> list(map(bytes, zip(b'123')))
[b'1', b'2', b'3']

However, I don't think it is any more readable than [bytes([b]) for b in b'123'], nor does it perform better.
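To make the one-liner less opaque, the two steps can be spelled out separately. This is only an illustrative breakdown; the variable names are made up.

```python
data = b'123'

# zip() over a single iterable wraps each item in a 1-tuple of ints.
tuples = list(zip(data))
print(tuples)   # [(49,), (50,), (51,)]

# bytes() accepts an iterable of ints, so each 1-tuple becomes
# a bytes object of length 1.
singles = list(map(bytes, tuples))
print(singles)  # [b'1', b'2', b'3']
```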

- Leon

1

A short way to do this:

[bytes([i]) for i in b'123\xaa\xbb\xcc\xff']

- user38

3

This does not work correctly if the input bytes object contains values in the range 128-255. You need to use the latin-1 encoding (the same as iso-8859-1) to fix that: [chr(i).encode('latin-1') for i in b'\x80\xb2\xff'] - Leon
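The difference the comment above describes can be seen on a single value. A minimal sketch, assuming the variant under discussion is chr(i).encode() with its default UTF-8 encoding:

```python
value = 0x80

# The default encoding is UTF-8, which needs two bytes for code point 0x80.
utf8 = chr(value).encode()
print(utf8)    # b'\xc2\x80'

# latin-1 maps code points 0-255 to the byte with the same value.
latin1 = chr(value).encode('latin-1')
print(latin1)  # b'\x80'
```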

- jfs · Accepted Answer

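If you are concerned about the performance of this code and ints as bytes are not a suitable interface in your case, then you should probably reconsider the data structures you use, e.g., use str objects instead.

You can slice the bytes object to get bytes objects of length 1: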

L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
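The reason slicing works where indexing does not can be shown in a few lines. A small sketch with a made-up sample value:

```python
bytes_obj = b'123'

print(bytes_obj[0])    # 49, indexing returns an int
print(bytes_obj[0:1])  # b'1', slicing returns a bytes object
print(bytes_obj[5:6])  # b'', slicing past the end does not raise

L = [bytes_obj[i:i+1] for i in range(len(bytes_obj))]
print(L)               # [b'1', b'2', b'3']
```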

PEP 0467 -- Minor API improvements for binary sequences proposes adding a bytes.iterbytes() method:

>>> list(b'123'.iterbytes())
[b'1', b'2', b'3']