The fastest way I have found to convert a string to an array is:
arr = np.array([mystr]).view(dtype='U1')
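As a small illustration of what the view does (the sample string is made up for this sketch): wrapping the string in a list gives a one-element array whose dtype is wide enough for the whole string, and viewing it as U1 reinterprets the same buffer as one element per code point.

```python
import numpy as np

mystr = "héllo"  # hypothetical sample string, not from the original answer

# One-element array; its dtype is a fixed-width unicode type of length len(mystr).
wrapped = np.array([mystr])

# Reinterpret the same memory as len(mystr) one-character elements.
arr = wrapped.view(dtype='U1')

print(wrapped.dtype)   # fixed-width unicode dtype holding 5 characters
print(arr.shape)       # (5,)
print(arr.tolist())    # ['h', 'é', 'l', 'l', 'o']
```

Note that NumPy's fixed-width strings drop trailing NUL characters, so this sketch assumes the string does not end in '\0'.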
Another (slower) way to convert a string to an array based on its Unicode code points, following
@Daniel Mesejo's comment:
arr = np.fromiter(mystr, dtype='U1', count=len(mystr))
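Looking at the source code of fromiter shows that setting the count argument to the length of the string causes the whole array to be allocated at once, rather than going through several reallocations.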
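A minimal sketch of the fromiter variant, assuming a made-up sample string, which also checks that it produces the same array as the view-based approach:

```python
import numpy as np

mystr = "héllo wörld"  # hypothetical sample string

# Iterating a str yields one-character strings; count lets NumPy size
# the output array up front instead of growing it while iterating.
arr_iter = np.fromiter(mystr, dtype='U1', count=len(mystr))

# The view-based conversion, for comparison.
arr_view = np.array([mystr]).view(dtype='U1')

print(arr_iter.dtype, arr_iter.shape)
print(np.array_equal(arr_iter, arr_view))
```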
To convert back to a string:
str(arr.view(dtype=f'U{arr.size}')[0])
In most cases, the final conversion to a Python str is unnecessary, because np.str_ is a subclass of str:
arr.view(dtype=f'U{arr.size}')[0]
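The round trip can be sketched as follows, assuming a made-up sample string and a contiguous array (which is what the conversions above produce). It shows that the viewed element is an np.str_ that already behaves like a str.

```python
import numpy as np

mystr = "héllo wörld"  # hypothetical sample string
arr = np.array([mystr]).view(dtype='U1')

# View all characters as a single fixed-width string and take that element.
joined = arr.view(dtype=f'U{arr.size}')[0]

print(type(joined))             # a NumPy string scalar (np.str_)
print(isinstance(joined, str))  # True, since np.str_ subclasses str
print(joined == mystr)          # True

# Only needed when an exact built-in str is required.
plain = str(joined)
print(type(plain) is str)       # True
```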
Appendix: timing of frombuffer vs. array
100
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(100))
%timeit np.array([mystr]).view(dtype='U1')
1.43 µs ± 27.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
1.2 µs ± 9.06 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
10,000
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(10000))
%timeit np.array([mystr]).view(dtype='U1')
4.33 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
10.9 µs ± 29.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1,000,000
mystr = ''.join(chr(random.choice(range(1, 0x1000))) for _ in range(1000000))
%timeit np.array([mystr]).view(dtype='U1')
672 µs ± 1.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.frombuffer(bytearray(mystr, 'utf-32-le'), dtype='U1')
732 µs ± 5.22 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)