Storing text files as binary for faster read/write

Question

4

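I have a large batch of text files that I need to process. Performance is already fairly good thanks to pmap(), but I'm looking for an additional speedup. The current bottleneck is parsing strings into floats.

I'm considering loading my data (pipe-delimited text files) and writing it out in a binary format. From what I've seen, Julia should be able to load the data faster that way. The problem is that, after writing it as binary, I can't correctly load my binary data back into Julia.

Here is some sample code that loads, parses, and writes binary: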

input_file = "/cool/input/file.dat"   ## -- pipe delimited text file of Floats
output_file = "/cool/input/data_binary" 

open(input_file, "r") do in_file
    open(output_file, "w") do out_file
        for line in eachline(in_file)
            split_line = split(line, '|')
            out_float = parse(Float64, split_line[4])
            write(out_file, out_float)
        end
    end
end

The problem is that when I load the file above into Julia, I don't know what the values are:

read(output_file)
n-element Array{UInt8,1}:
 0x00
 0x00
 0x00
 0x00
 0x00
 0x80
 0x16

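How can I use these binary values as floats in my Julia code? More generally, if I'm looking for better performance, does it make sense to convert my text file data to binary this way?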

- Jeremy McNees

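Just to confirm this isn't an XY problem: is there a clear reason you need a specific format, or are you just looking for an efficient way to save/load preprocessed data to disk for later processing? - Tasos Papastylianou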

1

Just trying to find a more efficient way to save/load the data. I'll take a look at JLD.jl. - Jeremy McNees

3 Answers

2

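I don't know how much actual speedup I get, but when I store large collections of unstructured variables I use the .h5 format (as opposed to the kind of data I would store in a relational database or a .json file).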

- isebarn

2

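The "official" solution/format for "saving" data from the workspace of a Julia session is the JLD package.

Think of it as the equivalent of .mat files in MATLAB or shelve in Python.

Alternatively, you can use the serialize command to serialize data directly to a file (but do read the warnings on the JLD page).

If a particular project needs the specific "home-made" serialization approach described in your question, you can use it. Otherwise, it sounds like you simply want a general way to store and access serialized data efficiently, so use JLD instead.

Example: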

### in file "floats.dat"
1.0|2.0|3.0|4.0|5.0|6.0|7.0|8.0|9.0|10.0

# Using .jld files
julia> using JLD
julia> S = split( chomp( readstring("floats.dat")), '|');
julia> Floats = [parse(Float64, x) for x in S];
julia> save("myFloats.jld","Floats",Floats)
julia> load("myFloats.jld")
Dict{String,Any} with 1 entry:
  "Floats" => [1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

# Using serialize / deserialize
julia> S = split( chomp( readstring("floats.dat")), '|');
julia> Floats = [parse(Float64, x) for x in S];
julia> f = open("out.dat", "w"); serialize(f, Floats); close(f);
julia> f = open("out.dat", "r"); show(deserialize(f))
[1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]
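
The transcript above targets an older Julia release. As a rough sketch only, assuming Julia 1.1 or later, the serialize variant might look like the following; the temporary directory and the variable names are placeholders chosen for illustration, not part of the original answer.

```julia
using Serialization  # serialize/deserialize live in this standard library on Julia 1.x

# Hypothetical sample file mirroring "floats.dat" from the example above
path = joinpath(mktempdir(), "floats.dat")
write(path, "1.0|2.0|3.0|4.0|5.0|6.0|7.0|8.0|9.0|10.0\n")

# read(path, String) replaces the older readstring(path)
fields = split(chomp(read(path, String)), '|')
floats = parse.(Float64, fields)

# serialize/deserialize also accept a file name directly (Julia 1.1 or later)
out_path = joinpath(dirname(path), "out.dat")
serialize(out_path, floats)
restored = deserialize(out_path)
```

The serialized format is generally not meant to be read back by a different Julia version, so this fits short-lived caches better than long-term storage.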

- Tasos Papastylianou

Thanks. This is probably a better approach than what I was trying to do. I'll look into the package. - Jeremy McNees

- HarmonicaMuse · Accepted Answer

You need to use the reinterpret function:

help?> reinterpret
search: reinterpret

  reinterpret(type, A)

  Change the type-interpretation of a block of memory. For example,
  reinterpret(Float32, UInt32(7)) interprets the 4 bytes corresponding
  to UInt32(7) as a Float32. For arrays, this constructs an array with 
  the same binary data as the given array, but with the specified element
  type.
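
To make the help text concrete, here is a small sketch of both uses (scalar and array), assuming Julia 1.x. The variable names are illustrative only.

```julia
# Scalar: view the bits of a Float64 as an unsigned 64-bit integer
bits = reinterpret(UInt64, 1.0)           # IEEE 754 bit pattern of 1.0

# Array: view a vector of Float64 as raw bytes, then back again
vals = [1.0, 2.5, -3.75]
bytes = reinterpret(UInt8, vals)          # 8 bytes per Float64, no copy
roundtrip = collect(reinterpret(Float64, bytes))
```

No numeric conversion takes place in either direction: the same memory is simply read with a different element type, which is why the round trip recovers the original values exactly.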

A function to write numeric data:

julia> function write_data{T<:Number}(file_name::String, data::AbstractArray{T})
           open(file_name, "w") do f_out
               for i in data
                   write(f_out, i)
               end
           end
       end
write_data (generic function with 1 method)

Random data:

julia> data = rand(10)
10-element Array{Float64,1}:
 0.986948
 0.616107
 0.504965
 0.673264
 0.0358904
 0.1795
 0.399481
 0.233351
 0.320968
 0.16746

Writing the data to foo.bin:

julia> write_data("foo.bin", data)

A function to read the binary data and reinterpret it as a numeric data type:

julia> function read_data{T<:Number}(file_name::String, dtype::Type{T})
           open(file_name, "r") do f_in
               reinterpret(dtype, read(f_in))
           end
       end
read_data (generic function with 1 method)
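
The two functions above use the parametric-method syntax of older Julia releases. A possible Julia 1.x version is sketched below, assuming the element type is a plain bits type such as Float64. It writes the array in one call and reads directly into a typed vector with read!. The scratch path is a placeholder.

```julia
# Write the whole array in one call; returns the number of bytes written
function write_data(file_name::AbstractString, data::AbstractArray{T}) where {T<:Number}
    open(file_name, "w") do io
        write(io, data)
    end
end

# Read the file straight into a Vector{T}, sized from the file length
function read_data(file_name::AbstractString, ::Type{T}) where {T<:Number}
    n = filesize(file_name) ÷ sizeof(T)
    open(file_name, "r") do io
        read!(io, Vector{T}(undef, n))
    end
end

file = joinpath(mktempdir(), "foo.bin")   # hypothetical scratch path
data = rand(10)
nbytes = write_data(file, data)
restored = read_data(file, Float64)
```

Reading with read! fills a preallocated Vector{T}, so the result is an ordinary array rather than a reinterpreted view over a byte buffer.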

Reading the sample data back as Float64 yields the same array we wrote:

julia> read_data("foo.bin", Float64)
10-element Array{Float64,1}:
 0.986948
 0.616107
 0.504965
 0.673264
 0.0358904
 0.1795
 0.399481
 0.233351
 0.320968
 0.16746

Reinterpreting it as Float32 instead naturally yields twice as many elements, since each Float32 takes 4 bytes rather than 8:

julia> read_data("foo.bin", Float32)
20-element Array{Float32,1}:
  1.4035f7
  1.87174
 -9.17366f25
  1.77903
 -1.03106f-24
  1.75124
  1.9495f-20
  1.79332
  2.88032f-21
  1.26856
  1.17736f19
  1.5545
 -3.25944f-18
  1.69974
  5.25285f-17
  1.60835
 -3.46489f14
  1.66048
  1.91915f-25
  1.54246
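
The element counts follow directly from the byte sizes: the file holds only raw bytes and records no element type, so the reader has to supply the same type that was used when writing. A minimal in-memory sketch, assuming Julia 1.x and using an IOBuffer in place of a file:

```julia
io = IOBuffer()
write(io, [1.0, 2.0, 3.0])           # three Float64 values, 8 bytes each
raw = take!(io)                      # Vector{UInt8} holding the raw bytes

n_bytes = length(raw)                # 3 * sizeof(Float64)
as_f64 = reinterpret(Float64, raw)   # the original three values
as_f32 = reinterpret(Float32, raw)   # twice as many elements, values are meaningless
```

Dividing the byte count by sizeof of the target type gives the element count in each case, which is how 10 Float64 values turn into 20 Float32 values above.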