For some reason, the current implementation of Import for the Table data format is very memory-inefficient. Below I attempt to improve the situation while still reusing Mathematica's high-level importing capabilities (via ImportString). For sparse tables, a separate solution is presented, which can lead to very significant memory savings.
A general memory-saving solution
Here is a more memory-efficient function:
Clear[readTable];
readTable[file_String?FileExistsQ, chunkSize_: 100] :=
Module[{stream, dataChunk, result, linkedList, add},
SetAttributes[linkedList, HoldAllComplete];
add[ll_, value_] := linkedList[ll, value];
stream = StringToStream[Import[file, "String"]];
Internal`WithLocalSettings[
Null,
(* main code *)
result = linkedList[];
While[dataChunk =!= {},
dataChunk =
ImportString[
StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]],
"Table"];
result = add[result, dataChunk];
];
result = Flatten[result, Infinity, linkedList],
(* clean-up *)
Close[stream]
];
Join @@ result]
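To see the linked-list trick in isolation: giving linkedList the HoldAllComplete attribute lets us accumulate chunks in O(1) per step without copying the growing structure, and a single Flatten at the end, with linkedList as the head to flatten, restores a flat sequence. Here is a minimal sketch of my own (toy chunks standing in for the ones read from the stream):

ClearAll[ll, add, acc];
SetAttributes[ll, HoldAllComplete];
add[l_, v_] := ll[l, v];  (* arguments evaluate before ll wraps them *)
acc = ll[];
Do[acc = add[acc, c], {c, {{1, 2}, {3}, {4, 5, 6}}}];  (* accumulate chunks *)
Join @@ Flatten[acc, Infinity, ll]  (* flatten nested ll heads, then concatenate: {1, 2, 3, 4, 5, 6} *)

Note that the add helper is essential: assigning acc = ll[acc, c] directly would store the symbols acc and c unevaluated inside ll, because of HoldAllComplete.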
Here, I compare it with the standard Import, for your file:
In[3]:= used = MaxMemoryUsed[]
Out[3]= 18009752
In[4]:=
(tt = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[4]= {34.367,Null}
In[5]:= used = MaxMemoryUsed[]-used
Out[5]= 228975672
In[6]:=
(t = Import["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt","Table"]);//Timing
Out[6]= {25.615,Null}
In[7]:= used = MaxMemoryUsed[]-used
Out[7]= 2187743192
In[8]:= tt===t
Out[8]= True
You can see that my code is roughly 10 times more memory-efficient than Import, while being comparable in speed. You can control the memory consumption by adjusting the chunkSize parameter. The resulting table occupies about 150-200 MB of RAM.
EDIT
Making it more memory-efficient for sparse tables
I want to illustrate how to make this function yet 2-3 times more memory-efficient during the import, and additionally far more memory-efficient in terms of the final memory occupied by your table, by using SparseArray. The degree of memory-efficiency gain depends strongly on how sparse your table is. In your example, the table is very sparse.
Anatomy of SparseArray
We start with a generally useful API for construction and deconstruction of SparseArray objects:
ClearAll[spart, getIC, getJR, getSparseData, getDefaultElement, makeSparseArray];
HoldPattern[spart[SparseArray[s___], p_]] := {s}[[p]];
getIC[s_SparseArray] := spart[s, 4][[2, 1]];
getJR[s_SparseArray] := Flatten@spart[s, 4][[2, 2]];
getSparseData[s_SparseArray] := spart[s, 4][[3]];
getDefaultElement[s_SparseArray] := spart[s, 3];
makeSparseArray[dims : {_, _}, jc : {__Integer}, ir : {__Integer},
data_List, defElem_: 0] :=
SparseArray @@ {Automatic, dims, defElem, {1, {jc, List /@ ir}, data}};
A few brief comments are in order. Here is a sample sparse array:
In[15]:=
ToHeldExpression@ToString@FullForm[sp = SparseArray[{{0,0,1,0,2},{3,0,0,0,4},{0,5,0,6,7}}]]
Out[15]=
Hold[SparseArray[Automatic,{3,5},0,{1,{{0,2,4,7},{{3},{5},{1},{5},{2},{4},{5}}},
{1,2,3,4,5,6,7}}]]
(I used the ToString - ToHeldExpression cycle to convert List[...] etc. in the FullForm back to {...} for easier reading). Here, {3,5} clearly represents the dimensions of the array. Following this is a default value of 0, and then a nested list which we can denote as {1, {ic, jr}, sparseData}. The ic list gives the cumulative number of non-zero elements added for each row: starting with 0, then adding 2 after the first row, 2 more after the second, and finally 3 more after the last. The jr list gives the positions of the non-zero elements in each row: the first row has elements at positions 3 and 5, the second row at positions 1 and 5, and the last row at positions 2, 4, and 5. The ordering of the elements in the sparseData list follows the same order as the elements in the jr list, and represents the non-zero values read row by row from left to right. Knowing this internal format should hopefully clarify the role of the above functions.
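As a quick consistency check of my own (not part of the original benchmark), we can deconstruct the sample array with these helpers and rebuild it with makeSparseArray. Keep in mind that spart and friends rely on the undocumented internal FullForm of SparseArray, which may change between versions:

sp = SparseArray[{{0, 0, 1, 0, 2}, {3, 0, 0, 0, 4}, {0, 5, 0, 6, 7}}];
getIC[sp]          (* {0, 2, 4, 7} *)
getJR[sp]          (* {3, 5, 1, 5, 2, 4, 5} *)
getSparseData[sp]  (* {1, 2, 3, 4, 5, 6, 7} *)
rebuilt = makeSparseArray[Dimensions[sp], getIC[sp], getJR[sp], getSparseData[sp]];
Normal[rebuilt] === Normal[sp]  (* True *)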
The code:
Clear[readSparseTable];
readSparseTable[file_String?FileExistsQ, chunkSize_: 100] :=
Module[{stream, dataChunk, start, ic = {}, jr = {}, sparseData = {},
getDataChunkCode, dims},
stream = StringToStream[Import[file, "String"]];
getDataChunkCode :=
If[# === {}, {}, SparseArray[#]] &@
ImportString[
StringJoin[Riffle[ReadList[stream, "String", chunkSize], "\n"]],
"Table"];
Internal`WithLocalSettings[
Null,
(* main code *)
start = getDataChunkCode;
ic = getIC[start];
jr = getJR[start];
sparseData = getSparseData[start];
dims = Dimensions[start];
While[True,
dataChunk = getDataChunkCode;
If[dataChunk === {}, Break[]];
ic = Join[ic, Rest@getIC[dataChunk] + Last@ic];
jr = Join[jr, getJR[dataChunk]];
sparseData = Join[sparseData, getSparseData[dataChunk]];
dims[[1]] += First[Dimensions[dataChunk]];
],
(* clean-up *)
Close[stream]
];
makeSparseArray[dims, ic, jr, sparseData]]
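The core of the While loop is the merging of per-chunk sparse data: the cumulative row counts (ic) of a new chunk are offset by the running total before being appended, while the column positions (jr) and values are simply concatenated. Here is my own hand-run of this recipe on two toy chunks, using the helpers defined above:

chunk1 = SparseArray[{{0, 1}, {2, 0}}];
chunk2 = SparseArray[{{0, 0}, {3, 4}}];
ic = Join[getIC[chunk1], Rest@getIC[chunk2] + Last@getIC[chunk1]];  (* {0, 1, 2, 2, 4} *)
jr = Join[getJR[chunk1], getJR[chunk2]];                            (* {2, 1, 1, 2} *)
data = Join[getSparseData[chunk1], getSparseData[chunk2]];          (* {1, 2, 3, 4} *)
merged = makeSparseArray[{4, 2}, ic, jr, data];
Normal[merged] === Join[Normal[chunk1], Normal[chunk2]]             (* True *)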
Benchmarks and comparison
Here is the starting amount of memory used (fresh kernel):
In[10]:= used = MemoryInUse[]
Out[10]= 17910208
We call our function:
In[11]:=
(tsparse = readSparseTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[11]= {39.874,Null}
So, the speed is about the same as for readTable. What about the memory usage?
In[12]:= used = MaxMemoryUsed[]-used
Out[12]= 80863296
I consider this rather remarkable: at no point did we use more than roughly twice the memory that the file itself occupies on disk. But, even more remarkably, the final memory usage after the computation finished is much smaller:
In[13]:= MemoryInUse[]
Out[13]= 26924456
This is because we use SparseArray:
In[15]:= {tsparse,ByteCount[tsparse]}
Out[15]= {SparseArray[<326766>,{9429,2052}],12103816}
So, our table takes only about 12 MB of memory. We can compare it with our more general function:
In[18]:=
(t = readTable["C:\\Users\\Archie\\Downloads\\ExampleFile\\ExampleFile.txt"]);//Timing
Out[18]= {38.516,Null}
The result is the same, once we convert the sparse table back to a normal one:
In[20]:= Normal@tsparse==t
Out[20]= True
while the normal table takes up vastly more space (it seems that ByteCount over-counts the occupied memory some 3-4 times, but the real difference is still at least an order of magnitude):
In[21]:= ByteCount[t]
Out[21]= 619900248