I am trying to load data into h2o that is larger than the available memory.
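The h2o blog says: "A note on bigger data and GC: we do a user-mode swap-to-disk when the Big Data held in the Java heap exceeds physical DRAM. We won't die with a GC death spiral, but we will degrade to out-of-core speeds. We'll go as fast as the disk will allow. I've personally tested loading a 12 GB dataset into a 2 GB (32-bit) JVM; it took about 5 minutes to load the data, and another 5 minutes to run a logistic regression."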
Here is the R code I use to connect to h2o 3.6.0.8:
h2o.init(max_mem_size = '60m') # allotting 60 MB to h2o; R is running on a machine with 8 GB of RAM
which gives:
java version "1.8.0_65"
Java(TM) SE Runtime Environment (build 1.8.0_65-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.65-b01, mixed mode)
.Successfully connected to http://127.0.0.1:54321/
R is connected to the H2O cluster:
H2O cluster uptime: 2 seconds 561 milliseconds
H2O cluster version: 3.6.0.8
H2O cluster name: H2O_started_from_R_RILITS-HWLTP_tkn816
H2O cluster total nodes: 1
H2O cluster total memory: 0.06 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 2
H2O cluster healthy: TRUE
Note: As started, H2O is limited to the CRAN default of 2 CPUs.
Shut down and restart H2O as shown below to use all your CPUs.
> h2o.shutdown()
> h2o.init(nthreads = -1)
IP Address: 127.0.0.1
Port : 54321
Session ID: _sid_b2e0af0f0c62cd64a8fcdee65b244d75
Key Count : 3
I then tried to load a 169 MB CSV file into h2o:
dat.hex <- h2o.importFile('dat.csv')
which fails with the error:
Error in .h2o.__checkConnectionHealth() :
H2O connection has been severed. Cannot connect to instance at http://127.0.0.1:54321/
Failed to connect to 127.0.0.1 port 54321: Connection refused
This looks like an out-of-memory error.
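A minimal sketch of how the out-of-memory reading could be checked: restart H2O with a heap well above the file size and repeat the same import. The value "4g" is an assumption chosen only for illustration (anything comfortably larger than the 169 MB file that the machine can spare would do), and this only tests whether heap size is the cause; it does not exercise the swap-to-disk path described in the blog quote.

```r
library(h2o)

# Size of the CSV on disk, for comparison with the heap given to H2O.
csv_path <- "dat.csv"
csv_mb <- file.size(csv_path) / 1024^2
cat(sprintf("CSV size: %.0f MB\n", csv_mb))

# Assumption: "4g" is an arbitrary heap size picked to be far above the file size.
# nthreads = -1 uses all cores, as suggested in the startup note.
h2o.init(nthreads = -1, max_mem_size = "4g")

# Same import call as before; if this succeeds, the 60 MB heap was the likely culprit.
dat.hex <- h2o.importFile(path = csv_path)
print(dim(dat.hex))        # rows and columns of the parsed frame
h2o.clusterInfo()          # prints cluster details, including total memory

h2o.shutdown(prompt = FALSE)
```

If the import goes through with the larger heap but the connection is severed again at 60 MB, that would be consistent with the JVM running out of memory rather than with a problem in the file or the connection.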
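Question: if H2O promises to load a dataset larger than its memory capacity (the swap-to-disk mechanism described in the blog quote above), is this the correct way to load the data?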