Meaning of the Apache Spark warning "Calling spill() on RowBasedKeyValueBatch"

Question

21

I am running a pyspark 2.2.0 job using Apache Spark in local mode and see the following warning:

WARN RowBasedKeyValueBatch: Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.

What causes this warning? Is it something I should care about, or can I safely ignore it?

- asmaier

3 Answers

5

I think this message is worse than a simple warning: it is close to being an error.

Take a look at the source code:

  /**
   * Sometimes the TaskMemoryManager may call spill() on its associated MemoryConsumers to make
   * space for new consumers. For RowBasedKeyValueBatch, we do not actually spill and return 0.
   * We should not throw OutOfMemory exception here because other associated consumers might spill
   */
  public final long spill(long size, MemoryConsumer trigger) throws IOException {
    logger.warn("Calling spill() on RowBasedKeyValueBatch. Will not spill but return 0.");
    return 0;
  }

Source: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/RowBasedKeyValueBatch.java

So I would say you are stuck in an endless loop of "a spill is required, but nothing actually gets spilled".
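To make the comment in that source snippet more concrete, here is a toy model in plain Python. It is not Spark's TaskMemoryManager; ToyConsumer, acquire and all the byte counts are hypothetical names and numbers made up for illustration. It only sketches the idea the comment describes: a memory manager asks its consumers to spill, one consumer logs a warning and frees nothing, and another consumer may still free enough space.

```python
class ToyConsumer:
    """Hypothetical stand-in for a Spark MemoryConsumer; holds `used` bytes."""

    def __init__(self, name, used, can_spill):
        self.name = name
        self.used = used
        self.can_spill = can_spill

    def spill(self, size):
        """Release up to `size` bytes and return how many were freed."""
        if not self.can_spill:
            # Mirrors the quoted spill(): log a warning and free nothing.
            print("WARN {}: will not spill but return 0".format(self.name))
            return 0
        freed = min(size, self.used)
        self.used -= freed
        return freed


def acquire(consumers, capacity, requested):
    """Ask consumers to spill, one after another, until `requested` bytes fit."""
    free = capacity - sum(c.used for c in consumers)
    for c in consumers:
        if free >= requested:
            break
        free += c.spill(requested - free)
    return min(free, requested)  # bytes actually granted


# Made-up numbers: 100 bytes total, 90 in use, a new request for 25.
batch = ToyConsumer("RowBasedKeyValueBatch", used=60, can_spill=False)
sorter = ToyConsumer("Sorter", used=30, can_spill=True)
granted = acquire([batch, sorter], capacity=100, requested=25)
```

In this toy run the batch logs the warning and keeps its 60 bytes, while the other consumer gives up 15 bytes so the request can be granted. Whether a real job recovers the same way depends on whether any other consumer is able to spill.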

- Romain Jouin

0

Adding to the above: I got this warning when running the jupyter/scipy-notebook Docker image (and then importing PySpark separately). Switching to the jupyter/pyspark-notebook image resolved the problem.

- thedatastrategist

- Fredz0r · Accepted Answer

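As indicated here, this warning means that your RAM is full and that part of its contents is being moved to disk.

See also the Spark FAQ:

Does my data need to fit in memory to use Spark?

No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on data of any size. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.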
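If the warning really does come from memory pressure, as this answer suggests, one thing to try in local mode is giving the driver JVM a larger heap. Below is a minimal sketch, assuming PySpark launches the JVM itself (a plain Python script or notebook) and reads the PYSPARK_SUBMIT_ARGS environment variable. The helper name build_pyspark_submit_args and the 4g value are hypothetical; pick a size that fits your machine.

```python
import os


def build_pyspark_submit_args(driver_memory="4g", master="local[*]"):
    """Build a value for the PYSPARK_SUBMIT_ARGS environment variable."""
    # The trailing "pyspark-shell" token is expected when PySpark starts the JVM.
    return "--master {} --driver-memory {} pyspark-shell".format(master, driver_memory)


# Set this before the first SparkSession or SparkContext is created:
# the JVM heap size is fixed when the JVM is launched.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_pyspark_submit_args("4g")
```

After this, creating the session as usual should pick up the larger driver heap. This does not guarantee that the warning disappears; it only removes one possible cause.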