My input dataset is about 150G. I am setting:
--conf spark.cores.max=100
--conf spark.executor.instances=20
--conf spark.executor.memory=8G
--conf spark.executor.cores=5
--conf spark.driver.memory=4G
But because the data is not evenly distributed across the executors, I keep running into this error:
Container killed by YARN for exceeding memory limits. 9.0 GB of 9 GB physical memory used
Here are my questions:
1. Did I not allocate enough memory in the first place? I figured 20 * 8G > 150G, but a perfectly even distribution is hard to achieve, so some executors will be overloaded.
2. I am thinking about repartitioning the input DataFrame. How do I determine how many partitions to set? Is higher always better?
3. The error says "9 GB physical memory used", but I only set executor memory to 8G. Where does the extra 1G come from?
Thank you!
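
For reference, here is a rough sketch of the arithmetic behind questions 2 and 3. It rests on assumptions that are not stated in the post: that the executor memory overhead is left at Spark's documented default (10% of executor memory, with a 384 MiB minimum), that YARN rounds each container request up to a multiple of 1024 MiB, and that roughly 128 MiB per partition is a reasonable target. The function names `yarn_container_mb` and `estimate_partitions` are hypothetical helpers, not part of Spark or YARN.

```python
import math


def yarn_container_mb(executor_memory_mb, overhead_factor=0.10,
                      min_overhead_mb=384, yarn_increment_mb=1024):
    """Estimate the container size YARN grants for one executor.

    Assumes the default overhead rule (max of 384 MiB and 10% of the
    executor memory) and that YARN rounds the request up to a multiple
    of yarn_increment_mb. Both are assumptions about this cluster.
    """
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_factor))
    requested_mb = executor_memory_mb + overhead_mb
    return math.ceil(requested_mb / yarn_increment_mb) * yarn_increment_mb


def estimate_partitions(input_bytes, target_partition_bytes=128 * 1024**2,
                        total_cores=100):
    """Rule-of-thumb partition count: input size / target partition size,
    rounded up to a multiple of the total core count so that every core
    gets a similar number of tasks. The 128 MiB target is an assumption.
    """
    by_size = math.ceil(input_bytes / target_partition_bytes)
    return math.ceil(by_size / total_cores) * total_cores


if __name__ == "__main__":
    # 8G executor memory plus overhead, rounded up by YARN
    print(yarn_container_mb(8 * 1024))
    # 150G of input split into ~128 MiB partitions across 100 cores
    print(estimate_partitions(150 * 1024**3))
```

Under these assumptions, an 8G executor would request about 8192 + 819 MiB, which rounds up to a 9 GB container, and that would match the limit in the error message. The partition estimate is only a starting point: more partitions reduce per-task memory pressure, but very small partitions add scheduling overhead, and a skewed key can still overload a single task regardless of the count.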