OpenMPI: Permission denied error when trying to use mpirun

I would like to print "hello world" using MPI across different Google Cloud compute instances, with the following code:

from mpi4py import MPI

size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

print("Hello, World! I am process/rank {} of {} on {}.\n".format(rank, size, name))    

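The problem is that, even though I can easily ssh between all of these instances, I get a permission denied error when I try to run my script. I use the following command to invoke it: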

mpirun --host localhost,instance_1,instance_2 python hello_world.py

I then get the following error message:

Permission denied (publickey).
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

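Additional information:

  • I installed Open MPI on all nodes
  • I used gcloud to log in from each instance to every other instance, so that Google set up all my ssh keys automatically
  • Instance type: n1-standard-1
  • Instance OS: Linux Debian (default)

Thank you for your help :-)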

New information:
(Thanks to @Zulan for pointing out that I should edit my previous post rather than create a new answer for new information.)

So, I tried to do the same thing with mpich instead of openmpi. However, I ran into a similar error message.

Command:

mpirun --host localhost,instance_1,instance_2 python hello_world.py

Error message:

Host key verification failed.

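I can ssh between my two instances, and the gcloud commands should have set up the ssh keys correctly and automatically. So, does anybody have an idea what the problem could be? I also checked the paths, the firewall rules, and my ability to write startup scripts in the tmp folder. Could someone please try to reproduce this problem? Also, should I raise this question with Google? (I have never done anything like that before, so I am quite unsure :S) Thanks for your help :)

1 Answer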

So I finally found the solution. Wow, this problem was driving me nuts.

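It turned out that I needed to generate the ssh keys manually to get the script to work. I have no idea why, since the Google service had already set up keys through gcloud compute ssh, but hey, it worked :)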
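A likely explanation, stated here as an assumption rather than something confirmed in this thread: mpirun starts its remote daemons through plain, non-interactive `ssh`, while `gcloud compute ssh` passes its own key file, so a working `gcloud compute ssh` does not guarantee that plain `ssh` works. The sketch below is one way to test what mpirun would see. The function names `batch_ssh_command` and `check_hosts` are made up for illustration; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import subprocess


def batch_ssh_command(host, timeout_s=5):
    """Build a non-interactive ssh command that runs `true` on the host."""
    return [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt for passwords or host-key confirmation
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        "true",
    ]


def check_hosts(hosts):
    """Return {host: (ok, stderr_text)} for a plain, prompt-free ssh to each host."""
    results = {}
    for host in hosts:
        proc = subprocess.run(
            batch_ssh_command(host), capture_output=True, text=True
        )
        results[host] = (proc.returncode == 0, proc.stderr.strip())
    return results
```

Calling `check_hosts(["instance_1", "instance_2"])` from the machine where mpirun is launched should surface the same kind of ssh errors as in the question ("Permission denied (publickey)" or "Host key verification failed") without involving MPI at all.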

The steps I took:

instance_1 $ ssh-keygen -t rsa
instance_1 $ cd .ssh
instance_1 $ cat id_rsa.pub >> authorized_keys
instance_1 $ gcloud compute copy-files id_rsa.pub 
instance_1 $ gcloud compute ssh instance_2

instance_2 $ cd .ssh
instance_2 $ cat id_rsa.pub >> authorized_keys
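The `cat id_rsa.pub >> authorized_keys` step above appends the key again every time it is repeated. A minimal Python sketch of the same step that skips keys already present and tightens file permissions might look like the following. The function name `authorize_key` and its parameters are hypothetical, and the permission values reflect the common sshd default (StrictModes), which is an assumption about the instance's configuration.

```python
import os
from pathlib import Path


def authorize_key(pub_key_path, ssh_dir):
    """Append a public key to ssh_dir/authorized_keys unless it is already there.

    Returns True if the key was added, False if it was already present.
    """
    ssh_dir = Path(ssh_dir)
    ssh_dir.mkdir(mode=0o700, exist_ok=True)
    key = Path(pub_key_path).read_text().strip()
    auth = ssh_dir / "authorized_keys"
    current = auth.read_text() if auth.exists() else ""
    if key in (line.strip() for line in current.splitlines()):
        return False
    with auth.open("a") as fh:
        # Make sure the new key starts on its own line.
        if current and not current.endswith("\n"):
            fh.write("\n")
        fh.write(key + "\n")
    # sshd commonly ignores authorized_keys if it is writable by others.
    os.chmod(auth, 0o600)
    return True
```

On instance_2 this would be called with the path of the copied `id_rsa.pub` and the user's `.ssh` directory, for example `authorize_key("id_rsa.pub", Path.home() / ".ssh")`.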

I am going to open another question to ask why I cannot use ssh instance_2 even though gcloud compute ssh instance_2 works. See: Difference between the commands "gcloud compute ssh" and "ssh"

