Why can't I run IPU programs as a non-root user in a Docker container?

I want to run the CNN training from Graphcore's examples repository as a non-root user inside Graphcore's TensorFlow 1.15 Docker image, but I get the following error:
2020-04-23 11:17:32.960014: I tensorflow/compiler/jit/xla_compilation_cache.cc:250] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
Saved checkpoint to ./logs/RN152_bs1x16p_GN32_16.16_v1.1.11_6LT/ckpt-0
2020-04-23 11:19:07.615030: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at xla_ops.cc:361 : Unknown: [Error][Build graph] could not get temporary file for model 'MappedCodelet_%%%%%%%%%%%%%%.cpp': Permission denied
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: [Error][Build graph] could not get temporary file for model 'MappedCodelet_%%%%%%%%%%%%%%.cpp': Permission denied
	 [[{{node cluster}}]]

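When I run the program as root it works fine, but as soon as I create a new user and run it as that user, it starts throwing this error. Does this mean Graphcore's Docker images only work for the root user?
1 Answer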


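Non-root users can run IPU programs too. The problem is that switching users inside a running Docker container (as in any Ubuntu-based environment) resets the environment variables, and those variables hold important configuration settings needed to attach to and run programs on IPUs. You can avoid this by managing the user in your Dockerfile. Here is an example snippet (where examples is a clone of https://github.com/graphcore/examples/):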

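FROM graphcore/tensorflow:1
# Use a UTF-8 locale inside the container
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
# Create the user non-interactively so the build does not stop at a password prompt
RUN adduser --disabled-password --gecos "" [username]
# Copy the local clone of the examples repository and give the new user ownership of it
COPY examples /examples
RUN chown -R [username] /examples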
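After the chown step, the new user should own /examples. If you want to confirm this from inside a container started with -u [username], a minimal Python sketch like the following could help. The helper names owned_by and user_name are hypothetical, /examples is assumed to be where the Dockerfile put the repository, and the script only inspects the effective user ID and file ownership.

```python
import os
import pwd


def owned_by(path: str, uid: int) -> bool:
    """Return True if `path` is owned by the given numeric user ID."""
    return os.stat(path).st_uid == uid


def user_name(uid: int) -> str:
    """Look up the login name for `uid`; fall back to the number if it has no passwd entry."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return str(uid)


if __name__ == "__main__":
    uid = os.geteuid()
    print(f"effective user: {user_name(uid)} (uid {uid})")
    if uid == 0:
        print("warning: this process is running as root")
    # Assumed location of the copied examples repository.
    target = "/examples"
    if os.path.exists(target):
        print(f"{target} owned by this user: {owned_by(target, uid)}")
    else:
        print(f"{target} does not exist in this container")
```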

You can then build the image with:

docker image build . -t graphcore-examples 

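You now have three options for running the CNN training as a non-root user:
  1. Run the CNN training directly:
gc-docker -- -ti -u [username] graphcore-examples python3 /examples/applications/tensorflow/cnns/training/train.py
  2. Start the container as the non-root user in a bash shell, then run the training from there:
gc-docker -- -ti -u [username] graphcore-examples
$ python3 /examples/applications/tensorflow/cnns/training/train.py
  3. Start the container as root, then preserve the environment variables when switching users (leave out the "-" login flag: util-linux su ignores --preserve-environment when it starts a login shell):
gc-docker -- -ti graphcore-examples
$ su --preserve-environment [username]
$ python3 /examples/applications/tensorflow/cnns/training/train.py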
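If you launch these runs from a script, the three options reduce to the argument lists below. This is a minimal sketch that only builds and prints the commands. It assumes, as the commands above suggest, that gc-docker forwards everything after -- to docker run and that the image's default command is a shell; the user name alice is hypothetical. To actually execute one, pass the list to subprocess.run(argv, check=True).

```python
import shlex
from typing import List, Optional

IMAGE = "graphcore-examples"
TRAIN = ["python3", "/examples/applications/tensorflow/cnns/training/train.py"]


def gc_docker_argv(username: Optional[str], command: Optional[List[str]] = None) -> List[str]:
    """Argument list for gc-docker; everything after '--' is assumed to go to `docker run`."""
    argv = ["gc-docker", "--", "-ti"]
    if username is not None:
        argv += ["-u", username]  # run the container's process as this user (options 1 and 2)
    argv.append(IMAGE)
    if command:
        argv += command  # option 1; without a command the image's default (a shell) starts
    return argv


def su_argv(username: str, command: List[str]) -> List[str]:
    """Option 3: switch user inside a root container while keeping the environment.

    No '-' / --login here: util-linux su ignores --preserve-environment for login shells.
    """
    return ["su", "--preserve-environment", username, "-c",
            " ".join(shlex.quote(arg) for arg in command)]


if __name__ == "__main__":
    user = "alice"  # hypothetical; use the user created in your Dockerfile
    for argv in (gc_docker_argv(user, TRAIN), gc_docker_argv(user), gc_docker_argv(None)):
        print(" ".join(shlex.quote(arg) for arg in argv))
    print(" ".join(shlex.quote(arg) for arg in su_argv(user, TRAIN)))
```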

I recommend option 1 or 2 where possible. For more information about the gc-docker command-line tool, see Graphcore's gc-docker documentation.
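To see which variables a user switch actually drops, one approach is to save the output of env before and after switching (for example env > /tmp/root_env.txt as root and env > /tmp/user_env.txt as the new user; the file names are arbitrary) and compare the two files. A minimal sketch, assuming plain NAME=value lines with no multi-line values; expect HOME, USER and LOGNAME to differ legitimately.

```python
import sys
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse `env` output into a dict; lines without '=' are skipped, multi-line values are not handled."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:
            result[name] = value
    return result


def missing_or_changed(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Names set in `before` that are absent or have a different value in `after`."""
    return sorted(name for name, value in before.items() if after.get(name) != value)


if __name__ == "__main__":
    # Usage: python3 compare_env.py /tmp/root_env.txt /tmp/user_env.txt
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as f_before, open(sys.argv[2]) as f_after:
            before = parse_env(f_before.read())
            after = parse_env(f_after.read())
        for name in missing_or_changed(before, after):
            print(name)
```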

Content provided by Stack Overflow.