设置PyCharm远程conda解释器

16
我正在尝试在MacOS Mojave上为Anaconda的PyCharm 2019.1.2 Pro设置远程conda解释器,但无法使其正常工作。我的现有远程conda环境(v4.5.12)运行在一个Ubuntu 16 EC2机器上,该机器是从Amazon的Deep Learning AMI实例化的。
我尝试了设置ssh解释器,并将其指向:/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/python这是我的conda环境。然后我尝试在此解释器上运行简单的Tensorflow GPU测试,并收到以下消息,强烈表明环境未被激活:(服务器的IP地址和公司名称被故意隐藏)。
ssh://ubuntu@xx.xx.xx.xx:22/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/python -u /home/ubuntu/company/DeepLearning_copy/apps/test_gpu.py
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/company/DeepLearning_copy/apps/test_gpu.py", line 1, in <module>
    import tensorflow as tf
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.10.0: cannot open shared object file: No such file or directory


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

Process finished with exit code 1

当通过SSH进入服务器并运行conda activate tensorflow_p36,然后运行python gpu_test.py时,代码可以完美运行。

我希望有任何解决方法来允许使用现有的远程conda环境进行远程调试。 同时,我已经向JetBrains的问题反馈页面Anaconda社区组提出了问题。

编辑:请参见JetBrains问题页面中的一个潜在解决方法。


请提供上下文,说明您尝试了什么,哪些不起作用。 - Diane M
强烈建议检查环境是否已激活。回溯信息表明使用了虚拟环境中的包,即错误是由venv的Tensorflow引起的。看起来错误出现在Tensorflow而不是Python安装中。您可能遇到了Tensorflow / cuda兼容性问题,就像其他一些用户一样(https://github.com/tensorflow/tensorflow/issues/26209)。 - Diane M
感谢@ArthurHavlicek。我相信这种行为与设置PATH方式一致,就像这样,但不运行source activate tensorflow_p36 - Assif
我还没有验证这一点,但我怀疑(因为您正在使用AWS AMI),这是因为AWS编译了一个优化版本的Tensorflow,在您第一次conda activate您的环境时安装(例如conda activate tensorflow_p36)。也许您可以尝试从pip重新安装tensorflow-gpu并尝试? - tauculator
4个回答

2

你可以做以下几件事情:

  1. 进入“运行/调试配置”
  2. 在“环境”下,你可以看到“环境变量”
  3. 你需要设置正确的cuda路径。在我的情况下是:“LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64”

我也很失望,这不是JetBrains团队默认完成的。


0

不需要指定“python”路径,我可以像这样指定“activate”路径:

ssh [host] "source ~/anaconda3/bin/activate [name of conda env] ; cd [pick a dir] ; [command]"

对于 [command],尝试使用 "conda env list" 命令查看已激活的环境。或者您可以执行 "python foo.py" 命令。

您可能需要调整路径 "~/anaconda3/bin/activate"。


-1

楼主,可能是有人在你的环境中做了一些事情,导致CUDA安装出现问题,就像其他几个人提到的那样。

我刚在AWS上配置了一个新的深度学习AMI实例 - 这对你来说是一个可行的选择吗?

无论如何,在通过ssh连接到(新配置的)服务器后,我执行了以下步骤:

初始激活

$ conda activate tensorflow_p36
WARNING: First activation might take some time (1+ min).
Installing TensorFlow optimized for your Amazon EC2 instance......
Env where framework will be re-installed: tensorflow_p36
Instance p2.xlarge is identified as a GPU instance, removing tensorflow-serving-cpu
Installation complete.

场景1:在tensorflow_p36 conda环境中运行GPU测试:

按照OP的场景执行以下操作,以确保Tensorflow正常工作。

$ python
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> # Creates a graph.
... a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
>>> b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
>>> c = tf.matmul(a, b)
>>> # Creates a session with log_device_placement set to True.
... sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
>>> # Runs the op.
... print(sess.run(c))
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
 [49. 64.]]

场景2:停用环境,并调用与环境内相同的python可执行文件。

这应该与配置远程解释器使用特定的python解释器相同。请注意,在sess = tf.Session(...)之后有更多的输出,但仍然可以正常运行。

$ conda deactivate
$ /home/ubuntu/anaconda3/envs/tensorflow_p36/bin/python

Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> # Creates a graph.
... a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
>>> b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
>>> c = tf.matmul(a, b)
>>> # Creates a session with log_device_placement set to True.
... sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-05-31 07:14:23.840474: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-31 07:14:23.841300: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55ec160ca020 executing computations on platform CUDA. Devices:
2019-05-31 07:14:23.841334: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-05-31 07:14:23.843647: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300060000 Hz
2019-05-31 07:14:23.843845: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55ec16131af0 executing computations on platform Host. Devices:
2019-05-31 07:14:23.843870: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-31 07:14:23.844965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.11GiB
2019-05-31 07:14:23.844992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-31 07:14:23.845991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-31 07:14:23.846013: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-31 07:14:23.846020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-31 07:14:23.846577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2019-05-31 07:14:23.847176: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7

>>> # Runs the op.
... print(sess.run(c))
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:14:25.478310: I tensorflow/core/common_runtime/placer.cc:1059] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:14:25.478383: I tensorflow/core/common_runtime/placer.cc:1059] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:14:25.478413: I tensorflow/core/common_runtime/placer.cc:1059] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
[49. 64.]]

场景3:尝试使用特定的conda环境解释器作为Jetbrains PyCharm中的远程解释器,在PyCharm Python控制台中使用

请注意,输出基本与上述场景2相同,但Tensorflow GPU测试正常工作,不会抛出任何错误。

ssh://ubuntu@XX.XX.XX.XX:22/home/ubuntu/anaconda3/envs/tensorflow_p36/bin/python -u /home/ubuntu/.pycharm_helpers/pydev/pydevconsole.py --mode=server

Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.4.0 -- An enhanced Interactive Python. Type '?' for help.
PyDev console: using IPython 6.4.0
Python 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0] on linux

import tensorflow as tf
# Creates a graph.
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))
2019-05-31 07:17:03.883169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-31 07:17:03.883577: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55be28eef280 executing computations on platform CUDA. Devices:
2019-05-31 07:17:03.883609: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2019-05-31 07:17:03.886035: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300060000 Hz
2019-05-31 07:17:03.886752: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55be28f56d50 executing computations on platform Host. Devices:
2019-05-31 07:17:03.886777: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-05-31 07:17:03.886983: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 508.38MiB
2019-05-31 07:17:03.887009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-05-31 07:17:03.887658: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-31 07:17:03.887681: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-05-31 07:17:03.887697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-05-31 07:17:03.887881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 283 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2019-05-31 07:17:03.889133: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:17:03.890673: I tensorflow/core/common_runtime/placer.cc:1059] MatMul: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:17:03.890718: I tensorflow/core/common_runtime/placer.cc:1059] a: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2019-05-31 07:17:03.890750: I tensorflow/core/common_runtime/placer.cc:1059] b: (Const)/job:localhost/replica:0/task:0/device:GPU:0
[[22. 28.]
[49. 64.]]

谢谢!不幸的是,当我停用并尝试使用该环境运行时,它无法工作(我得到与问题中相同的错误)。因此,只有在激活环境时才能运行,而在未激活时无法运行。 - Assif
@Assif - 如果您提供了一个新实例,它是否有效?在另一个环境中,我遇到了与您相同的问题,但在新提供的实例上一切正常运行。您是否在提供后更改了实例类型,例如p2.xlarge -> 非GPU实例?我怀疑这可能会影响CUDA安装... - tauculator
谢谢@user5042861,我在实例创建后没有更改实例类型。当我创建一个新的实例时,问题仍然存在。 - Assif

-2

我认为这是一个CUDA错误。CUDA没有正确配置。你是否在使用tensorflow-gpu?


谢谢!我确信环境已经设置好了,有两个原因:(1)它是通过AWS的DL AMI设置的(2)当通过SSH登录到服务器时,运行“conda activate tensorflow_p36”,然后运行“python gpu_test.py”时,代码可以完美地运行。 - Assif
确实,我正在使用tensorflow-gpu。 - Assif
所以,我确定这是一个错误,它是CUDA错误。请重新配置它。 - Dhruv Rajkotia
只有在我通过 PyCharm SSH 解释器运行脚本时才会出现此问题。如果我 SSH 进入服务器,激活环境,然后运行脚本 - 它就可以完美运行。因此,我确信环境已经正确配置,但是在远程脚本执行时,PyCharm 没有激活它。 - Assif

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接