TensorFlow can only see XLA_GPU devices and cannot use them


I have a machine with 8 GPUs (4x GTX 1080 Ti with 11 GB of memory each and 4x RTX 1080), but I cannot get tensorflow to use them correctly (or at all).

When I run the following:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

it prints

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 5295519098812813462
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 12186007115805339517
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:1"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 17706271046686153881
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:2"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 14710290295129432533
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:3"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 1381213064943868400
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:4"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 12093982778662340719
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:5"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 682960671898108683
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:6"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 9901240111105546679
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_GPU:7"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 8442134369143872649
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 1687638086072792879
physical_device_desc: "device: XLA_CPU device"
].

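If I try to use the GPUs, nvidia-smi shows them as allocated, but their utilization stays at 0%, and the speed of the job suggests that tensorflow is running on the CPU only. On other machines with the same setup, the device list also contains '/device:GPU:2' alongside '/device:XLA_GPU:2' (for example), and tensorflow uses those GPUs without any problem.
I have already looked at similar questions and their solutions, but none of them seems to solve the problem.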
1 Answer


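Most likely you have an incompatible CUDA version installed. If you installed tensorflow with pip, check https://www.tensorflow.org/install/gpu for the CUDA version (and the cuDNN version) that matches your tensorflow version, and make sure the matching versions of tensorflow, CUDA, and cuDNN are installed. Alternatively, you can build tensorflow from source, but I have less experience with that, so you will have to Google it yourself :) Good luck!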
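As a rough way to confirm the symptom before reinstalling anything, the sketch below classifies the device types that list_local_devices() reports. The helper names diagnose and report are hypothetical, not part of tensorflow, and the interpretation of an "XLA_GPU only" result as a CUDA/cuDNN mismatch is an assumption that follows the suggestion above rather than something tensorflow reports itself.

```python
def diagnose(device_types):
    """Classify device_type strings as reported by list_local_devices().

    Hypothetical helper for illustration only:
      "ok"       -> at least one regular GPU device is registered
      "xla_only" -> XLA_GPU devices exist but no regular GPU device,
                    which is the situation described in the question
      "no_gpu"   -> neither kind of GPU device is present
    """
    types = set(device_types)
    if "GPU" in types:
        return "ok"
    if "XLA_GPU" in types:
        return "xla_only"
    return "no_gpu"


def report():
    """Print the tensorflow version, its CUDA build flag, and the diagnosis.

    Requires tensorflow to be installed; it is imported here so that
    diagnose() can be used without it.
    """
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    device_types = [d.device_type for d in device_lib.list_local_devices()]
    print("tensorflow version:", tf.__version__)
    print("built with CUDA:", tf.test.is_built_with_cuda())
    print("diagnosis:", diagnose(device_types))
```

Calling report() on the affected machine and on a working one gives a quick comparison. If the result is "xla_only", the printed tensorflow version is the one to look up on the install page linked above when checking which CUDA and cuDNN versions it expects.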

