tensorflow-gpu: Failed to get convolution algorithm

3

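I am trying to build a convolutional neural network to classify Microsoft's Cats and Dogs dataset. I am using tensorflow-gpu 1.12.0 in a Jupyter notebook, with Anaconda on Windows 10. My GPU is a GTX 1080. I have installed CUDA and cuDNN, I am fairly sure I set them up correctly, and I have checked the versions. Here is my code (in Jupyter it is split across several cells).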

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D
import pickle


import sys
print(sys.executable)
print(tf.__version__)


gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction = 0.4)
session = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
print('GPU Settings set')


X = pickle.load(open('X.pickle','rb')) # Brings in the "pictures" of the training set
y = pickle.load(open('y.pickle','rb')) # Brings in the answers


X = X/255.0 # Scales the pixel values so each number is between 0 and 1

print('Data Loaded')

model = Sequential()

model.add(Conv2D(64, (3,3), input_shape = X.shape[1:]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Conv2D(64, (3,3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Flatten())
model.add(Dense(64))

model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss="binary_crossentropy", optimizer='adam', metrics = ['accuracy'])

model.fit(X, y, batch_size=25, epochs=3, validation_split=0.1)

I get this error:

Train on 22451 samples, validate on 2495 samples
Epoch 1/3
---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<ipython-input-6-9cef6147c3c5> in <module>
     17 model.compile(loss="binary_crossentropy", optimizer='adam', metrics = ['accuracy'])
     18 
---> 19 model.fit(X, y, batch_size=25, epochs=3, validation_split=0.1)

~\Anaconda3\envs\learning\lib\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, max_queue_size, workers, use_multiprocessing, **kwargs)
   1637           initial_epoch=initial_epoch,
   1638           steps_per_epoch=steps_per_epoch,
-> 1639           validation_steps=validation_steps)
   1640 
   1641   def evaluate(self,

~\Anaconda3\envs\learning\lib\site-packages\tensorflow\python\keras\engine\training_arrays.py in fit_loop(model, inputs, targets, sample_weights, batch_size, epochs, verbose, callbacks, val_inputs, val_targets, val_sample_weights, shuffle, initial_epoch, steps_per_epoch, validation_steps)
    213           ins_batch[i] = ins_batch[i].toarray()
    214 
--> 215         outs = f(ins_batch)
    216         if not isinstance(outs, list):
    217           outs = [outs]

~\Anaconda3\envs\learning\lib\site-packages\tensorflow\python\keras\backend.py in __call__(self, inputs)
   2984 
   2985     fetched = self._callable_fn(*array_vals,
-> 2986                                 run_metadata=self.run_metadata)
   2987     self._call_fetch_callbacks(fetched[-len(self._fetches):])
   2988     return fetched[:len(self.outputs)]

~\Anaconda3\envs\learning\lib\site-packages\tensorflow\python\client\session.py in __call__(self, *args, **kwargs)
   1437           ret = tf_session.TF_SessionRunCallable(
   1438               self._session._session, self._handle, args, status,
-> 1439               run_metadata_ptr)
   1440         if run_metadata:
   1441           proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

~\Anaconda3\envs\learning\lib\site-packages\tensorflow\python\framework\errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    526             None, None,
    527             compat.as_text(c_api.TF_Message(self.status.status)),
--> 528             c_api.TF_GetCode(self.status.status))
    529     # Delete the underlying status object from memory otherwise it stays alive
    530     # as there is a reference to status from this from the traceback due to

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[{{node conv2d_3/Conv2D}} = Conv2D[T=DT_FLOAT, _class=["loc:@training_2/Adam/gradients/conv2d_3/Conv2D_grad/Conv2DBackpropFilter"], data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](training_2/Adam/gradients/conv2d_3/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer, conv2d_3/Conv2D/ReadVariableOp)]]
     [[{{node loss_2/activation_7_loss/broadcast_weights/assert_broadcastable/AssertGuard/Assert/Switch/_329}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_321_l...ert/Switch", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

You may need to add from tensorflow.keras.layers import LSTM, but have you tried running it without per_process_gpu_memory_fraction? Comment out lines 8-10 and check whether it runs, although it may be slower. - Suleiman
@Suleiman I tried it without per_process_gpu_memory_fraction and LSTM, and also with per_process_gpu_memory_fraction and LSTM, but I still get the same error. - Zachary Perkins
1
Have you tried config.gpu_options.allow_growth = True? That solved the problem for me. - Mastiff
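
For reference, the allow_growth suggestion above could look roughly like the following in TensorFlow 1.x. This is a minimal sketch, not a confirmed fix for this setup: it assumes the TF 1.x session API that the question already uses, and it replaces the per_process_gpu_memory_fraction cell rather than being combined with it. It needs a TensorFlow 1.x install to run.

```python
import tensorflow as tf  # assumes TensorFlow 1.x (the question uses 1.12.0)

# Ask the GPU allocator to grow on demand instead of
# reserving a fixed fraction of GPU memory up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Create the session and register it with Keras, so that
# model.fit() runs inside this session and not a default one.
session = tf.Session(config=config)
tf.keras.backend.set_session(session)
```

Run this in a fresh kernel, before any other cell creates a session or builds a model. A GPU that was already initialized in the same process will generally keep its earlier settings.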
1 Answer

0
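Hopefully this link solves your problem: the cuDNN version you installed is not compatible with the cuDNN version that TensorFlow was compiled against.
Copying in a fresh copy of the cuDNN library should fix it.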

That command did not work for me, and I made sure I am using the correct cuDNN version. - Zachary Perkins
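
One way to double-check which cuDNN library is actually visible to the process is to scan the directories on PATH for the cuDNN DLL. The sketch below is only an illustration: find_cudnn_libraries is a hypothetical helper, not part of TensorFlow or cuDNN, and it assumes the Windows naming scheme cudnn64_<major>.dll (for example cudnn64_7.dll for cuDNN 7).

```python
import glob
import os


def find_cudnn_libraries(search_path=None):
    """Return cuDNN DLLs found in the directories of a PATH-style string.

    Hypothetical helper for illustration. Results are listed in PATH order.
    """
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    found = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue
        # Escape the directory so characters like [ ] are not treated as wildcards.
        pattern = os.path.join(glob.escape(directory), "cudnn64_*.dll")
        found.extend(sorted(glob.glob(pattern)))
    return found


if __name__ == "__main__":
    for path in find_cudnn_libraries():
        print(path)
```

If this prints more than one file, or a file whose major version differs from the one your TensorFlow build expects, an older copy earlier on PATH may be shadowing the version you installed. PATH is only one of the places Windows searches for DLLs, so treat the output as a hint and not as proof.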
