Tensorflow Object Detection inference slow on CPU

Question

Tags: performance, tensorflow, cpu, object-detection

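System information

What is the top-level directory of the model you are using: object_detection/ssd_inception_v2
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): 1.2.1
Bazel version (if compiling from source): none
CUDA/cuDNN version: cuda 8.0
GPU model and memory: Quadro M6000 24GB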

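After training the ssd_inception_v2 model on my custom dataset, I want to use it for inference. Since inference will later run on a device without a GPU, I am switching to CPU-only inference. I adapted object_detection_tutorial.ipynb to measure inference time and ran the following code on a series of frames from a video.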

with detection_graph.as_default():
  with tf.Session(graph=detection_graph) as sess:
    while success:
      #print(str(datetime.datetime.now().time()) + " " + str(count))
      #read image
      success,image = vidcap.read()
      if not success:
        # no more frames: image is None here, so stop before cv2.resize fails
        break
      #resize image
      image = cv2.resize(image , (711, 400))
      # crop the width to columns 11..690, giving a 680 x 400 image
      image = image[ : , 11:691]
      # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
      image_np_expanded = np.expand_dims(image, axis=0)
      #print(image_np_expanded.shape)
      image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
      # Each box represents a part of the image where a particular object was detected.
      boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
      # Each score represents the level of confidence for each of the objects.
      # Score is shown on the result image, together with the class label.
      scores = detection_graph.get_tensor_by_name('detection_scores:0')
      classes = detection_graph.get_tensor_by_name('detection_classes:0')
      num_detections = detection_graph.get_tensor_by_name('num_detections:0')
      before = datetime.datetime.now()
      # Actual detection.
      (boxes, scores, classes, num_detections) = sess.run(
          [boxes, scores, classes, num_detections],
          feed_dict={image_tensor: image_np_expanded})
      print("This took : " + str(datetime.datetime.now() - before))  
      vis_util.visualize_boxes_and_labels_on_image_array(
          image,
          np.squeeze(boxes),
          np.squeeze(classes).astype(np.int32),
          np.squeeze(scores),
          category_index,
          use_normalized_coordinates=True,
          line_thickness=8)

      #cv2.imwrite("converted/frame%d.jpg" % count, image)     # save frame as JPEG file
      count += 1

With the following output:
This took : 0:00:04.289925
This took : 0:00:00.909071
This took : 0:00:00.917636
This took : 0:00:00.908391
This took : 0:00:00.896601
This took : 0:00:00.908698
This took : 0:00:00.890018
This took : 0:00:00.896373
.....
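
To put these timings next to the 10-20 fps target mentioned in the question, the arithmetic can be sketched as follows. This is an illustration added for clarity, not code from the original post: the helper names `parse_timedelta` and `steady_state_stats` are hypothetical, and treating only the first run as warm-up is an assumption based on the 4.29 s outlier.

```python
from datetime import timedelta

# Per-frame timings copied from the output above (h:mm:ss.ffffff).
reported = [
    "0:00:04.289925",
    "0:00:00.909071",
    "0:00:00.917636",
    "0:00:00.908391",
    "0:00:00.896601",
    "0:00:00.908698",
    "0:00:00.890018",
    "0:00:00.896373",
]


def parse_timedelta(text):
    """Parse the 'h:mm:ss.ffffff' form that str(timedelta) prints."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=float(seconds))


def steady_state_stats(timings, warmup=1):
    """Return (mean latency in seconds, frames per second), skipping warm-up runs."""
    steady = [parse_timedelta(t).total_seconds() for t in timings[warmup:]]
    mean_latency = sum(steady) / len(steady)
    return mean_latency, 1.0 / mean_latency


mean_latency, fps = steady_state_stats(reported)

# Speed-up factors needed to reach 10 fps and 20 fps respectively.
speedup_10fps = 10 / fps
speedup_20fps = 20 / fps

print("mean latency: %.3f s, throughput: %.2f fps" % (mean_latency, fps))
print("speed-up needed: %.1fx to %.1fx" % (speedup_10fps, speedup_20fps))
```

With the first run excluded, this works out to about 0.90 s per frame, or about 1.1 fps, so reaching 10 to 20 fps would take roughly a 9x to 18x speed-up.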

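Of course, 900 ms per image is not fast enough for video processing. After reading a lot of threads, I see two possible ways to improve this:

Graph Transform Tool: to obtain a faster frozen inference graph. (I am hesitant to try this because, as far as I understand, I would have to build TF from source, and I am usually happy with my current installation.)
Replace feeding: it seems that feed_dict={image_tensor: image_np_expanded} is not a good way to provide data to the TF graph. The QueueRunner object could help here.

So my question is whether the two improvements above have the potential to bring inference up to real-time use (10-20 fps), or whether I am on the wrong track and should try something else. Any suggestions are welcome.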

- SaiBot

Have you tried the Mobilenet feature extractor for the SSD model? - ITiger

Yes, that brought the time down to 700 ms, but it still does not meet my expectations. - SaiBot

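My situation is even worse: I am seeing 45 seconds per image with very similar code. - bw4sz

On the Tensorflow website I found a general performance guide: https://www.tensorflow.org/performance/performance_guide . When I have time, I will try the suggestions one by one and measure their effect on performance. If somebody has already done this, or gets there before me, I would be very interested in the results. - SaiBot

I also tried to replicate this idea, but performance was even worse. https://dev59.com/Hqbja4cB1Zd3GeqPbBlN - bw4sz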

1 Answer

- dragon7 · Accepted Answer

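Another option is to use a different inference toolkit, such as OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting it to Intermediate Representation (IR), pruning the graph, and fusing some operations into others while preserving accuracy. Then, at runtime, it uses vectorization.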

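Converting a Tensorflow model to OpenVINO is fairly straightforward unless you have fancy custom layers. A full tutorial on how to do it can be found here. Some snippets follow.

Install OpenVINO

The easiest way is to use pip. Alternatively, you can use this tool to find the best way for your case.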

pip install openvino-dev[tensorflow2]

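Use Model Optimizer to convert the SavedModel to IR format

Model Optimizer is a command-line tool that comes with the OpenVINO development package. It converts the Tensorflow model to IR, which is the default format for OpenVINO. You can also try FP16 precision, which should give better performance without a significant drop in accuracy (just change data_type). Run in the command line: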

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"

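Run inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you do not know which is the best choice for you, use AUTO.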

from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
# (input_image is assumed to be an array already matching the model's input shape)
result = compiled_model_ir([input_image])[output_layer_ir]
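
To compare this against the TensorFlow timings in the question on equal terms, the first call should be kept out of the measurement, since the question's own output shows a much slower first run. The following is a minimal sketch of such a timing harness; `benchmark`, `summarize`, and `fake_inference` are hypothetical names introduced here, and the stand-in workload only exists so the sketch runs without OpenVINO installed.

```python
import time


def benchmark(infer, warmup=1, runs=10):
    """Call infer() repeatedly and return the per-call latencies in seconds.

    The first `warmup` calls are executed but not recorded.
    """
    for _ in range(warmup):
        infer()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        infer()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Reduce a list of latencies to mean latency and frames per second."""
    mean = sum(latencies) / len(latencies)
    return {"mean_s": mean, "fps": 1.0 / mean, "runs": len(latencies)}


# Stand-in workload (an assumption for illustration only).
# With the snippet above, the callable would instead be something like:
#   lambda: compiled_model_ir([input_image])[output_layer_ir]
def fake_inference():
    time.sleep(0.01)


stats = summarize(benchmark(fake_inference, warmup=1, runs=5))
print("mean latency: %.4f s (%.1f fps over %d runs)"
      % (stats["mean_s"], stats["fps"], stats["runs"]))
```

Passing the same harness a callable that wraps the `sess.run(...)` call from the question would give numbers that are directly comparable.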

Disclaimer: I work on OpenVINO.