与OpenCV相比,双线性插值速度极慢

6
template<typename T>
cv::Mat_<T> const bilinear_interpolation(cv::Mat_<T> const &src, cv::Size dsize,
                                     float dx, float dy)
{
    cv::Mat_<T> dst = dsize.area() == 0 ? cv::Mat_<T>(src.rows * dy, src.cols * dx) :
                                        cv::Mat_<T>(dsize);
  
    float const x_ratio = static_cast<float>((src.cols - 1)) / dst.cols;
    float const y_ratio = static_cast<float>((src.rows - 1)) / dst.rows;
    for(int row = 0; row != dst.rows; ++row)
    {
        int y = static_cast<int>(row * y_ratio);
        float const y_diff = (row * y_ratio) - y; //distance of the nearest pixel(y axis)
        float const y_diff_2 = 1 - y_diff;
        auto *dst_ptr = &dst(row, 0)[0];
        for(int col = 0; col != dst.cols; ++col)
        {
            int x = static_cast<int>(col * x_ratio);
            float const x_diff = (col * x_ratio) - x; //distance of the nearest pixel(x axis)
            float const x_diff_2 = 1 - x_diff;
            float const y2_cross_x2 = y_diff_2 * x_diff_2;
            float const y2_cross_x = y_diff_2 * x_diff;
            float const y_cross_x2 = y_diff * x_diff_2;
            float const y_cross_x = y_diff * x_diff;
            for(int channel = 0; channel != cv::DataType<T>::channels; ++channel)
            {
                *dst_ptr++ = y2_cross_x2 * src(y, x)[channel] +
                             y2_cross_x * src(y, x + 1)[channel] +
                             y_cross_x2 * src(y + 1, x)[channel] +
                             y_cross_x * src(y + 1, x + 1)[channel];
            }
        }
    }
    
    return dst;
}

这是双线性插值的实现,我用它将一张512*512像素的图像("lena.png")放大到2048*2048像素。这个过程花费了我0.195秒的时间,但OpenCV的cv::resize(非GPU版本)只需要0.026秒。我不知道我程序为什么这么慢(OpenCV比我快了近750%),我想看看OpenCV的resize源代码,但我找不到它的实现。
你有什么想法,为什么OpenCV的resize速度会这么快,或者我的双线性插值太慢了吗?
    {
        timeEstimate<> time;
        cv::Mat_<cv::Vec3b> const src = input;
        bilinear_interpolation(src, cv::Size(), dx, dy);
        std::cout << "bilinear" << std::endl;
    }

    {
        timeEstimate<> time;
        cv::Mat output = input.clone();
        cv::resize(input, output, cv::Size(), dx, dy, cv::INTER_LINEAR);
        std::cout << "bilinear cv" << std::endl;
    }

编译器:mingw4.6.2
操作系统:win7 64位
CPU:Intel® i3-2330M (2.2G)


1
可能cv::resize正在利用处理器指令集扩展,例如SSE,允许它并行运行多个乘法,例如。要自己执行此操作,您需要智能编译器优化或手动编写x86汇编语言。 编辑: 根据消息来源,GCC使用-O3标志启用矢量化。。(但是,-O3可能会导致非常奇怪的错误,因此不建议在一般情况下使用。) - user308323
谢谢,我最近会关注OpenCL,并希望由GPGPU开发的代码能够更具可移植性。 - StereoMatching
你可以通过考虑特定情况下的具体实现来加速它,比如典型的RGB整数图像,每个通道有8位深度。这种插值可以通过更少的乘法和更多的位运算来执行。虽然我没有适合在评论中使用的实现,但你可以在http://cboard.cprogramming.com/game-programming/19926-super-fast-bilinear-interpolation.html找到一个实现,我没有检查其正确性。 - mmgp
我研究了hqx https://github.com/Arcnor/hqx-java/tree/master/src/hqx中的类似代码,我需要将cv::Vec3b(或4b)转换为int,然后再将int转换回cv::Vec3b以保存或显示图像,这不是一个非常通用的解决方案,但值得一试。 - StereoMatching
3个回答

5

OpenCV的版本更快主要有两个方面:

  1. OpenCV将resize实现为“可分离操作”。即它分为两个步骤:先水平拉伸图像,再垂直拉伸。这种技术可以使用更少的算术运算来实现resize。

  2. 手工编写SSE优化。


我查看了第一种解决方案,但无法弄清如何将操作分为两个步骤。它可以用于一些矩阵乘法和DCT,但我不知道如何将其应用于双线性插值。 - StereoMatching

3

最近我在将双线性插值应用到一些基于CPU的图形代码中时,遇到了同样的问题。

首先,我使用以下配置运行了您的代码:

操作系统:虚拟机中的Xubuntu 20
编译器:gcc 9.3.0
OpenCV版本:4.2.0
CPU:i3-6100u(2.3 GHz)
源位图大小:512x512
目标位图大小:2048x2048

我发现您的代码需要92毫秒,而OpenCV只需要4.2毫秒。因此,与2012年您提出这个问题时相比,差异现在更大了。我猜测OpenCV自那时以来进行了更多的优化。

(此时,我切换到在Windows中使用Visual Studio 2013,构建为x64目标)。

将代码转换为使用定点算术可将时间缩短至30毫秒。定点算术很有帮助,因为它将数据保留为整数。输入和输出数据都是整数。将它们转换为浮点数再转回去是代价高昂的。如果我坚持使用GCC 9.3,我预计加速会更快,因为我通常发现它生成的代码比VS 2013更快。无论如何,以下是代码:

typedef union {
    unsigned c;
    struct { unsigned char b, g, r, a; };
} DfColour;

typedef struct _DfBitmap {
    int width, height;
    DfColour *pixels;
} DfBitmap;

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    // For every output pixel...
    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = srcYAndWeight >> 8;

        DfColour *dstPixel = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < dstW; x++, dstPixel++) {
            // Perform bilinear interpolation on 2x2 src pixels.

            int srcXAndWeight = (x * widthRatio) >> 8;
            int srcX = srcXAndWeight >> 8;

            unsigned r = 0, g = 0, b = 0;
            unsigned weightX2 = srcXAndWeight & 0xFF;
            unsigned weightX = 256 - weightX2;

            // Pixel 0,0
            DfColour *srcPixel = &srcRow[srcX];
            unsigned w = (weightX * weightY) >> 8;
            r += srcPixel->r * w;
            g += srcPixel->g * w;
            b += srcPixel->b * w;

            // Pixel 1,0
            srcPixel++;
            w = (weightX2 * weightY) >> 8;
            r += srcPixel->r * w;
            g += srcPixel->g * w;
            b += srcPixel->b * w;

            // Pixel 1,1
            srcPixel += src->width;
            w = (weightX2 * weightY2) >> 8;
            r += srcPixel->r * w;
            g += srcPixel->g * w;
            b += srcPixel->b * w;

            // Pixel 0,1
            srcPixel--;
            w = (weightX * weightY2) >> 8;
            r += srcPixel->r * w;
            g += srcPixel->g * w;
            b += srcPixel->b * w;

            dstPixel->r = r >> 8;
            dstPixel->g = g >> 8;
            dstPixel->b = b >> 8;
        }
    }
}

切换到更好的算法将时间缩短到了19.5毫秒。正如Andrey Kamaev的答案所说,更好的算法通过将垂直和水平调整大小分为两个单独的过程来工作。目标位图用作第一次传递的输出的临时存储空间。第二遍通过中的X遍历是向后的,以避免覆盖它即将需要的数据。这是代码:

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = srcYAndWeight >> 8;

        DfColour *dstPixel = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < src->width; x++, dstPixel++) {
            unsigned r = 0, g = 0, b = 0;

            // Pixel 0,0
            DfColour *srcPixel = &srcRow[x];
            r += srcPixel->r * weightY;
            g += srcPixel->g * weightY;
            b += srcPixel->b * weightY;

            // Pixel 1,0
            srcPixel += src->width;
            r += srcPixel->r * weightY2;
            g += srcPixel->g * weightY2;
            b += srcPixel->b * weightY2;

            dstPixel->r = r >> 8;
            dstPixel->g = g >> 8;
            dstPixel->b = b >> 8;
        }
    }

    for (int y = 0; y < dstH; y++) {
        DfColour *dstRow = &dst->pixels[y * dst->width];

        for (int x = dstW - 1; x; x--) {
            int srcXAndWeight = (x * widthRatio) >> 8;
            int srcX = srcXAndWeight >> 8;

            unsigned r = 0, g = 0, b = 0;
            unsigned weightX2 = srcXAndWeight & 0xFF;
            unsigned weightX = 256 - weightX2;

            // Pixel 0,0
            DfColour *srcPixel = &dstRow[srcX];
            r += srcPixel->r * weightX;
            g += srcPixel->g * weightX;
            b += srcPixel->b * weightX;

            // Pixel 0,1
            srcPixel++;
            r += srcPixel->r * weightX2;
            g += srcPixel->g * weightX2;
            b += srcPixel->b * weightX2;

            DfColour *dstPixel = &dstRow[x];
            dstPixel->r = r >> 8;
            dstPixel->g = g >> 8;
            dstPixel->b = b >> 8;
        }
    }
}

使用简单的可移植SIMD方案将时间缩短至16.5毫秒。该SIMD方案不使用类似SSE / AVX的专有指令集扩展。而是使用一个技巧,允许在32位整数中存储和操作红色和蓝色通道。它不像AVX实现那样快,但它具有简单性的好处。以下是代码:

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = srcYAndWeight >> 8;

        DfColour *dstPixel = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < src->width; x++, dstPixel++) {
            unsigned rb = 0, g = 0;

            // Pixel 0,0
            DfColour *srcPixel = &srcRow[x];
            rb += (srcPixel->c & 0xff00ff) * weightY;
            g += srcPixel->g * weightY;

            // Pixel 1,0
            srcPixel += src->width;
            rb += (srcPixel->c & 0xff00ff) * weightY2;
            g += srcPixel->g * weightY2;

            dstPixel->c = rb >> 8;
            dstPixel->g = g >> 8;
        }
    }

    for (int y = 0; y < dstH; y++) {
        DfColour *dstRow = &dst->pixels[y * dst->width];

        for (int x = dstW - 1; x; x--) {
            int srcXAndWeight = (x * widthRatio) >> 8;
            int srcX = srcXAndWeight >> 8;

            unsigned rb = 0, g = 0;
            unsigned weightX2 = srcXAndWeight & 0xFF;
            unsigned weightX = 256 - weightX2;

            // Pixel 0,0
            DfColour *srcPixel = &dstRow[srcX];
            rb += (srcPixel->c & 0xff00ff) * weightX;
            g += srcPixel->g * weightX;

            // Pixel 0,1
            srcPixel++;
            rb += (srcPixel->c & 0xff00ff) * weightX2;
            g += srcPixel->g * weightX2;

            DfColour *dstPixel = &dstRow[x];
            dstPixel->c = rb >> 8;
            dstPixel->g = g >> 8;
        }
    }
}

可以将X轴通道保持分离,但是合并Y轴通道。这样可以提高缓存一致性并使代码更简单。重新组合两个通道可以将时间缩短至14.6毫秒。以下是代码:

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = srcYAndWeight >> 8;

        DfColour *dstRow = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < src->width; x++) {
            unsigned rb = 0, g = 0;

            // Pixel 0,0
            DfColour *srcPixel = &srcRow[x];
            rb += (srcPixel->c & 0xff00ff) * weightY;
            g += srcPixel->g * weightY;

            // Pixel 1,0
            srcPixel += src->width;
            rb += (srcPixel->c & 0xff00ff) * weightY2;
            g += srcPixel->g * weightY2;

            dstRow[x].c = rb >> 8;
            dstRow[x].g = g >> 8;
        }

        for (int x = dstW - 1; x; x--) {
            unsigned rb = 0, g = 0;

            int srcXAndWeight = (x * widthRatio) >> 8;
            int srcX = srcXAndWeight >> 8;
            unsigned weightX2 = srcXAndWeight & 0xFF;
            unsigned weightX = 256 - weightX2;

            // Pixel 0,0
            DfColour *srcPixel = &dstRow[srcX];
            rb += (srcPixel->c & 0xff00ff) * weightX;
            g += srcPixel->g * weightX;

            // Pixel 0,1
            srcPixel++;
            rb += (srcPixel->c & 0xff00ff) * weightX2;
            g += srcPixel->g * weightX2;

            dstRow[x].c = rb >> 8;
            dstRow[x].g = g >> 8;
        }
    }
}

第二个内部循环可以使用查找表进行简化,因为每个扫描线的srcX、weightX和weightX2的值都是相同的。 在第二个内部循环中使用查找表可以将运行时间缩短至12.9毫秒。 以下是代码:

struct SrcXandWeights {
    uint8_t weightX, weightX2;
    uint16_t srcX;
};

void bilinear_interpolation(DfBitmap *src, DfBitmap *dst, float scale) {
    unsigned heightRatio = (double)(1<<8) * 255.0 / scale;
    unsigned widthRatio = (double)(1<<8) * 255.0 / scale;
    int dstH = scale * src->height;
    int dstW = scale * src->width;

    // Allocate look-up table.
    static SrcXandWeights *lut = NULL;
    static int lutSize = 0;
    if (lutSize < dstW) {
        delete [] lut;
        lut = new SrcXandWeights [dstW];
        lutSize = dstW;
    }

    // Populate look-up table.
    for (int x = 0; x < dstW; x++) {
        int srcXAndWeight = (x * widthRatio) >> 8;
        lut[x].srcX = srcXAndWeight >> 8;
        lut[x].weightX2 = srcXAndWeight & 0xFF;
        lut[x].weightX = 255 - lut[x].weightX2;
    }

    for (int y = 0; y < dstH; y++) {
        int srcYAndWeight = (y * heightRatio) >> 8;
        int srcY = (srcYAndWeight) >> 8;

        DfColour *dstRow = &dst->pixels[y * dst->width];
        DfColour *srcRow = &src->pixels[srcY * src->width];

        unsigned weightY2 = srcYAndWeight & 0xFF;
        unsigned weightY = 256 - weightY2;

        for (int x = 0; x < src->width; x++) {
            // Pixel 0,0
            DfColour *srcPixel = &srcRow[x];
            unsigned rb = (srcPixel->c & 0xff00ff) * weightY;
            unsigned g = srcPixel->g * weightY;

            // Pixel 1,0
            srcPixel += src->width;
            rb += (srcPixel->c & 0xff00ff) * weightY2;
            g += srcPixel->g * weightY2;

            dstRow[x].c = rb >> 8;
            dstRow[x].g = g >> 8;
        }

        for (int x = dstW - 1; x; x--) {
            SrcXandWeights *sw = lut + x;

            // Pixel 0,0
            DfColour *srcPixel = &dstRow[sw->srcX];
            unsigned rb = (srcPixel->c & 0xff00ff) * sw->weightX;
            unsigned g = srcPixel->g * sw->weightX;

            // Pixel 0,1
            srcPixel++;
            rb += (srcPixel->c & 0xff00ff) * sw->weightX2;
            g += srcPixel->g * sw->weightX2;

            dstRow[x].c = rb >> 8;
            dstRow[x].g = g >> 8;
        }
    }
}

目前代码仍然是单线程的。我的CPU有两个物理核心和总共4个线程。OpenCV在我的机器上使用了2个线程。我预计将代码转换为使用2个线程将减少时间到约7毫秒。

我不知道还需要什么其他技巧才能降低到4毫秒,尽管转换为真正的AVX SIMD实现可能是必要的。


0
也许有点晚了,但是还是要检查一下你是否在调试模式下运行应用程序。OpenCV是一个库,很可能是为发布而编译的 - 带有编译器优化。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接