Is it possible to speed up bilinear interpolation?

Question

c++ c image performance image-scaling

3

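First, some background.

I have two images that need to be merged. The first is the background image, in 8BppGrey format at 320x240. The second is the foreground image, in 32BppRGBA format at 64x48.

Update: the GitHub repository with an MVP is at the bottom of the question.

To do this, I resize the second image to the size of the first using bilinear interpolation and then blend the two into a single image. Blending only happens where the alpha value of the second image is greater than 0.

I need this to be as fast as possible, so my idea was to combine the resize and the merge/blend steps.

For that, I took the resize function from the writeablebitmapex repository and added the merge/blend logic.

Everything works as expected, but I want to reduce the execution time.

These are the current debug timings: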

// CPU: Intel(R) Core(TM) i7-4810MQ CPU @ 2.80GHz

MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 5 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.
MediaServer: Execution time in c++ 4 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 5 ms
MediaServer: Resizing took 4 ms.
MediaServer: Execution time in c++ 6 ms
MediaServer: Resizing took 6 ms.
MediaServer: Execution time in c++ 3 ms
MediaServer: Resizing took 3 ms.

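Is there any way to improve the performance of the resize/merge/blend process and bring the execution time down?

Are there parts that can be parallelized?

Can I take advantage of specific processor features?

The nested loop is a big performance hit, but I don't know how to write it better.

I would like the whole process to take 1 or 2 ms. Is that feasible?

Here is the modified Visual C++ function I use:

pd is the back buffer of the WriteableBitmap I use to display the result in WPF. The format is the default 32BppRGBA.
pixels is the int[] array of the 64x48 32BppRGBA image
widthSource and heightSource are the size of the pixels image
width and height are the target size of the output image
baseImage is the int[] array of the 320x240 8BppGray image (passed as byte* in the function below)

VC++ code: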

unsigned int Resize(int* pd, int* pixels, int widthSource, int heightSource, int width, int height, byte* baseImage)
{
    unsigned int start = clock();

    float xs = (float)widthSource / width;
    float ys = (float)heightSource / height;

    float fracx, fracy, ifracx, ifracy, sx, sy, l0, l1, rf, gf, bf;
    int c, x0, x1, y0, y1;
    byte c1a, c1r, c1g, c1b, c2a, c2r, c2g, c2b, c3a, c3r, c3g, c3b, c4a, c4r, c4g, c4b;
    byte a, r, g, b;

    // Bilinear
    int srcIdx = 0;

    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            sx = x * xs;
            sy = y * ys;
            x0 = (int)sx;
            y0 = (int)sy;

            // Calculate coordinates of the 4 interpolation points
            fracx = sx - x0;
            fracy = sy - y0;
            ifracx = 1.0f - fracx;
            ifracy = 1.0f - fracy;
            x1 = x0 + 1;
            if (x1 >= widthSource)
            {
                x1 = x0;
            }
            y1 = y0 + 1;
            if (y1 >= heightSource)
            {
                y1 = y0;
            }

            // Read source color
            c = pixels[y0 * widthSource + x0];
            c1a = (byte)(c >> 24);
            c1r = (byte)(c >> 16);
            c1g = (byte)(c >> 8);
            c1b = (byte)(c);

            c = pixels[y0 * widthSource + x1];
            c2a = (byte)(c >> 24);
            c2r = (byte)(c >> 16);
            c2g = (byte)(c >> 8);
            c2b = (byte)(c);

            c = pixels[y1 * widthSource + x0];
            c3a = (byte)(c >> 24);
            c3r = (byte)(c >> 16);
            c3g = (byte)(c >> 8);
            c3b = (byte)(c);

            c = pixels[y1 * widthSource + x1];
            c4a = (byte)(c >> 24);
            c4r = (byte)(c >> 16);
            c4g = (byte)(c >> 8);
            c4b = (byte)(c);

            // Calculate colors
            // Alpha
            l0 = ifracx * c1a + fracx * c2a;
            l1 = ifracx * c3a + fracx * c4a;
            a = (byte)(ifracy * l0 + fracy * l1);

            // Write destination
            if (a > 0)
            {
                // Red
                l0 = ifracx * c1r + fracx * c2r;
                l1 = ifracx * c3r + fracx * c4r;
                rf = ifracy * l0 + fracy * l1;

                // Green
                l0 = ifracx * c1g + fracx * c2g;
                l1 = ifracx * c3g + fracx * c4g;
                gf = ifracy * l0 + fracy * l1;

                // Blue
                l0 = ifracx * c1b + fracx * c2b;
                l1 = ifracx * c3b + fracx * c4b;
                bf = ifracy * l0 + fracy * l1;

                // Cast to byte
                float alpha = a / 255.0f;
                r = (byte)((rf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                g = (byte)((gf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));
                b = (byte)((bf * alpha) + (baseImage[srcIdx] * (1.0f - alpha)));

                pd[srcIdx++] = (255 << 24) | (r << 16) | (g << 8) | b;
            }
            else
            {
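                // Alpha, Red, Green, Blue
                // Read the grey value first: reading and incrementing srcIdx
                // in the same expression is unsequenced before C++17.
                const byte grey = baseImage[srcIdx];
                pd[srcIdx++] = (255 << 24) | (grey << 16) | (grey << 8) | grey;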
            }
        }
    }

    unsigned int end = clock() - start;
    return end;
}

- datoml

1

C or C++? You are writing C++, but at first glance the code looks a lot like C. Decide on one language. - user2371524

3

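You are mixing up C and C++; they are different programming languages. Microsoft's "Visual C++" ships only a (poor and somewhat hidden) C compiler, which adds to the confusion of people writing "C code as C++ code". - user2371524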

2

@Grantly: The compiler will certainly fold constant expressions like that. - M Oehm

2

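@datoml: If so, you may not need to worry about vectorization (manual or automatic). Just try running a release build with optimizations enabled and you should see a significant speedup. - Paul R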

1

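@PaulR Ouch... I hate Microsoft. I read a post saying that the managed C++ compiler is very bad at optimization and the like. So I moved the function into my native C++ library and use the managed part only as a wrapper for my C# code. Now the whole process takes just 1 ms :O. Awesome. - datoml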

3 Answers

0

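The usual ways to speed up resizing with bilinear interpolation are:

Exploit the fact that x0 and fracx do not depend on the row, and y0 and fracy do not depend on the column. Even if you do not hoist the computation of y0 and fracy out of the x loop, compiler optimization should take care of it. For x0 and fracx, however, you need to precompute the values for all columns and store them in arrays. Computing x0 and fracx then costs O(width), compared with O(width*height) without precomputation.
Do the whole computation in integer arithmetic by replacing floating-point operations with integer operations and using shifts instead of integer divisions.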
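
The integer approach from the second point can also be applied to the blending step in the question, which multiplies by a float alpha. A minimal sketch, assuming the packed layout the question uses (alpha in the top byte, then red, green, blue); `blendChannel` and `blendOverGrey` are hypothetical names that appear neither in the answer nor in WriteableBitmapEx:

```cpp
#include <cstdint>

// Hypothetical helper: blends one colour channel over a grey background with
// integer arithmetic only. All three arguments are expected in 0..255.
constexpr std::uint8_t blendChannel(std::uint32_t colour, std::uint32_t grey,
                                    std::uint32_t alpha)
{
    // The rounded division by 255 replaces the float multiply by alpha / 255.0f.
    return static_cast<std::uint8_t>(
        (colour * alpha + grey * (255u - alpha) + 127u) / 255u);
}

// Hypothetical helper: blends a packed ARGB pixel over a grey value and
// returns an opaque pixel in the same layout the question writes to pd.
constexpr std::uint32_t blendOverGrey(std::uint32_t argb, std::uint8_t grey)
{
    return (0xFFu << 24)
         | (static_cast<std::uint32_t>(
                blendChannel((argb >> 16) & 0xFFu, grey, argb >> 24)) << 16)
         | (static_cast<std::uint32_t>(
                blendChannel((argb >> 8) & 0xFFu, grey, argb >> 24)) << 8)
         | static_cast<std::uint32_t>(
                blendChannel(argb & 0xFFu, grey, argb >> 24));
}
```

This version rounds to nearest, whereas the float code in the question truncates, so individual channel values may differ by one.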

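For better readability, I did not implement the precomputation of x0 and fracx in the code below. The precomputation is easy.

Note that FACTOR = 2048 is the largest value that can be used here with 32-bit signed integers (2048 * 2048 * 255 is perfectly fine). For higher precision, switch to int64_t and increase FACTOR and SHIFT accordingly.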
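
The overflow bound can be checked at compile time. A minimal sketch, assuming 8-bit samples; `kShift`, `kFactor` and `worstCase` are illustrative names, since the answer itself uses the macros FACTOR and SHIFT:

```cpp
#include <cstdint>
#include <limits>

// Illustrative names; the answer uses the macros FACTOR and SHIFT instead.
constexpr int kShift = 11;
constexpr std::int64_t kFactor = std::int64_t{1} << kShift;  // 2048

// The four bilinear weights always sum to factor * factor, so the largest
// accumulator value is 255 * factor^2 plus the rounding term factor^2 / 2.
constexpr std::int64_t worstCase(std::int64_t factor)
{
    return 255 * factor * factor + factor * factor / 2;
}

static_assert(worstCase(kFactor) <= std::numeric_limits<std::int32_t>::max(),
              "FACTOR = 2048 fits in a signed 32-bit accumulator");
static_assert(worstCase(2 * kFactor) > std::numeric_limits<std::int32_t>::max(),
              "FACTOR = 4096 would overflow a signed 32-bit accumulator");
```

Both assertions hold, which matches the statement that 2048 is the largest power-of-two factor usable with a 32-bit signed accumulator.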

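I put the border check inside the inner loop for readability. In an optimized implementation it should be removed by ending both loops before the condition can occur and handling the border pixels separately.

In case anyone wonders what + (FACTOR * FACTOR / 2) is for: together with the subsequent division (the right shift) it rounds the result to nearest.

Finally, note that (FACTOR * FACTOR / 2) and 2 * SHIFT are evaluated at compile time.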

#define FACTOR      2048
#define SHIFT       11

const int xs = (int) ((double) FACTOR * widthSource / width + 0.5);
const int ys = (int) ((double) FACTOR * heightSource / height + 0.5);

for (int y = 0; y < height; y++)
{
    const int sy = y * ys;
    const int y0 = sy >> SHIFT;
    const int fracy = sy - (y0 << SHIFT);

    for (int x = 0; x < width; x++)
    {
        const int sx = x * xs;
        const int x0 = sx >> SHIFT;
        const int fracx = sx - (x0 << SHIFT);

        if (x0 >= widthSource - 1 || y0 >= heightSource - 1)
        {
            // insert special handling here
            continue;
        }

        const int offset = y0 * widthSource + x0;

        target[y * width + x] = (unsigned char)
            ((source[offset] * (FACTOR - fracx) * (FACTOR - fracy) +
            source[offset + 1] * fracx * (FACTOR - fracy) +
            source[offset + widthSource] * (FACTOR - fracx) * fracy +
            source[offset + widthSource + 1] * fracx * fracy +
            (FACTOR * FACTOR / 2)) >> (2 * SHIFT));
    }
}

To clarify, mapped to the variables used by the OP, for example for the alpha channel:

a = (unsigned char)
    ((c1a * (FACTOR - fracx) * (FACTOR - fracy) +
    c2a * fracx * (FACTOR - fracy) +
    c3a * (FACTOR - fracx) * fracy +
    c4a * fracx * fracy +
    (FACTOR * FACTOR / 2)) >> (2 * SHIFT));
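
The precomputation of x0 and fracx that the answer leaves out could look roughly like the following. This is a sketch that is not part of the original answer: `resizeBilinear8` is a hypothetical name, it handles a single 8-bit channel, it clamps at the right and bottom borders the way the question's code does (x1 = x0 on the last column), and it assumes the images are small enough that `x * xs` fits in an int.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: fixed-point bilinear resize of one 8-bit channel with
// the per-column values computed once instead of once per pixel.
std::vector<std::uint8_t> resizeBilinear8(const std::vector<std::uint8_t>& source,
                                          int widthSource, int heightSource,
                                          int width, int height)
{
    constexpr int kShift = 11;
    constexpr int kFactor = 1 << kShift;  // 2048, as in the answer

    assert(widthSource > 0 && heightSource > 0 && width > 0 && height > 0);
    assert(source.size() == static_cast<std::size_t>(widthSource) * heightSource);

    const int xs = static_cast<int>(static_cast<double>(kFactor) * widthSource / width + 0.5);
    const int ys = static_cast<int>(static_cast<double>(kFactor) * heightSource / height + 0.5);

    // O(width) precomputation: left and right column index plus the fraction.
    std::vector<int> x0s(width), x1s(width), fracxs(width);
    for (int x = 0; x < width; x++)
    {
        const int sx = x * xs;
        x0s[x] = std::min(sx >> kShift, widthSource - 1);
        x1s[x] = std::min(x0s[x] + 1, widthSource - 1);  // clamp at the border
        fracxs[x] = sx & (kFactor - 1);                  // low kShift bits
    }

    std::vector<std::uint8_t> target(static_cast<std::size_t>(width) * height);
    for (int y = 0; y < height; y++)
    {
        const int sy = y * ys;
        const int y0 = std::min(sy >> kShift, heightSource - 1);
        const int y1 = std::min(y0 + 1, heightSource - 1);  // clamp at the border
        const int fracy = sy & (kFactor - 1);
        const std::uint8_t* row0 = &source[static_cast<std::size_t>(y0) * widthSource];
        const std::uint8_t* row1 = &source[static_cast<std::size_t>(y1) * widthSource];

        for (int x = 0; x < width; x++)
        {
            const int fracx = fracxs[x];
            const int value =
                (row0[x0s[x]] * (kFactor - fracx) * (kFactor - fracy) +
                 row0[x1s[x]] * fracx * (kFactor - fracy) +
                 row1[x0s[x]] * (kFactor - fracx) * fracy +
                 row1[x1s[x]] * fracx * fracy +
                 (kFactor * kFactor / 2)) >> (2 * kShift);
            target[static_cast<std::size_t>(y) * width + x] = static_cast<std::uint8_t>(value);
        }
    }
    return target;
}
```

Because the border is clamped in the lookup tables, the inner loop needs no branch. For an RGBA image the same tables can be reused for every channel.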

- Pedro

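Thanks for your answer. Is this code a skeleton for the optimized code? Do I need to add the colouring in the special handling block? - datoml

This is bilinear interpolation code for an 8-bit image/channel of any kind (grayscale, one of RGB, and so on). Adding more channels is straightforward. Of course sy, y0, fracy and sx, x0, fracx are valid for all channels. - Pedro

To be more precise, this is optimized integer-arithmetic code for resizing with bilinear interpolation. - Pedro

Oh, okay. My output and input images are in 32BppRGBA format. So do I need to extend the code to support that format? - datoml

Look at your alpha channel. It is: c1a = source[offset], c2a = source[offset + 1], c3a = source[offset + widthSource], c4a = source[offset + widthSource + 1], a = target[y * width + x]. Compare that with the code and you will see. - Pedro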

0

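Thanks for all the help, but the problem was the managed C++ project. I have now moved the function into my native C++ library and use the managed C++ part only as a wrapper for the C# application.

With compiler optimizations enabled, the function now takes only 1 ms.

Edit:

I marked my own answer as the solution because @marom's optimization produced a corrupted image.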

- datoml

- marom · Accepted Answer

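One way to speed up the code is to avoid conversions between integers and floating point. This can be done by using integer values in a suitable range in place of floats in the range 0..1.

Something like this: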

for (int y = 0; y < height; y++)
{
    for (int x = 0; x < width; x++)
    {
        int sx1 = x * widthSource;
        int x0 = sx1 / width;
        int fracx = sx1 % width; // range 0..width - 1

This then turns into something like:

        l0 = (fracx * c2a + (width - fracx) * c1a) / width;

And so on. It is a bit tricky, but doable.
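
One way the fragment above might be completed for a single 8-bit channel is sketched below. It is an illustration rather than the answer's actual code: `resizeExactInt` is a hypothetical name, the border is clamped as in the question, and unlike the fragment it defers the division to the very end (with rounding) so that no precision is lost between the horizontal and vertical steps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical completion of the fragment for one 8-bit channel.
// Fractions stay integers in 0..width-1 and 0..height-1, so the weights sum
// to width * height and a single division normalizes the result.
std::vector<std::uint8_t> resizeExactInt(const std::vector<std::uint8_t>& src,
                                         int widthSource, int heightSource,
                                         int width, int height)
{
    assert(widthSource > 0 && heightSource > 0 && width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(widthSource) * heightSource);

    std::vector<std::uint8_t> dst(static_cast<std::size_t>(width) * height);
    const std::int64_t norm = static_cast<std::int64_t>(width) * height;

    for (int y = 0; y < height; y++)
    {
        const int sy1 = y * heightSource;
        const int y0 = sy1 / height;
        const int fracy = sy1 % height;                        // range 0..height - 1
        const int y1 = (y0 + 1 < heightSource) ? y0 + 1 : y0;  // clamp at the border

        for (int x = 0; x < width; x++)
        {
            const int sx1 = x * widthSource;
            const int x0 = sx1 / width;
            const int fracx = sx1 % width;                       // range 0..width - 1
            const int x1 = (x0 + 1 < widthSource) ? x0 + 1 : x0; // clamp at the border

            const std::int64_t c1 = src[static_cast<std::size_t>(y0) * widthSource + x0];
            const std::int64_t c2 = src[static_cast<std::size_t>(y0) * widthSource + x1];
            const std::int64_t c3 = src[static_cast<std::size_t>(y1) * widthSource + x0];
            const std::int64_t c4 = src[static_cast<std::size_t>(y1) * widthSource + x1];

            // Horizontal pass, still scaled by width.
            const std::int64_t l0 = fracx * c2 + (width - fracx) * c1;
            const std::int64_t l1 = fracx * c4 + (width - fracx) * c3;

            // Vertical pass, then one rounded division by width * height.
            const std::int64_t sum = fracy * l1 + (height - fracy) * l0;
            dst[static_cast<std::size_t>(y) * width + x] =
                static_cast<std::uint8_t>((sum + norm / 2) / norm);
        }
    }
    return dst;
}
```

This variant still performs divisions and modulo operations per pixel, which are typically more expensive than the shifts used in the power-of-two approach from the other answer.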