How can I speed up a nested loop with a lookup operation?

Question

python performance numpy opencv nested-loops

3

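I am programming halftoning of images for laser engraving. With the given setup, the laser can only be on or off, so I can feed it a binary image with 1-bit depth. I therefore convert a grayscale image with 8-bit depth (0 to 255) into a binary image with 1-bit depth (0 to 1).

I have included two example images below. On the left is the grayscale image. On the right is the result of replacing each pixel with 3x3 binary pixels. The results look similar because the gray comes from the density of black pixels.

My current attempt uses nested loops to access the pixels and replaces each pixel in the output image with a value looked up in a dictionary: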

import math
import time

import numpy as np

TONES = [[0, 0,
          0, 0],
         [0, 1,
          0, 0],
         [1, 1,
          0, 0],
         [1, 1,
          0, 1],
         [1, 1,
          1, 1]]

def process_tones():
    """Converts the tones above to the right shape."""
    tones_dict = dict()

    for t in TONES:
        brightness = sum(t)
        bitmap_tone = np.reshape(t, (2, 2)) * 255
        tones_dict[brightness] = bitmap_tone
    return(tones_dict)

def halftone(gray, tones_dict):
    """Generate a new image where each pixel is replaced by one with the values in tones_dict.
    """

    num_rows = gray.shape[0]
    num_cols = gray.shape[1]
    num_tones = len(tones_dict)
    tone_width = int(math.sqrt(num_tones - 1))

    output = np.zeros((num_rows * tone_width, num_cols * tone_width),
                         dtype = np.uint8)

    # Go through each pixel
    for i in range(num_rows):
        i_output = range(i * tone_width, (i + 1)* tone_width)

        for j in range(num_cols):
            j_output = range(j * tone_width, (j + 1)* tone_width)

            pixel = gray[i, j]
            brightness = int(round((num_tones - 1) * pixel / 255))

            output[np.ix_(i_output, j_output)] = tones_dict[brightness]

    return output

def generate_gray_image(width = 100, height = 100):
    """Generates a random grayscale image.
    """

    return (np.random.rand(width, height) * 256).astype(np.uint8)

gray = generate_gray_image()
tones_dict = process_tones()

start = time.time()
for i in range(10):
    binary = halftone(gray, tones_dict = tones_dict)
duration = time.time() - start
print("Average loop time: " + str(duration))

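The output is:

Average loop time: 3.228989839553833

For a 100x100 image, the average loop takes 3 seconds, which seems long compared with OpenCV's functions.

I looked at "How to speed up Python nested loops?" and "Looping over pixels in an image", but I did not immediately see how to vectorize this operation.

How can I speed up this nested loop with a lookup operation?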

- miguelmorin

2

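You could use a compiled language. Python is not the best tool for everything. - TomServo

4

Please provide some sample data and the expected output. [MCVE] (minimal, complete, and verifiable example) - Alexander

2

Please provide sample data and the expected output. If you use numpy, what you are asking for does not even need nested for loops. - pissall

1

Besides some sample data, it would help to explain in more detail what you are trying to do. There are many ways to "fill an output array with looked-up values based on the gray value of a pixel". - jirassimok

1

If you want more speed, I suggest removing the if block that contains the print statement. - C.Nivs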

3 Answers

1

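This can be solved very quickly with pure numpy.

First, compute brightness in a vectorized way.
Next, index tones with brightness, which turns gray into a 4D array of shape HxWx2x2.
Use np.transpose to reorder the array so that the dimensions introduced by tones are interleaved with the original dimensions of gray. The image becomes Hx2xWx2.
"Flatten/merge" the vertical dimensions (H from gray and 2 from the tone), and do the same for the horizontal dimensions (W from gray and 2 from the tone). This is done by reshaping to (H*2)x(W*2).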
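
To see why the transpose has to come before the reshape, here is a minimal sketch of the steps above on a 1x2 image. The arrays `tiles` and `idx` are made up for illustration and are not taken from the question.

```python
import numpy as np

# Two hypothetical 2x2 tiles, numbered so each element is easy to trace.
tiles = np.array([[[1, 2],
                   [3, 4]],
                  [[5, 6],
                   [7, 8]]], dtype=np.uint8)

# A 1x2 "image" of tile indices: tile 0 on the left, tile 1 on the right.
idx = np.array([[0, 1]])

blocks = tiles[idx]                          # shape (H, W, 2, 2) = (1, 2, 2, 2)
interleaved = blocks.transpose(0, 2, 1, 3)   # shape (H, 2, W, 2)
tiled = interleaved.reshape(2, 4)            # shape (H*2, W*2)

# Reshaping without the transpose lays each tile out as a row instead.
wrong = blocks.reshape(2, 4)

print(tiled)   # tiles sit side by side
print(wrong)   # tiles are flattened row by row
```

With the transpose, the two tiles end up next to each other as 2x2 blocks; without it, each tile is flattened into one output row.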

Paste the following code below the code from the question and run it.

def process_tones2():
    tones = np.array(TONES, dtype='u1')
    size = int(np.sqrt(tones.shape[-1]))
    tones = 255 * tones.reshape(-1, size, size)
    bins = tones.sum(axis=(-2,-1), dtype=int) // size ** 2
    iperm = np.argsort(bins)
    return bins[iperm], tones[iperm]

def halftone_fast(gray, bins, tones):
    height, width = gray.shape
    tone_height, tone_width = tones.shape[-2:]
    brightness = np.round(gray / 255 * (len(tones) - 1)).astype('u1')
    binary4d = tones[brightness]
    binary4d = binary4d.transpose((0,2,1,3))
    binary = binary4d.reshape(height * tone_height, width * tone_width)
    return binary

bins, tones = process_tones2()
start = time.time()
for i in range(10):
    binary2 = halftone_fast(gray, bins, tones)
duration = time.time() - start
print("Average loop time: " + str(duration))
print("Error:", np.linalg.norm(binary.astype(float) - binary2))

On my machine, I get the following results:

Average loop time: 2.3393328189849854
Average loop time: 0.0032405853271484375
Error: 0.0

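That is a speedup of roughly 700x (2.339 s versus 0.0032 s).

Note that the bins argument is not used in halftone_fast(). The reason is that it is not needed for this halftoning. The code in the question only works if TONES starts at 0 and ends with all brightness levels, so brightness can serve directly as an index into the sorted list of tones.

If the mapping is not linear, np.digitize(gray, bins) must be used to compute the correct index into the tones array.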
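
As a rough sketch of that non-linear case: the thresholds in `edges` below are hypothetical (they are not the bins returned by process_tones2()), and each one is assumed to mark the smallest gray value assigned to a tone. With a first edge of 0, np.digitize returns 1-based bin numbers, so 1 is subtracted to get an index into tones.

```python
import numpy as np

# Hypothetical, non-uniform thresholds: edges[k] is the smallest gray value
# that maps to tone k. These values are for illustration only.
edges = np.array([0, 32, 96, 200, 250])

gray = np.array([[0, 31, 32],
                 [95, 200, 255]], dtype=np.uint8)

# digitize returns i such that edges[i-1] <= x < edges[i];
# subtracting 1 turns that into a 0-based tone index.
index = np.digitize(gray, edges) - 1
print(index)
```

The resulting `index` array could then be used in place of brightness, as in tones[index].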

- tstanisl

Great solution! - Alexander

1

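Your algorithm seems to have two parts: computing the "brightness" of each pixel, and replacing the pixels with halftone dots.

First, I will assume that the input image has shape (h, w).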

grayscale = np.array(...)
h, w = grayscale.shape

Brightness levels

Computing the brightness takes two steps:

Determine the bounds for each brightness level. This can be achieved by using np.linspace to divide the range [0, 256) into num_tones equal-sized chunks.
```
bins = np.linspace(0, 256, num_tones + 1)
# e.g. with 4 tones: [0, 64, 128, 192, 256]
```
Determine which level each pixel falls in. This can be achieved using np.digitize.
```
# (subtract 1 because digitize counts from 1)
levels = np.digitize(grayscale, bins) - 1  # shape (h, w)
```
Then levels[i, j] is the brightness level of grayscale[i, j] (from 0 to num_tones - 1, inclusive).

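Halftones

Now that you have the brightness level of each pixel, you can use those as keys to get their halftone matrices. To keep this as simple as possible, store the halftones in a NumPy array rather than a dictionary.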

tones = np.array(...)  # shape(num_tones, x, y)
x, y = tones.shape[1:]

By using the image's brightness levels as an index array¹ into tones, you get the halftone matrix for each pixel.

halftones = tones[levels]  # shape (h, w, x, y)
# halftones[i, j] is the halftone for grayscale[i, j]

Then it is just a matter of reordering the axes and flattening the array.

# Reorder axes so halftone rows are before image columns
ordered = halftones.swapaxes(1, 2)  # shape (h, x, w, y)

# Make it 2-dimensional
result = ordered.reshape(h * x, w * y)
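
Putting the fragments of this answer together, a minimal self-contained sketch might look like the following. The function name np_halftone is borrowed from the timing output in this answer; the 2x2 test image and the way tones is built from the question's TONES are assumptions made for illustration. Note that equal-width bins do not always pick the same tone as the rounding used in the question.

```python
import numpy as np

def np_halftone(grayscale, tones):
    """Halftone `grayscale` (h, w) with `tones` of shape (num_tones, x, y)."""
    h, w = grayscale.shape
    num_tones, x, y = tones.shape
    # num_tones equal-width bins over [0, 256)
    bins = np.linspace(0, 256, num_tones + 1)
    # digitize counts from 1, so subtract 1 to get levels 0..num_tones-1
    levels = np.digitize(grayscale, bins) - 1      # shape (h, w)
    halftones = tones[levels]                      # shape (h, w, x, y)
    # Interleave tile rows with image rows, then merge to 2D
    return halftones.swapaxes(1, 2).reshape(h * x, w * y)

# The question's TONES, reshaped to (5, 2, 2) and scaled to 0/255.
tones = 255 * np.array([[0, 0, 0, 0],
                        [0, 1, 0, 0],
                        [1, 1, 0, 0],
                        [1, 1, 0, 1],
                        [1, 1, 1, 1]], dtype=np.uint8).reshape(-1, 2, 2)

# Tiny hypothetical image: levels 0, 4, 2 and 1 respectively.
gray = np.array([[0, 255],
                 [128, 60]], dtype=np.uint8)

result = np_halftone(gray, tones)
print(result)
```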

Speed

I wrote a script to compare the speed of the original code, my answer, and tstanisl's answer. The results:

Best times
halftone:      0.346237126000000
np_halftone:   0.000565907715000
halftone_fast: 0.000437084295000

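Both answers run hundreds of times faster than the original code (about 600x for mine, about 800x for tstanisl's), and tstanisl's is about 30% faster than mine.

In exchange for that speed, my function has one small advantage over tstanisl's and over the original: if you want to use custom tones that do not correspond directly to their total brightness, this algorithm will still work (for example, if you want to invert the colors in the halftone). Otherwise, tstanisl's is more efficient.

¹ The last example in the linked section of the NumPy user guide is actually very similar to this; it discusses mapping image color values to RGB triples.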

- jirassimok

Nice approach, using a 4-dimensional array and then swapping the axes. Much more concise than my explicit stacking. - Dan Mašek

- Dan Mašek · Accepted Answer

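The key is not to iterate at such a fine granularity as you do now, and instead to offload most of the work to optimized NumPy functions.

Conceptually, we can view the output image as a set of smaller images (call them "channels"), each of which holds the data for one position in the halftone grid.

The individual channel images can be generated by a simple lookup, which in NumPy can be done by indexing a lookup table with the grayscale image (i.e. LUT[image]).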
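
As a small illustration of the LUT[image] idiom, here is a sketch with a made-up threshold table. The names `lut` and `image` and the threshold at 128 are assumptions for the example, not part of this answer's tables.

```python
import numpy as np

# Hypothetical lookup table: entry g is the output value for gray level g.
lut = np.zeros(256, dtype=np.uint8)
lut[128:] = 255          # a simple threshold, for illustration only

image = np.array([[0, 127],
                  [128, 255]], dtype=np.uint8)

# Integer-array indexing: every pixel p is replaced by lut[p],
# and the result has the same shape as image.
channel = lut[image]
print(channel)
```

The whole image is mapped in a single indexing operation, with no Python-level loop over pixels.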

Lookup tables

Suppose we define the "tile size" (the size of the halftone pattern) and the individual tone tiles as follows:

TILE_SIZE = (2, 2) # Rows, Cols

TONES = np.array(
    [[0, 0,
      0, 0],
     [0, 1,
      0, 0],
     [1, 1,
      0, 0],
     [1, 1,
      0, 1],
     [1, 1,
      1, 1]]
    , dtype=np.uint8) * 255

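We first use np.linspace to compute the mapping between grayscale values and tone indices. Then, for each position, we create a lookup table from the definition of the tones (using the same lookup technique to do so).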

def generate_LUTs(tones, tile_size):
    num_tones, num_tiles = tones.shape
    tile_rows, tile_cols = tile_size
    assert(num_tiles == (tile_rows * tile_cols))

    # Generate map between grayscale value and tone index
    gray_level = np.linspace(0, (num_tones - 1), 256, dtype=np.float32)
    tone_map = np.uint8(np.round(gray_level))

    # Generate lookup tables for each tile
    LUTs = []
    for tile in range(num_tiles):
        LUTs.append(tones[:,tile][tone_map])

    return LUTs

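Merging the channels

Now we merge the channels into the complete output image.

The first step is to reshape each channel image so that it has only one column.

Then we can use np.hstack to combine all the channel images that share the same row of the halftone pattern.

Next, we reshape the result so that it has the same number of rows as the input image (i.e. it will now have twice as many columns).

We use np.hstack again to combine all the reshaped images.

Finally, we reshape the result so that it has the correct number of rows (according to the tile size), and we are done.

In code (works for any tile size):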

def halftone(image, LUTs, tile_size):
    tiles = []
    for tile in range(len(LUTs)):
        tiles.append(LUTs[tile][image])

    image_rows, _ = image.shape
    tile_rows, tile_cols = tile_size

    merged_rows = []
    for row in range(tile_rows):
        row_tiles = tiles[row * tile_cols:(row + 1) * tile_cols]
        merged_row = np.hstack([row_tile.reshape(-1, 1) for row_tile in row_tiles])
        merged_rows.append(merged_row.reshape(image_rows, -1))

    return np.hstack(merged_rows).reshape(image_rows * tile_rows, -1)

Example usage:

LUTs = generate_LUTs(TONES, TILE_SIZE)
binary = halftone(gray, LUTs, TILE_SIZE)

Sample output:

With 3x3 tiles: