Compressing a set of large integers

Question

compression integer

15

I have a set of integers that I would like to represent as compactly as possible. Here are my constraints and characteristics:

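It is a set, or in other words a list of unique integers in which order does not matter.
The size L of the set is relatively small (typically 1000 elements).
The integers follow a uniform distribution between 0 and N-1, with N relatively large (say 2^32).
Access to the elements of the compressed set is random, but it is fine if the decompression procedure is not that fast.
The compression should be lossless, obviously.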

I have tried a few things, but I am not satisfied with the results, and I am convinced that a better solution exists:

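Delta encoding (sort, then encode the differences), or alternatively sort, then encode the difference between the i-th element and i*N/L. Both give reasonable results, but not great ones, because of the typical sizes of N and L. Huffman coding the deltas does not help because they are usually large.
Recursive range reduction ( http://ygdes.com/ddj-3r/ddj-3r_compact.html ). This seems clever, but it works best on exponentially decreasing integers, which is definitely not the case here.
A few discussions on stackoverflow look similar, but are not quite the same as my problem (C Library for compressing sequential positive integers, Compress sorted integers).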

I would be glad to hear any ideas you may have. Thanks!

- doc

If L is about 1000 essentially random elements, then compression may cost more than just storing the numbers themselves. - msw

1

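@msw Well, delta encoding does help, for example (compression ratio of about 80%). I have a feeling that better can be done. I must stress that the elements may be reorganized, in particular sorted. - doc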

1

+1 for a well-written question. Readers may also be interested in Bloom filters. - Hans

3 Answers

2

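Has this problem still not been solved?

I am working on it.
(PS: I am a game creator, not a mathematician.)
I have slept badly for weeks, wondering why we do not use A^B+C variants (or other methods) to compress images and information.

My utopian goal is to compress a 4.6-million-digit number into as few A^B+C formula combinations as possible, found by the computer's GPU. Basically I am trying to do this because it would allow storing/streaming small images (<100 characters) over Wifi at 30fps without losing quality or using too much bandwidth.

My realistic goal is to compress a 200-digit number into <5 characters.

PS: To reach that goal, I created "Base Chinais". If you want to use it: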
- https://github.com/EloiStree/2019_09_19_MathCompressionOfImage/wiki/SouthChinais
- https://gitlab.com/eloistree/2019_09_06_UnicodeBasedId

Base(Chinais) 䶯 = 38727
It can turn 2307^200+32450 into 碸^災+㔩.
If you compress a raw BigInteger, Base Chinais gives 4 to 4.5 times compression:
1413546486463454579816416416416462324833676542
4钉澻둲觋㷬乮䄠櫡䒤갱

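Now I need to compress <200-digit numbers into 9999^9999+99999999.
If you have any ideas about A^B+C, or an alternative, feel free to let me know.
I am experimenting in Unity3D and have spent a lot of time on it.
I will post what I find here: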
https://github.com/EloiStree/2019_09_19_MathCompressionOfImage/wiki

Hope this helps the next person who runs into this problem.

If you want to discuss it, find me on Discord.
https://eloistree.page.link/discord

- Eloi Stree

2

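If the integers are random, uncorrelated, and really follow a uniform distribution over [0, 2³²-1], then it can probably be shown that you cannot compress the array below its trivial representation. Am I missing something in your question?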

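For non-random arrays of numbers I generally use a simple deflate. It is a commonly used algorithm because it works well for general, not totally random, arrays. All major languages also have good libraries with tunable compression levels, which is a plus.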

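I use deflate to compress small arrays (about 300 to 2000 32-bit integers) of physical sensor measurements and get a 70% gain, but that is because successive sensor measurements are rarely very different.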

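Finding a noticeably better algorithm that works in all cases is probably not easy. Most improvements come from the specific properties of your number series.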

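Note also that you would get a better compression gain by compressing many sets together. Of course, that may be very inconvenient, depending on your application.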

- Denys Séguret

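Thanks for your answer. I am aware of entropy compression techniques such as deflate, but they do not seem to take into account the fact that the data may be reorganized to improve compression. - doc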

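You mean that the order of the data can be changed, and that it really is a set and not an array? If so, what does "access to the elements of the compressed set is random" mean? - Denys Séguret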

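Exactly (sorry if that was not clear). I mean that once the set is compressed, the typical query is for one particular element, not the whole set; but as I said, it is fine if the whole set has to be decompressed to query that element. You may think that "querying" an element of a set is useless, but you can think of it as an associative array where the integers are the keys. This is not really important. - doc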

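Since this is a random selection of set members, rather than a random array of integers, noticeable compression can in fact be achieved. About a 25% reduction. See my answer. - Mark Adler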

- Mark Adler · Accepted Answer

You can get an idea of the best you can do by counting. (I wish stackoverflow allowed TeX equations like math.stackexchange. Anyway ...)

ceiling(log(Combination(2^32,1000)) / (8 * log(2))) = 2934

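If, as you say, the selection is uniformly distributed, then the best compression you could hope for on average in this case is 2934 bytes. The best ratio is 73.35% of the unencoded representation of 4000 bytes (1000 integers of 4 bytes each). Combination(2^32,1000) is simply the total number of possible inputs to the compression algorithm. If those are uniformly distributed, then the optimal coding is one giant integer that identifies each possible input by an index. Each giant integer value uniquely identifies one of the inputs. Imagine looking up the input by index in a giant table. ceiling(log(Combination(2^32,1000)) / log(2)) is how many bits you need for that index integer.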
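
As a rough check of the counting bound above, here is a minimal sketch that evaluates it in plain C. The function name `subset_index_bytes` is hypothetical and the arithmetic is floating-point, so treat the result as an estimate rather than an exact computation. Because Combination(2^32,1000) is far too large for a double, the running product is kept as a mantissa in [1, 2) plus an integer power of two.

```c
#include <assert.h>

/* Return ceiling(log2(Combination(n, k)) / 8): the smallest whole number of
   bytes that can index every k-element subset of n values.  (Hypothetical
   helper, not part of the original answer.) */
static long subset_index_bytes(double n, int k)
{
    double mant = 1.0;  /* running product divided by 2^exp2, kept in [1, 2) */
    long exp2 = 0;      /* whole powers of two factored out so far */
    long bits;
    int i;

    for (i = 0; i < k; i++) {
        /* Combination(n, k) is the product of (n - i) / (i + 1) for i < k */
        mant *= (n - i) / (i + 1);
        while (mant >= 2.0) {
            mant /= 2.0;
            exp2++;
        }
        while (mant < 1.0) {
            mant *= 2.0;
            exp2--;
        }
    }

    /* ceiling(log2(product)): one more bit unless the product is exactly a
       power of two (subject to floating-point rounding) */
    bits = exp2 + (mant > 1.0 ? 1 : 0);
    return (bits + 7) / 8;
}
```

Calling `subset_index_bytes(4294967296.0, 1000)` should reproduce the 2934 bytes quoted above, against 4000 bytes for the raw 32-bit values.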

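Update: I found a way to get close to the theoretical best using off-the-shelf compression tools. I sort, apply delta coding, and subtract one from the deltas (since the delta between successive distinct elements is at least one). Then the trick is that I write out all of the high bytes, then the next most significant bytes, and so on. The high bytes of the deltas minus one tend to be zero, so this groups many zeros together, which the standard compression utilities love. The next set of bytes also tends to be biased toward low values.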
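
The transform described above might look roughly like the following sketch. The function names are hypothetical, and coding the first element as itself is an assumption, since the answer does not say how the first value was handled. The output buffer would then be passed to gzip or xz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Turn a sorted array of distinct 32-bit values into "deltas minus one" and
   store them as four byte planes: all most significant bytes first, then the
   next most significant bytes, and so on.  out[] must hold 4 * n bytes.
   Assumption: the first value is stored as itself. */
static void delta_planes_encode(const uint32_t *sorted, size_t n, uint8_t *out)
{
    uint32_t prev = 0;
    size_t i;
    int plane;

    for (i = 0; i < n; i++) {
        uint32_t d = i ? sorted[i] - prev - 1 : sorted[0];

        /* plane 0 receives bits 31..24, plane 3 receives bits 7..0 */
        for (plane = 0; plane < 4; plane++)
            out[(size_t)plane * n + i] = (uint8_t)(d >> (24 - 8 * plane));
        prev = sorted[i];
    }
}

/* Inverse of delta_planes_encode(): rebuild the sorted values from the four
   byte planes in in[0..4*n-1]. */
static void delta_planes_decode(const uint8_t *in, size_t n, uint32_t *sorted)
{
    uint32_t prev = 0;
    size_t i;
    int plane;

    for (i = 0; i < n; i++) {
        uint32_t d = 0;

        for (plane = 0; plane < 4; plane++)
            d = (d << 8) | in[(size_t)plane * n + i];
        prev = i ? prev + d + 1 : d;
        sorted[i] = prev;
    }
}
```

With 1000 values spread over 2^32, a typical delta is a few million, so the first plane is almost entirely zeros and the second is skewed toward small values, which is what a general-purpose compressor can exploit.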

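For a sample (1000 uniform and distinct samples from 0..2^32-1), I get an average of 3110 bytes when run through gzip -9, and 3098 bytes through xz -9 (xz uses the same LZMA compression as 7zip). Those are pretty close to the theoretical best average of 2934. Also, gzip has an overhead of 18 bytes and xz an overhead of 24 bytes, both for headers and trailers. So a fairer comparison with the theoretical best would be 3092 for gzip -9 and 3074 for xz -9. That is about 5% larger than the theoretical best.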

Update 2:

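I implemented direct encoding of the combinations and got an average of 2974 bytes, only a little over 1% more than the theoretical best. I used the GNU multiple precision arithmetic library to encode an index for each combination in a giant integer. The actual code for the encoding and decoding is shown below. I added comments for the mpz_* functions where it might not be obvious from the name what arithmetic operation they perform.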

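#include <assert.h>
#include <gmp.h>

/* "local" marks file-scope functions, as in zlib */
#define local static

/* Recursively code the members in set[] between low and high (low and high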
   themselves have already been coded).  First code the middle member 'mid'.
   Then recursively code the members between low and mid, and then between mid
   and high. */
local void combination_encode_between(mpz_t pack, mpz_t base,
                                      const unsigned long *set,
                                      int low, int high)
{
    int mid;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately (also in that case, verify that set[] is sorted
       in ascending order) */
    mid = (low + high) >> 1;
    if (mid == low) {
        assert(set[low] < set[high]);
        return;
    }

    /* code set[mid] into pack, and update base with the number of possible
       set[mid] values between set[low] and set[high] for the next coded
       member */
        /* pack += base * (set[mid] - set[low] - 1) */
    mpz_addmul_ui(pack, base, set[mid] - set[low] - 1);
        /* base *= set[high] - set[low] - 1 */
    mpz_mul_ui(base, base, set[high] - set[low] - 1);

    /* code the rest between low and high */
    combination_encode_between(pack, base, set, low, mid);
    combination_encode_between(pack, base, set, mid, high);
}

/* Encode the set of integers set[0..num-1], where each element is a unique
   integer in the range 0..max.  No value appears more than once in set[]
   (hence the name "set").  The elements of set[] must be sorted in ascending
   order. */
local void combination_encode(mpz_t pack, const unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t base;

    /* handle degenerate cases and verify last member <= max -- code set[0]
       into pack as simply itself and set base to the number of possible set[0]
       values for coding the next member */
    if (num < 1) {
            /* pack = 0 */
        mpz_set_ui(pack, 0);
        return;
    }
        /* pack = set[0] */
    mpz_set_ui(pack, set[0]);
    if (num < 2) {
        assert(set[0] <= max);
        return;
    }
    assert(set[num - 1] <= max);
        /* base = max - num + 2 */
    mpz_init_set_ui(base, max - num + 2);

    /* code the last member of the set and update base with the number of
       possible last member values */
        /* pack += base * (set[num - 1] - set[0] - 1) */
    mpz_addmul_ui(pack, base, set[num - 1] - set[0] - 1);
        /* base *= max - set[0] */
    mpz_mul_ui(base, base, max - set[0]);

    /* encode the members between 0 and num - 1 */
    combination_encode_between(pack, base, set, 0, num - 1);
    mpz_clear(base);
}

/* Recursively decode the members in set[] between low and high (low and high
   themselves have already been decoded).  First decode the middle member
   'mid'. Then recursively decode the members between low and mid, and then
   between mid and high. */
local void combination_decode_between(mpz_t unpack, unsigned long *set,
                                      int low, int high)
{
    int mid;
    unsigned long rem;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately */
    mid = (low + high) >> 1;
    if (mid == low)
        return;

    /* extract set[mid] as the remainder of dividing unpack by the number of
       possible set[mid] values, update unpack with the quotient */
        /* div = set[high] - set[low] - 1, rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, set[high] - set[low] - 1);
    set[mid] = set[low] + 1 + rem;

    /* decode the rest between low and high */
    combination_decode_between(unpack, set, low, mid);
    combination_decode_between(unpack, set, mid, high);
}

/* Decode from pack the set of integers encoded by combination_encode(),
   putting the result in set[0..num-1].  max must be the same value used when
   encoding. */
local void combination_decode(const mpz_t pack, unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t unpack;
    unsigned long rem;

    /* handle degenerate cases, returning the value of pack as the only element
       for num == 1 */
    if (num < 1)
        return;
    if (num < 2) {
            /* set[0] = (unsigned long)pack */
        set[0] = mpz_get_ui(pack);
        return;
    }

    /* extract set[0] as the remainder after dividing pack by the number of
       possible set[0] values, set unpack to the quotient */
    mpz_init(unpack);
        /* div = max - num + 2, set[0] = pack % div, unpack = pack / div */
    set[0] = mpz_fdiv_q_ui(unpack, pack, max - num + 2);

    /* extract the last member as the remainder after dividing by the number
       of possible values, taking into account the first member -- update
       unpack with the quotient */
        /* div = max - set[0], rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, max - set[0]);
    set[num - 1] = set[0] + 1 + rem;

    /* decode the members between 0 and num - 1 */
    combination_decode_between(unpack, set, 0, num - 1);
    mpz_clear(unpack);
}

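There are mpz_* functions for writing the number to a file and reading it back, or for exporting the number to a specified format in memory and importing it back.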
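
To see the packing arithmetic without GMP, here is a toy sketch of the same scheme that uses a 64-bit integer in place of mpz_t. All `toy_*` names are hypothetical, and it is only valid for tiny sets where the product of all the radices fits in 64 bits (for example five members in 0..99); it is meant to illustrate the mixed-radix idea, not to replace the code above.

```c
#include <assert.h>
#include <stdint.h>

/* Code set[mid] as a digit whose radix is the number of values strictly
   between set[low] and set[high], then recurse on both halves. */
static void toy_encode_between(uint64_t *pack, uint64_t *base,
                               const unsigned *set, int low, int high)
{
    int mid = (low + high) >> 1;

    if (mid == low)
        return;
    *pack += *base * (set[mid] - set[low] - 1);
    *base *= set[high] - set[low] - 1;
    toy_encode_between(pack, base, set, low, mid);
    toy_encode_between(pack, base, set, mid, high);
}

/* Pack the sorted, distinct values set[0..num-1], each in 0..max, into one
   integer.  No overflow check is made (toy assumption). */
static uint64_t toy_encode(const unsigned *set, int num, unsigned max)
{
    uint64_t pack, base;

    if (num < 1)
        return 0;
    pack = set[0];
    if (num < 2)
        return pack;
    base = max - num + 2;               /* number of possible set[0] values */
    pack += base * (set[num - 1] - set[0] - 1);
    base *= max - set[0];               /* number of possible last values */
    toy_encode_between(&pack, &base, set, 0, num - 1);
    return pack;
}

/* Peel off set[mid] as a remainder, keep the quotient, then recurse in the
   same order the encoder used. */
static void toy_decode_between(uint64_t *unpack, unsigned *set,
                               int low, int high)
{
    int mid = (low + high) >> 1;
    uint64_t div;

    if (mid == low)
        return;
    div = set[high] - set[low] - 1;
    set[mid] = set[low] + 1 + (unsigned)(*unpack % div);
    *unpack /= div;
    toy_decode_between(unpack, set, low, mid);
    toy_decode_between(unpack, set, mid, high);
}

/* Recover set[0..num-1] from pack; max must match the value used to encode. */
static void toy_decode(uint64_t pack, unsigned *set, int num, unsigned max)
{
    uint64_t div;

    if (num < 1)
        return;
    if (num < 2) {
        set[0] = (unsigned)pack;
        return;
    }
    div = max - num + 2;
    set[0] = (unsigned)(pack % div);
    pack /= div;
    div = max - set[0];
    set[num - 1] = set[0] + 1 + (unsigned)(pack % div);
    pack /= div;
    toy_decode_between(&pack, set, 0, num - 1);
}
```

Each member is stored as a digit in a radix equal to the number of values it could still take, so digits coded later, which sit between ever closer neighbours, cost fewer bits.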