Algorithm challenge: arbitrary in-place base conversion for lossless string compression

5

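To make this more concrete, consider a real-world example. Suppose I am writing a web app backed by MongoDB, so my records have long hexadecimal primary keys, and the URL for viewing a record looks like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long. URLs can use many more characters than that. A 24-digit hex number has just under 8 x 10^28 (16^24) possible values, but if you limit yourself to the characters matched by the regex class [a-zA-Z0-9] (YouTube video IDs use more), which is 62 characters, you can exceed 8 x 10^28 with only 17 characters.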
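The arithmetic behind the "17 characters" figure can be checked with a short sketch. The helper name `minDigits` is hypothetical and not part of the original question; it relies on floating-point logarithms, so treat it as an estimate that could be off by one near exact powers.

```javascript
// Smallest number of digits in base `destBase` needed to represent every
// string of `srcLen` digits in base `srcBase`.
// Assumption: floating-point logs are accurate enough for this estimate.
function minDigits(srcLen, srcBase, destBase) {
  return Math.ceil(srcLen * Math.log(srcBase) / Math.log(destBase));
}

console.log(minDigits(24, 16, 62)); // 17
```

Sixteen characters are not enough because 62^16 is still smaller than 16^24.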

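I want an algorithm that converts any string restricted to one specific alphabet of characters into another string over a different alphabet, where the value of each character c can be thought of as alphabet.indexOf(c).

Something of this form: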

convert(value, sourceAlphabet, destinationAlphabet)

Preconditions

  • All parameters are strings.
  • Every character in value exists in sourceAlphabet.
  • Every character in sourceAlphabet and destinationAlphabet is unique.

Simple example

var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";

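But I also want it to be able to convert War and Peace from the Russian alphabet plus some punctuation into the entire Unicode character set, and then back again losslessly.
Is this possible?
The only way I was ever taught to do base conversion in Comp Sci 101 was to first convert to a base-ten integer by summing digit * base^position, and then do the reverse to convert to the target base. That method is insufficient for very long strings, because the integers get too big.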
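For reference, the textbook method described above can be sketched as follows. The function name `convertSmall` is hypothetical; the sketch assumes the preconditions listed earlier and uses an ordinary JavaScript Number, so it deliberately fails once the value exceeds `Number.MAX_SAFE_INTEGER`, which is exactly the limitation the question is about.

```javascript
// Naive conversion through an ordinary Number: accumulate
// digit * base^position, then peel off destination digits by repeated
// division. Only exact while the value stays <= Number.MAX_SAFE_INTEGER.
function convertSmall(value, sourceAlphabet, destinationAlphabet) {
  const src = Array.from(sourceAlphabet);      // arrays of code points
  const dst = Array.from(destinationAlphabet);

  // Horner's rule, equivalent to summing digit * base^position.
  let n = 0;
  for (const c of value) {
    n = n * src.length + src.indexOf(c);
    if (n > Number.MAX_SAFE_INTEGER) {
      throw new RangeError("value too large for exact Number arithmetic");
    }
  }

  // Emit destination digits, least significant first.
  let out = "";
  do {
    out = dst[n % dst.length] + out;
    n = Math.floor(n / dst.length);
  } while (n > 0);
  return out;
}
```

With the alphabets from the simple example, `convertSmall("12245589", base10, hex)` gives "bada55", while the 24-digit hex key from the URL example throws a RangeError.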
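Intuitively, it feels as though a base conversion could be done in place as you step through the string (probably backwards, to maintain the standard order of significant digits), keeping track of a remainder somehow, but I am not smart enough to work out how.
That's where you come in, StackOverflow. Are you smart enough?
Perhaps this is a solved problem, worked out on paper by some 18th-century mathematician, implemented in LISP on punch cards in 1970, and set as the first homework assignment in Cryptography 101, but my searches have turned up nothing.
I would prefer a solution in JavaScript with a functional style, but any language or style will do, as long as you are not cheating with some big integer library. Bonus points for efficiency, of course.
Please refrain from criticizing the original example. The general skill of solving the problem matters more than any application of the solution.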
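One well-known way to track a remainder while stepping through the digits is schoolbook long division on the digit array. The sketch below is an illustration under the question's preconditions, not an answer taken from this thread: it is neither in place nor single pass, and it costs O(n^2) in the input length, but no intermediate value ever exceeds the product of the two alphabet sizes, so no big integer library is needed.

```javascript
// Convert between alphabets by repeatedly dividing the whole digit array
// by the destination base. Each pass yields one destination digit (the
// remainder), least significant first.
function convert(value, sourceAlphabet, destinationAlphabet) {
  const src = Array.from(sourceAlphabet);
  const dst = Array.from(destinationAlphabet);
  const srcBase = src.length;
  const dstBase = dst.length;

  // Source digits, most significant first.
  let digits = Array.from(value, (c) => src.indexOf(c));
  const out = [];

  while (digits.length > 0) {
    const quotient = [];
    let remainder = 0;
    for (const d of digits) {
      const acc = remainder * srcBase + d;   // always < srcBase * dstBase
      const q = Math.floor(acc / dstBase);
      remainder = acc % dstBase;
      if (quotient.length > 0 || q > 0) quotient.push(q); // skip leading zeros
    }
    out.push(dst[remainder]);
    digits = quotient;
  }
  return out.reverse().join("");
}
```

Note that leading zero digits are dropped, as with ordinary numbers, so a lossless round trip of arbitrary text would also need the original length (or a leading non-zero marker digit).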

1
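The title says "in-place". I don't think that is necessarily possible when moving to an alphabet with fewer characters than the original. - Erick G. Hagstrom
1
It isn't really believed to be possible, though -- if it were, arithmetic decoding would be much easier. - David Eisenstat
1
Sure, I think that is a fine simplification. - Erik R.
1
Encodings like this are already used in certain special forms. A byte can take 2^8 or 256 different values, but fewer than half of them represent printable characters that do not look like a bad car wreck when printed. So Base64 defines a character set of 64 "letters" and splits the bit string into 6-bit chunks instead of the 8 bits of a byte. As one example, you could manually split into 5-bit chunks and use a-z plus the digits 0-5. Your challenge is more general than these special forms, but I think it is possible. - WDS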
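The 5-bit variant mentioned in the comment above can be sketched like this. The names `ALPHABET32`, `encode5`, and `decode5` are hypothetical, and the zero-padding of the final chunk is an assumption of this sketch rather than any standard.

```javascript
// 32 "letters": a-z followed by the digits 0-5.
const ALPHABET32 = "abcdefghijklmnopqrstuvwxyz012345";

// Regroup 8-bit bytes into 5-bit chunks, one letter per chunk.
function encode5(bytes) {
  let buf = 0;   // bit buffer
  let have = 0;  // number of bits currently in buf
  let out = "";
  for (const b of bytes) {
    buf = (buf << 8) | b;
    have += 8;
    while (have >= 5) {
      out += ALPHABET32[(buf >> (have - 5)) & 31];
      have -= 5;
    }
    buf &= (1 << have) - 1;  // keep only the unconsumed bits
  }
  // Pad the last partial chunk with zero bits.
  if (have > 0) out += ALPHABET32[(buf << (5 - have)) & 31];
  return out;
}

// Reverse the regrouping; the padding bits never fill a whole byte.
function decode5(text) {
  let buf = 0;
  let have = 0;
  const out = [];
  for (const c of text) {
    buf = (buf << 5) | ALPHABET32.indexOf(c);
    have += 5;
    if (have >= 8) {
      out.push((buf >> (have - 8)) & 255);
      have -= 8;
      buf &= (1 << have) - 1;
    }
  }
  return out;
}
```

Because both sizes are powers of two, the conversion only needs a small sliding bit buffer; the general case in the question is harder because the alphabet sizes need not be powers of two.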
1
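For your use case, you could interpret the input in reverse order (least significant digit first), right? If so, I think it should work. But this space is too small for the proof :-) - Thomas Mueller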
3 Answers

3
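Here is a very fast solution in C that uses bit-shift operations. It assumes that you know the length of the decoded string. The strings are vectors of integers in the range 0..maximum for each alphabet; it is up to the user to convert these to and from strings with a restricted range of characters. As for the "in-place" in the question title, the source and destination vectors can overlap, but only if the source alphabet is not larger than the destination alphabet.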
/*
  recode version 1.0, 22 August 2015

  Copyright (C) 2015 Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
*/

/* Recode a vector from one alphabet to another using intermediate
   variable-length bit codes. */

/* The approach is to use a Huffman code over equiprobable alphabets in two
   directions.  First to encode the source alphabet to a string of bits, and
   second to encode the string of bits to the destination alphabet. This will
   be reasonably close to the efficiency of base-encoding with arbitrary
   precision arithmetic. */

#include <stddef.h>     // size_t
#include <limits.h>     // UINT_MAX, ULLONG_MAX

#if UINT_MAX == ULLONG_MAX
#  error recode() assumes that long long has more bits than int
#endif

/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
   code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
   *dlen returns the length of the result, which will not exceed the value of
   *dlen when called.  If the original *dlen is not large enough to hold the
   full result, then recode() will return non-zero to indicate failure.
   Otherwise recode() will return 0.  recode() will also return non-zero if
   either of the smax or dmax parameters are less than one.  The non-zero
   return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
   and 3 if any of the elements of source are greater than smax.

   Using this same operation on the result with smax and dmax reversed reverses
   the operation, restoring the original vector.  However there may be more
   symbols returned than the original, so the number of symbols expected needs
   to be known for decoding.  (An end symbol could be appended to the source
   alphabet to include the length in the coding, but then encoding and decoding
   would no longer be symmetric, and the coding efficiency would be reduced.
   This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
           const unsigned *source, size_t slen, unsigned smax)
{
    // compute sbits and scut, with which we will recode the source with
    // sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
    if (smax < 1)
        return 2;
    unsigned sbits = 0;
    unsigned scut = 1;          // 2**sbits
    while (scut && scut <= smax) {
        scut <<= 1;
        sbits++;
    }
    scut -= smax + 1;

    // same thing for dbits and dcut
    if (dmax < 1)
        return 2;
    unsigned dbits = 0;
    unsigned dcut = 1;          // 2**dbits
    while (dcut && dcut <= dmax) {
        dcut <<= 1;
        dbits++;
    }
    dcut -= dmax + 1;

    // recode a base smax+1 vector to a base dmax+1 vector using an
    // intermediate bit vector (a sliding window of that bit vector is kept in
    // a bit buffer)
    unsigned long long buf = 0;     // bit buffer
    unsigned have = 0;              // number of bits in bit buffer
    size_t i = 0, n = 0;            // source and dest indices
    unsigned sym;                   // symbol being encoded
    for (;;) {
        // encode enough of source into bits to encode that to dest
        while (have < dbits && i < slen) {
            sym = source[i++];
            if (sym > smax) {
                *dlen = n;
                return 3;
            }
            if (sym < scut) {
                buf = (buf << (sbits - 1)) + sym;
                have += sbits - 1;
            }
            else {
                buf = (buf << sbits) + sym + scut;
                have += sbits;
            }
        }

        // if not enough bits to assure one symbol, then break out to a special
        // case for coding the final symbol
        if (have < dbits)
            break;

        // encode one symbol to dest
        if (n == *dlen)
            return 1;
        sym = buf >> (have - dbits + 1);
        if (sym < dcut) {
            dest[n++] = sym;
            have -= dbits - 1;
        }
        else {
            sym = buf >> (have - dbits);
            dest[n++] = sym - dcut;
            have -= dbits;
        }
        buf &= ((unsigned long long)1 << have) - 1;
    }

    // if any bits are left in the bit buffer, encode one last symbol to dest
    if (have) {
        if (n == *dlen)
            return 1;
        sym = buf;
        sym <<= dbits - 1 - have;
        if (sym >= dcut)
            sym = (sym << 1) - dcut;
        dest[n++] = sym;
    }

    // return recoded vector
    *dlen = n;
    return 0;
}

/* Test recode(). */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>

// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
    unsigned bits = 0;
    unsigned long long mask = 1;
    while (mask <= max) {
        mask <<= 1;
        bits++;
    }
    mask--;
    unsigned long long ran = 0;
    unsigned have = 0;
    size_t n = 0;
    while (n < len) {
        while (have < bits) {
            ran = (ran << 31) + random();
            have += 31;
        }
        if ((ran & mask) <= max)
            vec[n++] = ran & mask;
        ran >>= bits;
        have -= bits;
    }
}

// Get a valid number from str and assign it to var
#define NUM(var, str) \
    do { \
        char *end; \
        unsigned long val = strtoul(str, &end, 0); \
        var = val; \
        if (*end || var != val) { \
            fprintf(stderr, \
                    "invalid or out of range numeric argument: %s\n", str); \
            return 1; \
        } \
    } while (0)

/* "bet n m len count" generates count test vectors of length len, where each
   entry is in the range 0..n.  Each vector is recoded to another vector using
   only symbols in the range 0..m.  That vector is recoded back to a vector
   using only symbols in 0..n, and that result is compared with the original
   random vector.  Report on the average ratio of input and output symbols, as
   compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
    // get sizes of alphabets and length of test vector, compute maximum sizes
    // of recoded vectors
    unsigned smax, dmax, runs;
    size_t slen, dsize, bsize;
    if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
    NUM(smax, argv[1]);
    NUM(dmax, argv[2]);
    NUM(slen, argv[3]);
    NUM(runs, argv[4]);
    dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
    bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));

    // generate random test vectors, encode, decode, and compare
    srandomdev();
    unsigned source[slen], dest[dsize], back[bsize];
    unsigned mis = 0, i;
    unsigned long long dtot = 0;
    int ret;
    for (i = 0; i < runs; i++) {
        ranvec(source, slen, smax);
        size_t dlen = dsize;
        ret = recode(dest, &dlen, dmax, source, slen, smax);
        if (ret) {
            fprintf(stderr, "encode error %d\n", ret);
            break;
        }
        dtot += dlen;
        size_t blen = bsize;
        ret = recode(back, &blen, smax, dest, dlen, dmax);
        if (ret) {
            fprintf(stderr, "decode error %d\n", ret);
            break;
        }
        if (blen < slen || memcmp(source, back, slen))  // blen > slen is ok
            mis++;
    }
    if (mis)
        fprintf(stderr, "%u/%u mismatches!\n", mis, i);
    if (ret == 0)
        printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
               dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
    return 0;
}

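Could you describe in plain language how your algorithm works, for example when converting a three-digit base-3 string to a base-4 string? - גלעד ברקן
In base 3, the symbols are coded as the bit strings 0, 10, and 11. For example, the vector 0, 1, 2, 1 becomes 0101110. Base 4 simply takes the bits two at a time, which codes to 1, 1, 3, 0. The last symbol is a special case: only one bit is left, so it is shifted up to make two bits. For this short case you still end up with four symbols. For long vectors going from base 3 to base 4, the number of output symbols is 83.3% of the number of input symbols. If this were done with infinite-precision arithmetic, the ratio would be 79.3%. - Mark Adler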
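The worked example in the comment above can be reproduced with a small JavaScript sketch. This is a string-based illustration of the bit codes as described there, not a port of the C function; the helper names `codeParams`, `toBits`, and `fromBits` are hypothetical.

```javascript
// Code parameters for an alphabet 0..max: symbols below `cut` use
// bits-1 bits, all others use `bits` bits with `cut` added.
function codeParams(max) {
  let bits = 0;
  let pow = 1;  // 2**bits
  while (pow <= max) { pow *= 2; bits++; }
  return { bits, cut: pow - (max + 1) };
}

// Source symbols -> string of '0'/'1' characters.
function toBits(symbols, max) {
  const { bits, cut } = codeParams(max);
  return symbols.map((s) => s < cut
    ? s.toString(2).padStart(bits - 1, "0")
    : (s + cut).toString(2).padStart(bits, "0")).join("");
}

// Bit string -> destination symbols, padding the tail with zero bits.
function fromBits(bitString, max) {
  const { bits, cut } = codeParams(max);
  const out = [];
  let pos = 0;
  while (bitString.length - pos >= bits) {
    const short = bits > 1
      ? parseInt(bitString.slice(pos, pos + bits - 1), 2) : 0;
    if (short < cut) {
      out.push(short);
      pos += bits - 1;
    } else {
      out.push(parseInt(bitString.slice(pos, pos + bits), 2) - cut);
      pos += bits;
    }
  }
  if (pos < bitString.length) {
    // Fewer than `bits` bits remain: pad to bits-1, extend if not a short code.
    let sym = parseInt(bitString.slice(pos).padEnd(bits - 1, "0"), 2);
    if (sym >= cut) sym = sym * 2 - cut;
    out.push(sym);
  }
  return out;
}

const bitString = toBits([0, 1, 2, 1], 2);  // "0101110"
const base4 = fromBits(bitString, 3);       // [1, 1, 3, 0]
```

Decoding `base4` back with the alphabets swapped yields 0, 1, 2, 1 followed by one extra symbol, which is why the answer says the expected number of symbols must be known.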

1
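As has been pointed out in other StackOverflow answers, try not to think of summing the digits multiplied by powers of the base as converting to base ten. Think of it instead as directing the computer to generate a representation, in its own terms, of the quantity the number stands for (for most computers, probably something closer to our concept of binary). Once the computer has its own representation of the quantity, we can direct it to output the number any way we like.
By rejecting "big integer" implementations and asking for letter-by-letter conversion, you are at the same time arguing that the numerical/alphabetical representation of a quantity is not what it actually is, namely that each position represents a quantity of digit * base^position. If the nine-millionth character of War and Peace really does represent what you are asking to convert from, then at some point the computer will need to generate a representation for Д * 33^9000000.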

0

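I don't think any solution can work in general, because if n^e is not equal to m for some integer e and some MAX_INT, then there is no way to calculate the value of the target base at a given position p when n^p is greater than MAX_INT.

For the case where n^e is equal to m for some e, you can get away with it, because the problem becomes recursively tractable (the first e digits of n can be summed and converted into the first digit of M, then chopped off, and the process repeated).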
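The power-of-the-base case can be sketched as follows, reading n as the source base and m = n^e as the destination base, which is an interpretation of the answer's notation. The function name `regroup` is hypothetical; digits are plain integers, most significant first.

```javascript
// When the destination base equals srcBase ** e, every group of e source
// digits maps to exactly one destination digit, so no value larger than
// the destination base is ever formed.
function regroup(digits, srcBase, e) {
  // Left-pad with zeros so the length is a multiple of e.
  const pad = (e - (digits.length % e)) % e;
  const padded = new Array(pad).fill(0).concat(digits);
  const out = [];
  for (let i = 0; i < padded.length; i += e) {
    let d = 0;
    for (let j = 0; j < e; j++) d = d * srcBase + padded[i + j];
    out.push(d);
  }
  return out;
}

// Binary 1011010 regrouped four bits at a time: hex digits 5 and 10 (0x5a).
console.log(regroup([1, 0, 1, 1, 0, 1, 0], 2, 4));
```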

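If you don't have this useful property, then eventually you will have to take some part of the original base and try to perform a modulus in n^p, and n^p is going to be greater than MAX_INT, which means it is impossible.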

