计算字典序索引的最高效方法

9
有没有更高效的算法可以完成以下任务?:对于给定的0到7整数的排列,返回描述排列在字典顺序下的索引(从0开始索引,而不是1)。
例如: - 数组0 1 2 3 4 5 6 7应该返回一个索引为0的值。 - 数组0 1 2 3 4 5 7 6应该返回一个索引为1的值。 - 数组0 1 2 3 4 6 5 7应该返回一个索引为2的值。 - 数组1 0 2 3 4 5 6 7应该返回一个索引为5039的值(即7!-1或factorial(7)-1)。 - 数组7 6 5 4 3 2 1 0应该返回一个索引为40319的值(即8!-1)。这是最大可能的返回值。
我的当前代码如下:
int lexic_ix(int* A){
    int value = 0;
    for(int i=0 ; i<7 ; i++){
        int x = A[i];
        for(int j=0 ; j<i ; j++)
            if(A[j]<A[i]) x--;
        value += x*factorial(7-i);  // actual unrolled version doesn't have a function call
    }
    return value;
}

我想知道是否有办法通过移除内部循环来减少操作次数,或者是否可以以任何方式减少条件分支(除了展开-我的当前代码实际上是上述内容的展开版本),还有没有什么巧妙的位运算技巧或肮脏的C语言技巧可以帮助。
我已经尝试过替换。
if(A[j]<A[i]) x--;

使用

x -= (A[j]<A[i]);

我也尝试过

x = A[j]<A[i] ? x-1 : x;

实际上,这两个替换都导致了更糟糕的性能。

在任何人说之前 - 是的,这是一个巨大的性能瓶颈:目前程序运行时间的约61%花费在此函数中,并且不,我不想有预计算值的表格。

除了这些,欢迎任何建议。


2
如果您担心性能问题,您还需要展示函数factorial()的代码。 - Daniel
我实际上没有名为"factorial"的函数。正如提到的那样,我完全展开了循环,这使得我可以在原地编写字面值。 - A Frayed Knot
@Daniel 不是的。阶乘可以通过查找表、模板扩展或者在运行之前预先计算并在每次迭代时除以 N-i 来实现常数时间。 - KevinZ
@KevinZ他可以做任何这些事情,但是除非我们看到函数定义,否则我们不知道。 - Daniel
4个回答

2
不知道这是否有所帮助,但这里有另外一种解决方案:
int lexic_ix(int* A, int n){ //n = last index = number of digits - 1
    int value = 0;
    int x = 0;
    for(int i=0 ; i<n ; i++){
        int diff = (A[i] - x); //pb1
        if(diff > 0)
        {
            for(int j=0 ; j<i ; j++)//pb2
            {
                if(A[j]<A[i] && A[j] > x)
                {
                    if(A[j]==x+1)
                    {
                      x++;
                    }
                    diff--;
                }
            }
            value += diff;
        }
        else
        {
          x++;
        }
        value *= n - i;
    }
    return value;
}

我无法摆脱内部循环,因此在最坏情况下复杂度为o(n log(n)),但在最好情况下为o(n),而您的解决方案在所有情况下都是o(n log(n))。

或者,您可以通过以下方式替换内部循环,以消除一些最坏情况,但需要在内部循环中进行另一个验证:

int j=0;
while(diff>1 && j<i)
{
  if(A[j]<A[i])
  {
    if(A[j]==x+1)
    {
      x++;
    }
    diff--;
  }
  j++;
}

解释

(或者更确切地说,“我是如何得到那段代码的”,我认为它与你的代码并没有太大区别,但可能会给你一些想法。 为了避免混淆,我使用了字符而不是数字,并且只用了四个字符。)

abcd 0  = ((0 * 3 + 0) * 2 + 0) * 1 + 0
abdc 1  = ((0 * 3 + 0) * 2 + 1) * 1 + 0
acbd 2  = ((0 * 3 + 1) * 2 + 0) * 1 + 0
acdb 3  = ((0 * 3 + 1) * 2 + 1) * 1 + 0
adbc 4  = ((0 * 3 + 2) * 2 + 0) * 1 + 0
adcb 5  = ((0 * 3 + 2) * 2 + 1) * 1 + 0 //pb1
bacd 6  = ((1 * 3 + 0) * 2 + 0) * 1 + 0
badc 7  = ((1 * 3 + 0) * 2 + 1) * 1 + 0
bcad 8  = ((1 * 3 + 1) * 2 + 0) * 1 + 0 //First reflexion
bcda 9  = ((1 * 3 + 1) * 2 + 1) * 1 + 0
bdac 10 = ((1 * 3 + 2) * 2 + 0) * 1 + 0
bdca 11 = ((1 * 3 + 2) * 2 + 1) * 1 + 0
cabd 12 = ((2 * 3 + 0) * 2 + 0) * 1 + 0
cadb 13 = ((2 * 3 + 0) * 2 + 1) * 1 + 0
cbad 14 = ((2 * 3 + 1) * 2 + 0) * 1 + 0
cbda 15 = ((2 * 3 + 1) * 2 + 1) * 1 + 0 //pb2
cdab 16 = ((2 * 3 + 2) * 2 + 0) * 1 + 0
cdba 17 = ((2 * 3 + 2) * 2 + 1) * 1 + 0
[...]
dcba 23 = ((3 * 3 + 2) * 2 + 1) * 1 + 0

第一次“反思”:

熵的角度来看,abcd具有最少的“熵”。如果一个字符在一个它“不应该”出现的地方,它就会产生熵,而早期产生的熵越大。

例如,对于bcad,词典索引为8 = ((1 * 3 + 1) * 2 + 0) * 1 + 0,可以这样计算:

value = 0;
value += max(b - a, 0); // = 1; (a "should be" in the first place [to create the less possible entropy] but instead it is b)
value *= 3 - 0; //last index - current index
value += max(c - b, 0); // = 1; (b "should be" in the second place but instead it is c)
value *= 3 - 1;
value += max(a - c, 0); // = 0; (a "should have been" put earlier, so it does not create entropy to put it there)
value *= 3 - 2;
value += max(d - d, 0); // = 0;

Note that the last operation will always do nothing, that's why "i

First problem (pb1) :

For adcb, for example, the first logic doesn't work (it leads to an lexicographic index of ((0* 3+ 2) * 2+ 0) * 1 = 4) because c-d = 0 but it creates entropy to put c before b. I added x because of that, it represents the first digit/character that isn't placed yet. With x, diff cannot be negative. For adcb, lexicographic index is 5 = ((0 * 3 + 2) * 2 + 1) * 1 + 0 and can be calculated that way :

value = 0; x=0;
diff = a - a; // = 0; (a is in the right place)
diff == 0 => x++; //x=b now and we don't modify value
value *= 3 - 0; //last index - current index
diff = d - b; // = 2; (b "should be" there (it's x) but instead it is d)
diff > 0 => value += diff; //we add diff to value and we don't modify x
diff = c - b; // = 1; (b "should be" there but instead it is c) This is where it differs from the first reflexion
diff > 0 => value += diff;
value *= 3 - 2;

Second problem (pb2) :

For cbda, for example, lexicographic index is 15 = ((2 * 3 + 1) * 2 + 1) * 1 + 0, but the first reflexion gives : ((2 * 3 + 0) * 2 + 1) * 1 + 0 = 13 and the solution to pb1 gives ((2 * 3 + 1) * 2 + 3) * 1 + 0 = 17. The solution to pb1 doesn't work because the two last characters to place are d and a, so d - a "means" 1 instead of 3. I had to count the characters placed before that comes before the character in place, but after x, so I had to add an inner loop.

Putting it all together :

I then realised that pb1 was just a particular case of pb2, and that if you remove x, and you simply take diff = A[i], we end up with the unnested version of your solution (with factorial calculated little by little, and my diff corresponding to your x).

So, basically, my "contribution" (I think) is to add a variable, x, which can avoid doing the inner loop when diff equals 0 or 1, at the expense of checking if you have to increment x and doing it if so.

I also checked if you have to increment x in the inner loop (if(A[j]==x+1)) because if you take for example badce, x will be b at the end because a comes after b, and you will enter the inner loop one more time, encountering c. If you check x in the inner loop, when you encounter d you have no choice but doing the inner loop, but x will update to c, and when you encounter c you will not enter the inner loop. You can remove this check without breaking the program

With the alternative version and the check in the inner loop it makes 4 different versions. The alternative one with the check is the one in which you enter the less the inner loop, so in terms of "theoretical complexity" it is the best, but in terms of performance/number of operations, I don't know.

Hope all of this helps (since the question is rather old, and I didn't read all the answers in details). If not, I still had fun doing it. Sorry for the long post. Also I'm new on Stack Overflow (as a member), and not a native speaker, so please be nice, and don't hesitate to let me know if I did something wrong.


0

线性遍历已经在缓存中的内存不需要花费太多时间。不用担心,你在factorial()溢出之前不会遍历足够的距离。

8作为参数移出。

int factorial ( int input )
{
    return input ? input * factorial (input - 1) : 1;
}

int lexic_ix ( int* arr, int N )
{
    int output = 0;
    int fact = factorial (N);
    for ( int i = 0; i < N - 1; i++ )
    {
        int order = arr [ i ];
        for ( int j = 0; j < i; j++ )
            order -= arr [ j ] < arr [ i ];
        output += order * (fact /= N - i);
    }
    return output;
}

int main()
{
    int arr [ ] = { 11, 10, 9, 8, 7 , 6 , 5 , 4 , 3 , 2 , 1 , 0 };

    const int length = 12;
    for ( int i = 0; i < length; ++i )
        std::cout << lexic_ix ( arr + i, length - i  ) << std::endl;
}

感谢您的输入。然而,1 实际上是不正确的。7-i 才是预期的参数。请尝试在数组 [7 6 5 4 3 2 1 0] 上运行您的代码。我希望它会产生40319,但您的代码从未产生超过5039的值(如果我没有弄错的话)。另外,您的第三个观点已经被考虑在内 - 我不检查第8个元素,只检查到索引6(第7个元素)。 - A Frayed Knot
抱歉,我误读了你的限制。不管怎样,我的代码确实针对8个元素进行了反向排序并返回了40319。 - KevinZ
你得到了5039,因为你使用了7而不是8。N应该指定元素的数量。我将编辑它以包括基本的阶乘实现和一些测试。 - KevinZ
哎呀,你说得完全正确 - 所有输出都是正确的。不幸的是,即使展开和编写运行时常量(即N,factorial(N)),这个实现的速度也只有我的原始实现的1/3左右。 - A Frayed Knot

0
对于一个M位数的排列,从你的代码中,你可以得到字典序SN公式,它类似于:Am-1*(m-1)! + Am-2*(m-2)! + ... + A0*(0)!,其中Aj的范围是0到j。你可以从A0*(0)!开始计算SN,然后是A1*(1)!,...,然后是Am-1 * (m-1)!,将它们加在一起(假设你的整数类型不会溢出),因此你不需要递归和重复地计算阶乘。SN号码的范围是0到M!-1(因为Sum(n*n!, n in 0,1, ...n) = (n+1)!-1)。
如果你不是递归地计算阶乘,我想不出任何能够有所改进的方法。
抱歉稍微晚了一点发代码,我刚刚做了一些研究,并找到了这个链接: http://swortham.blogspot.com.au/2011/10/how-much-faster-is-multiplication-than.html 根据这位作者的说法,整数乘法可以比整数除法快40倍。虽然浮点数没有那么戏剧性,但这里是纯整数。
int lexic_ix ( int arr[], int N )
{
    // if this function will be called repeatedly, consider pass in this pointer as parameter
    std::unique_ptr<int[]> coeff_arr = std::make_unique<int[]>(N);
    for ( int i = 0; i < N - 1; i++ )
    {
        int order = arr [ i ];
        for ( int j = 0; j < i; j++ )
            order -= arr [ j ] < arr [ i ];
        coeff_arr[i] = order; // save this into coeff_arr for later multiplication
    }
    // 
    // There are 2 points about the following code:
    // 1). most modern processors have built-in multiplier, \
    //    and multiplication is much faster than division
    // 2). In your code, you are only the maximum permutation serial number,
    //     if you put in a random sequence, say, when length is 10, you put in
    //     a random sequence, say, {3, 7, 2, 9, 0, 1, 5, 8, 4, 6}; if you look into
    //     the coeff_arr[] in debugger, you can see that coeff_arr[] is:
    //     {3, 6, 2, 6, 0, 0, 1, 2, 0, 0}, the last number will always be zero anyway.
    //     so, you will have good chance to reduce many multiplications.
    //     I did not do any performance profiling, you could have a go, and it will be
    //     much appreciated if you could give some feedback about the result.
    //
    long fac = 1;
    long sn = 0;
    for (int i = 1; i < N; ++i) // start from 1, because coeff_arr[N-1] is always 0 
    {
        fac *= i;
        if (coeff_arr[N - 1 - i])
            sn += coeff_arr[N - 1 - i] * fac;
    }
    return sn;
}

int main()
{
    int arr [ ] = { 3, 7, 2, 9, 0, 1, 5, 8, 4, 6 }; // try this and check coeff_arr

    const int length = 10;
    std::cout << lexic_ix(arr, length ) << std::endl;
    return 0;
}

好的,所以你的代码运行了一百万个样本 @ ~0.086秒。我的运行时间是 ~0.060秒。这是一个不错的想法,但实际上它们是相同的代码 - 我的只是逐个计算每个系数,而你的则首先预先计算所有系数。 - A Frayed Knot
谢谢您的反馈,这很有趣。在您的代码中,您对每个系数进行乘法和除法,但在我的代码中,您没有除法,并且因为许多系数都是零,所以可以跳过许多乘法。我想自己进行性能分析。顺便问一下,您能告诉我(最好是在代码中)您是如何进行分析的吗? - dguan
1
我自己进行了分析,结果恰恰相反。我为9位和10位数字生成了所有可能的排列,排列总数分别为362,880和3,628,800。我的代码比你的代码快了20%以上。我猜想你可能只使用了一些特定的数字,比如最大的序列号N!-1。对于这种情况,数字中没有额外的零,因此不会跳过额外的零。我将发布另一个答案,包括测试代码和结果。 - dguan
可能的区别在于,我使用了多次运行100万个样本的测试,这些样本是在一个随机洗牌的8个整数集合中进行的(因为我真正需要性能提升的是8个整数)。我还尝试了您的代码,包括原始版本和展开后内联常量的版本。我正在使用带有-O3标志的g++4.9.0编译器。值得注意的是,我的编译器不认识“make_unique”调用,因为它不是标准的C++11语法,所以我不得不用普通的int数组替换它(因为我对unique ptrs并不了解)。我刚刚重新测试了一下,结果仍然相同。 - A Frayed Knot
此外,我使用了<chrono>库的high_resolution_clock::time_point类型和high_resolution_clock::now()函数来获取当前时间。我尝试对一百万个计时进行采样和求和,并且还尝试在所有一百万个样本运行中进行单次计时,两种方法得到的结果相当。我的台式电脑是一台3.0 gHz四核处理器,拥有8GB RAM和缓存大小为32k、256k和6144k。 - A Frayed Knot
值得一提的是,您所链接的文章有些过时,并且它提到了浮点算术——与整数算术相差很大。对于整数乘法与整数除法而言,在您的处理器上性能差异接近20%。 - A Frayed Knot

0

这是整个性能分析代码,我只在Linux上运行测试,使用G++8.4编译代码,并使用“-std=c++11 -O3”编译器选项。为了公平起见,我稍微改写了您的代码,预先计算了N!并将其传递到函数中,但似乎并没有帮助太多。

N = 9(362,880个排列)的性能分析如下:

  • 时间持续时间为:34、30、25毫秒
  • 时间持续时间为:34、30、25毫秒
  • 时间持续时间为:33、30、25毫秒

N=10(3,628,800个排列)的性能分析如下:

  • 时间持续时间为:345、335、275毫秒
  • 时间持续时间为:348、334、275毫秒
  • 时间持续时间为:345、335、275毫秒

第一个数字是您的原始函数,第二个数字是重写的函数,该函数传递N!,最后一个数字是我的结果。排列生成函数非常原始且运行缓慢,但只要它生成所有排列作为测试数据集,那就没问题了。顺便说一句,这些测试在运行Ubuntu 14.04的四核3.1Ghz,4GBytes桌面上运行。

编辑:我忘记了第一个函数可能需要扩展lexi_numbers向量的因素,因此我在计时之前放了一个空调用。在此之后,时间为333、334、275。

编辑:另一个可能影响性能的因素是,我在代码中使用长整数,如果我将这两个'long'更改为2个'int',运行时间将变为:334、333、264。

#include <iostream>
#include <vector>
#include <chrono>
using namespace std::chrono;

int factorial(int input)
{
    return input ? input * factorial(input - 1) : 1;
}

int lexic_ix(int* arr, int N)
{
    int output = 0;
    int fact = factorial(N);
    for (int i = 0; i < N - 1; i++)
    {
        int order = arr[i];
        for (int j = 0; j < i; j++)
            order -= arr[j] < arr[i];
        output += order * (fact /= N - i);
    }
    return output;
}

int lexic_ix1(int* arr, int N, int N_fac)
{
    int output = 0;
    int fact = N_fac;
    for (int i = 0; i < N - 1; i++)
    {
        int order = arr[i];
        for (int j = 0; j < i; j++)
            order -= arr[j] < arr[i];
        output += order * (fact /= N - i);
    }
    return output;
}

int lexic_ix2( int arr[], int N , int coeff_arr[])
{
    for ( int i = 0; i < N - 1; i++ )
    {
        int order = arr [ i ];
        for ( int j = 0; j < i; j++ )
            order -= arr [ j ] < arr [ i ];
        coeff_arr[i] = order;
    }
    long fac = 1;
    long sn = 0;
    for (int i = 1; i < N; ++i)
    {
        fac *= i;
        if (coeff_arr[N - 1 - i])
            sn += coeff_arr[N - 1 - i] * fac;
    }
    return sn;
}

std::vector<std::vector<int>> gen_permutation(const std::vector<int>& permu_base)
{
    if (permu_base.size() == 1)
        return std::vector<std::vector<int>>(1, std::vector<int>(1, permu_base[0]));

    std::vector<std::vector<int>> results;
    for (int i = 0; i < permu_base.size(); ++i)
    {
        int cur_int = permu_base[i];
        std::vector<int> cur_subseq = permu_base;
        cur_subseq.erase(cur_subseq.begin() + i);
        std::vector<std::vector<int>> temp = gen_permutation(cur_subseq);
        for (auto x : temp)
        {
            x.insert(x.begin(), cur_int);
            results.push_back(x);
        }
    }
    return results;
}

int main()
{
    #define N 10
    std::vector<int> arr;
    int buff_arr[N];
    const int length = N;
    int N_fac = factorial(N);
    for(int i=0; i<N; ++i)
        arr.push_back(N-i-1); // for N=10, arr is {9, 8, 7, 6, 5, 4, 3, 2, 1, 0}
    std::vector<std::vector<int>> all_permus = gen_permutation(arr);

    std::vector<int> lexi_numbers;
    // This call is not timed, only to expand the lexi_numbers vector 
    for (auto x : all_permus)
        lexi_numbers.push_back(lexic_ix2(&x[0], length, buff_arr));

    lexi_numbers.clear();
    auto t0 = high_resolution_clock::now();
    for (auto x : all_permus)
        lexi_numbers.push_back(lexic_ix(&x[0], length));
    auto t1 = high_resolution_clock::now();
    lexi_numbers.clear();
    auto t2 = high_resolution_clock::now();
    for (auto x : all_permus)
        lexi_numbers.push_back(lexic_ix1(&x[0], length, N_fac));
    auto t3 = high_resolution_clock::now();
    lexi_numbers.clear();
    auto t4 = high_resolution_clock::now();
    for (auto x : all_permus)
        lexi_numbers.push_back(lexic_ix2(&x[0], length, buff_arr));
    auto t5 = high_resolution_clock::now();

std::cout << std::endl << "Time durations are: " << duration_cast<milliseconds> \
    (t1 -t0).count() << ", " << duration_cast<milliseconds>(t3 - t2).count() << ", " \
        << duration_cast<milliseconds>(t5 - t4).count() <<" milliseconds" << std::endl;
    return 0;
}

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接