Java随机字符串生成和生日悖论

6

我需要编写一个随机字符串生成类,该类可以从由数字和一些字母(10 + 26-5,即去掉五个元音字母)组成的31个字符集中生成7个字符的字符串。简单的数学计算表明,共有31^7种可能的组合,大约为27.5亿。 我对生日悖论存在疑问,我做了一些测试,发现重复数量呈指数增长。我能否采取措施避免这种情况的发生?

At 1 million, duplicates encountered till now = 19
At 2 million, duplicates encountered till now = 69
At 3 million, duplicates encountered till now = 157
At 4 million, duplicates encountered till now = 280
At 5 million, duplicates encountered till now = 470
At 6 million, duplicates encountered till now = 662
At 7 million, duplicates encountered till now = 896
At 8 million, duplicates encountered till now = 1185
At 9 million, duplicates encountered till now = 1500
At 10 million, duplicates encountered till now = 1823
At 11 million, duplicates encountered till now = 2204
At 12 million, duplicates encountered till now = 2584
At 13 million, duplicates encountered till now = 3020
At 14 million, duplicates encountered till now = 3527
At 15 million, duplicates encountered till now = 4110
At 16 million, duplicates encountered till now = 4683
At 17 million, duplicates encountered till now = 5284
At 18 million, duplicates encountered till now = 5919
At 19 million, duplicates encountered till now = 6611
At 20 million, duplicates encountered till now = 7343
At 21 million, duplicates encountered till now = 8095
At 22 million, duplicates encountered till now = 8858
At 23 million, duplicates encountered till now = 9707
At 24 million, duplicates encountered till now = 10547
At 25 million, duplicates encountered till now = 11452
At 26 million, duplicates encountered till now = 12399
At 27 million, duplicates encountered till now = 13356
At 28 million, duplicates encountered till now = 14393
At 29 million, duplicates encountered till now = 15369
At 30 million, duplicates encountered till now = 16436

以下是测试类:

import java.util.Set;

import org.apache.commons.lang3.RandomStringUtils;

import com.google.common.collect.Sets;

public class RandomUnivmylocaL {

    public static void main(String[] argv) {

        final int million = 1_000_000;

        final int iterations = 30;
        // 31 chars
        final char[] charArr = new char[] { '1', '2', '3', '4', '5', '6', '7', 
                '8', '9', '0', 'B', 'C', 'D', 'F', 'G', 'H', 'J', 'K', 'L',
                'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'X', 'Y', 'Z' };
        // System.out.println(charArr.length);

        final Set<String> set = Sets.newHashSetWithExpectedSize(
                iterations * million);

        for (int i = 0; i < iterations; i++) {
            for (int j = 0; j < million; j++) {
                final String univCode = RandomStringUtils.random(7, charArr);
                set.add(univCode);
            }
            System.out.println("At " + (i + 1) + " million, " +
                    "duplicates encountered till now = " + 
                    (((i + 1) * million) - set.size()));
        }
        System.out.println("done");
    }
}
1个回答

1
这是生日悖论。
sqrt(27.5bn) = 165831,大M的生日悖论公式为1.177*sqrt(M),因此当你生成约200,000个时,出现问题的可能性为50/50,在到达一百万时,你会有大约18个问题,等等。
生日问题 - 大约需要多少次才会发生碰撞 - 约为200,000。 http://www.wolframalpha.com/input/?i=n%3D200000,+m%3D(31%5E7),+n+-+m(1+-+((m-1)%2Fm)%5En)

将n设置为23.0,将m设置为365,即可看到房间里有23个人时的等效情况。 http://www.wolframalpha.com/input/?i=n%3D23.0,+m%3D365,+n+-+m(1+-+((m-1)%2Fm)%5En)

您可以看到在大数情况下模拟结果与预期答案的接近程度。

http://www.wolframalpha.com/input/?i=n%3D30e6,+m%3D(31%5E7),+n+-+m(1+-+((m-1)%2Fm)%5En)

这篇Quora文章很好。http://www.quora.com/How-can-I-calculate-the-expected-number-of-cache-hits

所以你需要增加允许字符数或使用更长的字符串。 或者 - 使用7位数字,只需递增计数器。 或者使用随机数,并检查是否以前使用过,当你厌倦寻找新数字时重置。

还有伪随机生成器可以覆盖空间,而不必重新命中。 7个字符不足以使您的解决方案安全。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接