Generate N "random" strings of length K using a probability table

Question

python string random probability python-itertools

4

How can I create N "random" strings of length K using a probability table? K should be an even number.

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}

Assuming K = 6, the string 'acacab' should be more probable than 'aaaaaa'.
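
As a quick check of that claim, here is a minimal sketch that computes the probability of a given string. It assumes each two-character pair is drawn independently from prob_table (that independence is one reading of the question, not something it states), and string_probability is a hypothetical helper name:

```python
import math

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}

def string_probability(s, table, pair_len=2):
    # Split the string into consecutive pairs and multiply their probabilities.
    # Assumes independent draws and keys that are all pair_len characters long.
    pairs = [s[i:i + pair_len] for i in range(0, len(s), pair_len)]
    return math.prod(table.get(p, 0.0) for p in pairs)

print(string_probability('acacab', prob_table))  # 0.5 * 0.5 * 0.3, about 0.075
print(string_probability('aaaaaa', prob_table))  # 0.2 ** 3, about 0.008
```

Under that assumption 'acacab' is roughly nine times as likely as 'aaaaaa'.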

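This is a sub-problem of a larger one: I am using it to generate synthetic sequences based on a probability table. I am not sure how to use the probability table to generate the "random" strings.

Here is what I have so far: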

def seq_prob(fprob_table,K= 6, N= 10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    #possibly using itertools or random to generate the semi-"random" strings based on the probabilities 
    return seq_list

- nfs

This is a great question; random model sequences would be very useful! - O.rka

4 Answers

2

You can use random.random:

from random import random
def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    s = ""
    while len(seq_list) < N:
        for k, v in fprob_table.items():
            if len(s) == K:
                seq_list.append(s)
                s = ""
                break
            rn = random()
            if rn <=  v:
                s += k
    return seq_list

This can be improved further, but random.random is very useful when working with probabilities.
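
One thing worth noting about the loop above: it tests every key independently against its own probability, so a single pass over the table can append several pairs or none at all. By my arithmetic, with the example table the first pair comes out as 'aa' about 28% of the time (0.2 / 0.72) rather than 20%. A minimal sketch of a variant that makes exactly one weighted draw per pair, by comparing a single random number against running cumulative totals, might look like this (seq_prob_cumulative is a hypothetical name, and keys are assumed to be the same length):

```python
import random

def seq_prob_cumulative(fprob_table, K=6, N=10):
    # Hypothetical variant: one random number per pair, compared against
    # running cumulative totals, so exactly one key is chosen per draw.
    items = list(fprob_table.items())
    total = sum(prob for _, prob in items)
    seq_list = []
    for _ in range(N):
        s = ""
        while len(s) < K:
            rn = random.random() * total  # scale in case the probabilities do not sum to 1
            cumulative = 0.0
            for key, prob in items:
                cumulative += prob
                if rn < cumulative:
                    s += key
                    break
            else:
                # Guard against floating-point rounding in the running total.
                s += items[-1][0]
        seq_list.append(s)
    return seq_list

print(seq_prob_cumulative({'aa': 0.2, 'ab': 0.3, 'ac': 0.5}))
```

Because the thresholds are cumulative, the order of the table entries does not affect the resulting distribution.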

- Padraic Cunningham

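I like this approach better than the list-building one I did earlier. However, I think you need to make sure the probabilities are sorted. You could try something like: ordered_probs = sorted((prob, char_pair) for char_pair, prob in fprob_table.items()). - monkut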

1

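I'm sure there is a cleaner/better way, but here is a simple way to do it.

Here we fill pick_list with 100 character-pair values, with the count of each pair determined by its probability. In this case pick_list contains 20 'aa' entries, 30 'ab' entries and 50 'ac' entries. random.choice(pick_list) then picks one entry from the list uniformly at random.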

import random

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}


def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input

    # fill list with number of items based on the probabilities
    pick_list = []
    for key, prob in fprob_table.items():
        # round() rather than int(), so that a product such as 28.999... is not truncated to 28
        pick_list.extend([key] * round(prob * 100))

    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    for i in range(N):
        sub_seq = "".join(random.choice(pick_list) for _ in range(K // 2))
        seq_list.append(sub_seq)
    return seq_list

With the result:

 seq_prob(prob_table)
['ababac',
 'aaacab',
 'aaaaac',
 'acacac',
 'abacac',
 'acaaac',
 'abaaab',
 'abaaab',
 'aaabaa',
 'aaabaa']
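
On Python 3.6 and later, random.choices accepts weights directly, so the pick list does not have to be built and probabilities are not limited to whole percentages. A minimal sketch of the same idea, assuming every key is two characters long (seq_prob_choices is a hypothetical name):

```python
import random

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}

def seq_prob_choices(fprob_table, K=6, N=10):
    # Hypothetical variant of the function above using random.choices (Python 3.6+).
    keys = list(fprob_table)
    weights = list(fprob_table.values())
    pairs_per_seq = K // 2  # assumes every key is two characters long
    return ["".join(random.choices(keys, weights=weights, k=pairs_per_seq))
            for _ in range(N)]

print(seq_prob_choices(prob_table))
```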

- monkut

0

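If your table or sequences are large, numpy may help, since it can be significantly faster. numpy is also well suited to this kind of problem, and the approach is easy to follow and takes only 3 or 4 lines.

The basic idea is to convert the probabilities to cumulative probabilities, i.e. map (.2, .5, .3) to (.2, .7, 1.), and then generate random numbers from a flat distribution between 0 and 1. These fall into the bins of the cumulative sum with frequencies that match the weights. Numpy's searchsorted can be used to quickly find which bin each random value falls in. That is,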

import numpy as np

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
N = 10
k = 3   # number of strings (not number of characters)

rvals = np.random.random((N, k))         # generate a bunch of random values
probs = list(prob_table.values())   # list() is needed on Python 3, where values() returns a view
string_indices = np.searchsorted(np.cumsum(probs), rvals)   # weighted indices
x = np.array(list(prob_table.keys()))[string_indices]     # get the strings associated with the indices
y = ["".join(row) for row in x]    # convert each row of pairs to one string

# y = ['acabab', 'acacab', 'acabac', 'aaacaa', 'acabac', 'acacab', 'acabaa', 'aaabab', 'abacac', 'aaabab']

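Here I use k for the number of strings needed rather than K for the number of characters, because the problem statement is ambiguous about strings versus characters.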
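
To see the binning step on concrete numbers, here is a small sketch that uses the (.2, .5, .3) weights from the explanation above and a few hand-picked values in place of real random draws:

```python
import numpy as np

probs = np.array([0.2, 0.5, 0.3])    # the example weights from the explanation
cumulative = np.cumsum(probs)        # approximately [0.2, 0.7, 1.0]

# Hand-picked stand-ins for random values in [0, 1), so the result is easy to follow.
rvals = np.array([0.1, 0.5, 0.69, 0.95])
bins = np.searchsorted(cumulative, rvals)
print(bins)  # [0 1 1 2]
```

Values up to 0.2 land in bin 0, values between 0.2 and 0.7 in bin 1, and the rest in bin 2, so each bin is hit in proportion to its weight.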

- tom10

- Blckknght · Accepted Answer

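There are some good ways to do weighted random selection described at the end of the documentation for Python's built-in random module:

A common task is to make a random choice with weighted probabilities.

If the weights are small integer ratios, a simple technique is to build a sample population with repeats: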

>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'

A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate(), and then locate the random value with bisect.bisect():

>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'

To adapt this to your specific problem, I'd do this:

import random
import itertools
import bisect

def seq_prob(fprob_table, K=6, N=10):
    choices, weights = zip(*fprob_table.items())
    cumdist = list(itertools.accumulate(weights))

    results = []
    for _ in range(N):
        s = ""
        while len(s) < K:
            x = random.random() * cumdist[-1]
            s += choices[bisect.bisect(cumdist, x)]
        results.append(s)

    return results

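This assumes that the key strings in your probability table are all the same length. If they have several different lengths, this code will sometimes (perhaps most of the time!) give answers longer than K characters. I suppose it also assumes that K is an exact multiple of the key length, though it will actually still work if that's not the case (it will just give result strings that are all longer than K characters, since there's no way to get exactly K).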
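
If you would rather fail loudly than get over-long strings, one option is to check both assumptions up front. This is only a sketch under those assumptions; seq_prob_checked and its error messages are hypothetical, and the sampling itself is the same accumulate/bisect technique:

```python
import bisect
import itertools
import random

def seq_prob_checked(fprob_table, K=6, N=10):
    # Hypothetical wrapper: validate the two assumptions before sampling.
    key_lengths = {len(key) for key in fprob_table}
    if len(key_lengths) != 1:
        raise ValueError("all keys must have the same length")
    key_len = key_lengths.pop()
    if K % key_len != 0:
        raise ValueError("K must be a multiple of the key length")

    choices, weights = zip(*fprob_table.items())
    cumdist = list(itertools.accumulate(weights))

    def draw():
        # One weighted draw: locate a scaled random value in the cumulative weights.
        x = random.random() * cumdist[-1]
        return choices[bisect.bisect(cumdist, x)]

    draws_per_string = K // key_len
    return ["".join(draw() for _ in range(draws_per_string)) for _ in range(N)]

print(seq_prob_checked({'aa': 0.2, 'ab': 0.3, 'ac': 0.5}))
```

With the checks in place, every returned string is exactly K characters long.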