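I'm writing a Python script that picks 1000 names at random from the list of male first names available here: http://www.census.gov/genealogy/www/data/1990surnames/names_files.html. That works fine, but I would like the names to be selected according to the probability column in the census text file (the second column). I have spent the last few hours trying to figure this out without making any real progress, and searching for other answers hasn't helped either. Can anyone help me or point me in the right direction? Thanks in advance :)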
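Assign each name its relative probability, so that all the probabilities sum to 1. This relative value is called the "weight".
Pick a random number between 0 and 1.
Walk through the list, subtracting each item's weight from that number as you go.
When you reach 0 or below, pick the current item.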
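The steps above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the function name `weighted_pick`, the `rng` parameter, and the sample names and weights are all made up for the example (they are not real census figures), and the weights are assumed to be positive. Instead of normalizing every weight so they sum to 1, the sketch scales the random number by the total weight, which has the same effect.

```python
import random

def weighted_pick(items, weights, rng=random):
    """Pick one item by the subtract-the-weights scan (weights assumed positive)."""
    # Scaling the draw by the total is equivalent to normalizing the weights to sum to 1.
    remaining = rng.random() * sum(weights)
    for item, weight in zip(items, weights):
        remaining -= weight
        if remaining <= 0:
            return item
    # Floating-point round-off can leave a tiny positive remainder; fall back to the last item.
    return items[-1]

# Hypothetical sample data, for illustration only.
names = ["JAMES", "JOHN", "ROBERT"]
weights = [3.0, 2.0, 1.0]

selected = [weighted_pick(names, weights) for _ in range(1000)]
print(selected[:5])
```

Each pick scans the list from the start, so this is linear in the number of names per draw. That is fine for 1000 draws, but a cumulative-probability lookup scales better.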
import bisect
import random
from urllib.request import urlopen  # Python 3 replacement for urllib2

url = 'http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'
names, cumprobs = [], []
with urlopen(url) as response:
    for line in response:
        # columns: name, probability, cumulative probability, rank
        name, prob, cumprob, rank = line.decode('ascii').split()
        names.append(name)
        cumprobs.append(float(cumprob))

# normalize the cumulative probabilities to the range [0, 1]
cumprobs = [p / cumprobs[-1] for p in cumprobs]

# Generate 1000 names at random, using the cumulative probability distribution
N = 1000
selected = [names[bisect.bisect(cumprobs, random.random())] for _ in range(N)]
print('\n'.join(selected))
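import random

filename = r"location/of/file"
data = []  # accumulator: each name repeated in proportion to its probability
with open(filename) as in_:
    for line in in_:
        name, prob, *_ = line.split()
        # round() rather than int(): float error can leave the product just
        # below a whole number, which int() would truncate
        data.extend([name] * round(float(prob) * 1000))
print(random.choice(data))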
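If building the expanded list is a concern, one possible alternative is `random.choices` from the standard library (Python 3.6 and later), which accepts weights directly and draws with replacement. The sketch below is only an illustration under stated assumptions: the names and weights are placeholder values, not real census figures, and in practice they would be read from the file's first two columns.

```python
import random

# Placeholder data for illustration; in practice read these from the census file.
names = ["JAMES", "JOHN", "ROBERT"]
weights = [3.0, 2.0, 1.0]  # relative weights; they do not need to sum to 1

# Draw 1000 names with replacement, each chosen in proportion to its weight.
selected = random.choices(names, weights=weights, k=1000)
print(len(selected))
```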