Return a random word from a word list in Python

Question

6

I would like to retrieve a random word from a file using Python, but I don't believe my approach is the best or the most efficient. Please help.

import fileinput
import _random
file = [line for line in fileinput.input("/etc/dictionaries-common/words")]
rand = _random.Random()
print file[int(rand.random() * len(file))],

- kzh

2

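Note that while open() should be used in most cases, file() is still a Python built-in (in Python 2.x) and probably shouldn't be used as a variable name. - Kenan Banks

Most of these solutions don't work on Python 3. - kzh

8 Answers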

9

Another solution is to use linecache.getline.

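import linecache
import random
# total_num_lines: the number of lines in the file, computed beforehand
# linecache.getline() counts lines from 1, and randint() includes both ends
line_number = random.randint(1, total_num_lines)
linecache.getline('/etc/dictionaries-common/words', line_number)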

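From the documentation:

The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.

Edit: You can compute the total once and store it, since the dictionary file is unlikely to change.

- Nadia Alramli

How do I find the total number of lines with this approach? - kzh

You can compute the total and store it, since the dictionary file is unlikely to change. - Nadia Alramli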
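
One possible way to obtain that total is a single counting pass whose result is kept around. This is only a sketch: the helper names count_lines and random_line are made up for illustration, and it assumes the file does not change between calls.

```python
import linecache
import random


def count_lines(filename):
    """Count the lines in `filename` with one pass over the file."""
    with open(filename) as f:
        return sum(1 for _ in f)


def random_line(filename, total_num_lines):
    """Return a random line; linecache.getline() counts lines from 1."""
    line_number = random.randint(1, total_num_lines)  # both ends inclusive
    return linecache.getline(filename, line_number)
```

Call count_lines() once, store the result, and pass it to random_line() on every later lookup.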

9

>>> import random
>>> random.choice(list(open('/etc/dictionaries-common/words')))
'jaundiced\n'

This implementation is efficient in terms of human time.

By the way, your implementation matches the one in the stdlib's random.py:

 def choice(self, seq):
    """Choose a random element from a non-empty sequence."""
    return seq[int(self.random() * len(seq))]

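Measuring time performance

I was curious about the relative performance of the proposed solutions. The linecache-based approach is the obvious favorite. How much slower is the random.choice one-liner than the honest algorithm implemented in select_random_line()?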

# nadia_known_num_lines   9.6e-06 seconds 1.00
# nadia                   0.056 seconds 5843.51
# jfs                     0.062 seconds 1.10
# dcrosta_no_strip        0.091 seconds 1.48
# dcrosta                 0.13 seconds 1.41
# mark_ransom_no_strip    0.66 seconds 5.10
# mark_ransom_choose_from 0.67 seconds 1.02
# mark_ransom             0.69 seconds 1.04

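Each function was called 10 times (so the timings reflect cached performance). The last column is the ratio to the time on the preceding line.

These results show that, in this case, the simple solution (dcrosta) is faster than the more careful one (mark_ransom).

The code used for the comparison (as a gist):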

import linecache
import random
from timeit import default_timer


WORDS_FILENAME = "/etc/dictionaries-common/words"


def measure(func):
    measure.func_to_measure.append(func)
    return func
measure.func_to_measure = []


@measure
def dcrosta():
    words = [line.strip() for line in open(WORDS_FILENAME)]
    return random.choice(words)


@measure
def dcrosta_no_strip():
    words = [line for line in open(WORDS_FILENAME)]
    return random.choice(words)


def select_random_line(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        if random.randint(0, count) == 0:
            selection = line.strip()
        count = count + 1  # count every line, not only the selected ones
    return selection


@measure
def mark_ransom():
    return select_random_line(WORDS_FILENAME)


def select_random_line_no_strip(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        if random.randint(0, count) == 0:
            selection = line
        count = count + 1  # count every line, not only the selected ones
    return selection


@measure
def mark_ransom_no_strip():
    return select_random_line_no_strip(WORDS_FILENAME)


def choose_from(iterable):
    """Choose a random element from a finite `iterable`.

    If `iterable` is a sequence then use `random.choice()` for efficiency.

    Return tuple (random element, total number of elements)
    """
    selection, i = None, None
    for i, item in enumerate(iterable):
        if random.randint(0, i) == 0:
            selection = item

    return selection, (i+1 if i is not None else 0)


@measure
def mark_ransom_choose_from():
    return choose_from(open(WORDS_FILENAME))


@measure
def nadia():
    global total_num_lines
    total_num_lines = sum(1 for _ in open(WORDS_FILENAME))

    line_number = random.randint(1, total_num_lines)  # getline() is 1-based
    return linecache.getline(WORDS_FILENAME, line_number)


@measure
def nadia_known_num_lines():
    line_number = random.randint(1, total_num_lines)  # getline() is 1-based
    return linecache.getline(WORDS_FILENAME, line_number)


@measure
def jfs():
    return random.choice(list(open(WORDS_FILENAME)))


def timef(func, number=1000, timer=default_timer):
    """Return number of seconds it takes to execute `func()`."""
    start = timer()
    for _ in range(number):
        func()
    return (timer() - start) / number


def main():
    # measure time
    times = dict((f.__name__, timef(f, number=10))
                 for f in measure.func_to_measure)

    # print from fastest to slowest
    maxname_len = max(map(len, times))
    last = None
    for name in sorted(times, key=times.__getitem__):
        print("%s %4.2g seconds %.2f" % (name.ljust(maxname_len), times[name],
                                         last and times[name] / last or 1))
        last = times[name]


if __name__ == "__main__":
    main()

- jfs

3

This is my Pythonization of an answer to "What's the best way to return a random line in a text file using C?":

import random

def select_random_line(filename):
    selection = None
    count = 0
    for line in open(filename, "r"):
        if random.randint(0, count) == 0:
            selection = line.strip()
        count = count + 1
    return selection

print(select_random_line("/etc/dictionaries-common/words"))

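Edit: My original answer used readlines, which didn't work the way I thought it did and was entirely unnecessary. This version iterates over the file instead of reading it all into memory, and does so in a single pass, which should make it more efficient than any answer seen so far.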
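
Why a single pass is enough: line number k (counting from 1) replaces the current selection with probability 1/k, so after n lines every line has had the same 1/n chance of surviving. The following is a rough empirical check rather than a proof; pick_one and tally are names made up for this illustration, and the check runs on an in-memory list instead of a file.

```python
import random
from collections import Counter


def pick_one(iterable):
    """Single-pass uniform choice, same idea as select_random_line()."""
    selection = None
    for count, item in enumerate(iterable):
        # The item at index `count` replaces the selection with
        # probability 1 / (count + 1).
        if random.randint(0, count) == 0:
            selection = item
    return selection


def tally(items, trials=10000):
    """Count how often each item is picked over `trials` runs."""
    return Counter(pick_one(iter(items)) for _ in range(trials))
```

Calling tally(["a", "b", "c", "d"]) should give four counts that are each close to a quarter of the trials.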

Generic version

import random

def choose_from(iterable):
    """Choose a random element from a finite `iterable`.

    If `iterable` is a sequence then use `random.choice()` for efficiency.

    Return tuple (random element, total number of elements)
    """
    selection, i = None, None
    for i, item in enumerate(iterable):
        if random.randint(0, i) == 0:
            selection = item

    return selection, (i+1 if i is not None else 0)

Examples

print(choose_from(open("/etc/dictionaries-common/words")))
print(choose_from(dict(a=1, b=2)))
print(choose_from(i for i in range(10) if i % 3 == 0))
print(choose_from(i for i in range(10) if i % 11 == 0 and i)) # empty
print(choose_from([0])) # one element
chunk, n = choose_from(urllib2.urlopen("http://google.com")) # needs import urllib2 (Python 2 only)
print((chunk[:20], n))

Output

('yeps\n', 98569)
('a', 2)
(6, 4)
(None, 0)
(0, 1)
('window._gjp && _gjp(', 10)

- Mark Ransom

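Maintaining an index by incrementing it isn't very Pythonic. May I suggest: for count, line in enumerate(file(filename, "r")): - recursive

I've never used enumerate, but it looks like a good suggestion. Thanks. One unexpected benefit of posting on StackOverflow is learning something new. - Mark Ransom

I've added a generic version that works with any finite iterable. - jfs

Mark, without measurements it's hard to tell which version (among all the answers) is faster. - jfs

J.F., thanks for extending and improving my answer. You're certainly right about predicting speed; measurement is the gold standard, but one can always make an educated guess. - Mark Ransom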

1

My guess is that my guesses about performance are usually wrong. I've added some measurements to my answer https://dev59.com/MUnSa4cB1Zd3GeqPOHQ5#1457124 - jfs

1

You can do this without using fileinput:

import random
data = open("/etc/dictionaries-common/words").readlines()
print(random.choice(data))

I also used data rather than file, because file is a predefined type in Python.

- Greg Hewgill

1

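I don't have code for you, but here is an algorithm:

Find the size of the file
Do a random seek with seek()
Find the next (or previous) whitespace character
Return the word that starts after that whitespace character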
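
The steps above might be sketched as follows. This is an illustration under several assumptions, not a reference implementation: the function name random_word_by_seek is made up, the file is taken to be non-empty, UTF-8 encoded, and to hold one word per line (so the whitespace searched for is the newline), and a seek that lands in the last line wraps around to the first. As a comment on this answer points out, the result is not uniform.

```python
import os
import random


def random_word_by_seek(filename):
    """Jump to a random byte and return the next whole word.

    Not uniform: a word that follows a long word is picked more often.
    """
    size = os.stat(filename).st_size
    with open(filename, "rb") as f:
        f.seek(random.randrange(size))
        f.readline()            # discard the (probably partial) current line
        line = f.readline()     # the next complete line
        if not line:            # ran off the end: wrap to the first line
            f.seek(0)
            line = f.readline()
    return line.decode("utf-8").strip()
```

Only two lines are ever read, so the cost does not grow with the size of the file.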

- Jason Christa

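How do I find the size of a file in Python? This would certainly be more efficient. - kzh

1

os.stat(path).st_size. Note, however, that this method isn't entirely "fair": words that follow long words are more likely to be chosen. - bobince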

0

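Efficiency and verbosity aren't the same thing in this case. It's tempting to go for the prettiest, most Pythonic approach that does everything in one or two lines, but for file I/O, stick with the classic fopen-style, low-level interaction, even if it takes a few more lines of code.

I could copy and paste some code and claim it as my own (others can if they want), but take a look at this: http://mail.python.org/pipermail/tutor/2007-July/055635.html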

- Oli

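Picking a random point in the file biases your selection toward longer words. For example, "antidisestablishmentarianism" (or the word after it, depending on your implementation) will come up 28 times as often as the word that follows "a". - Anthony Towns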

0

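There are a few different ways to optimize this. You can optimize for speed or for space.

If you want a fast but memory-hungry solution, read the whole file with file.readlines() and then use random.choice().

If you want a memory-efficient solution, first count the lines in the file by calling somefile.readline() repeatedly until it returns "", then generate a random number smaller than the line count (say n), seek back to the beginning of the file, and finally call somefile.readline() n times. The next call to somefile.readline() will return the desired random line. This approach wastes no memory holding "unnecessary" lines. Of course, if you plan to fetch many random lines from the file, this is very inefficient, and you are better off keeping the whole file in memory as in the first approach.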
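
The memory-efficient variant could look roughly like this. It is a sketch only; the function name random_line_two_pass is made up for illustration, and returning None for an empty file is a choice made here, not part of the description above.

```python
import random


def random_line_two_pass(filename):
    """Pick a line uniformly while holding only one line in memory."""
    with open(filename) as f:
        # First pass: count lines by calling readline() until it returns "".
        total = 0
        while f.readline() != "":
            total += 1
        if total == 0:
            return None
        n = random.randrange(total)  # 0 <= n < total
        # Second pass: rewind, skip n lines, return the next one.
        f.seek(0)
        for _ in range(n):
            f.readline()
        return f.readline()
```

Each call reads the file up to twice, which is why this only pays off when random lines are needed rarely.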

- pafcu

You could also cache just the positions of the newlines in the file, which would let you jump to a specific line with a single seek. - Bugmaster
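
A minimal sketch of that idea, assuming a non-empty UTF-8 file that does not change after the index is built. The names build_line_offsets and random_line_at are hypothetical, and the index stores where each line starts (the byte right after the previous newline) rather than the newline positions themselves.

```python
import random


def build_line_offsets(filename):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    position = 0
    with open(filename, "rb") as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets


def random_line_at(filename, offsets):
    """Jump to a randomly chosen line start with a single seek."""
    with open(filename, "rb") as f:
        f.seek(random.choice(offsets))
        return f.readline().decode("utf-8")
```

The index costs one integer per line, which is usually far less memory than the lines themselves.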

- dcrosta · Accepted Answer

The random module defines choice(), which does what you want:

import random

words = [line.strip() for line in open('/etc/dictionaries-common/words')]
print(random.choice(words))

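Note that this assumes each word in the file is on a line by itself. If the file is very large, or if you do this often, you may find that repeatedly re-reading the file hurts your application's performance.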
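
If re-reading does become a problem, one option is to load the list once and reuse it. The sketch below is an assumption-laden illustration: load_words, random_word and the module-level _words_cache are made-up names, and it assumes the file does not change while the program runs.

```python
import random

_words_cache = {}  # filename -> list of words, filled on first use


def load_words(filename):
    """Read the word list once per file and reuse it on later calls."""
    if filename not in _words_cache:
        with open(filename) as f:
            _words_cache[filename] = [line.strip() for line in f]
    return _words_cache[filename]


def random_word(filename="/etc/dictionaries-common/words"):
    """Return a random word without re-reading the file each time."""
    return random.choice(load_words(filename))
```

The first call pays the full cost of reading the file; later calls only pay for random.choice().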