从文件中随机选择行

Question

从文件中随机选择行

pythonperlshell

5

我有一堆文件，每个文件都有五行标题。在文件的其余部分，每对行形成一个条目。我需要从这些文件中随机选择条目。如何选择随机文件和随机条目（一对行，不包括标题）？

- AlgoMan

2

这些文件有多大？ - Felix Kling

1

头文件中是否告诉您文件中的条目数？在给定文件中，这些条目的大小是否固定？ - Aryabhatta

每个文件大约有1000000条记录。 - AlgoMan

3

请提供一个你正在处理的例子，以及你目前已经尝试过什么。 - Zaid

1

请参考以下问题，从文件中选择一行随机内容：https://dev59.com/kXVC5IYBdhLWcg3woCrN - Mark Ransom

显示剩余5条评论

7个回答

7

You may find perlfaq5 useful.

- Eugene Yarmash

2

值得注意的是，这是一种单遍算法，每次只在内存中保留两行。 - Schwern

我本来会将这个帖子标记为 CW 的，因为您（大概）不是这个 FAQ 条目的作者。 - Ether

1

+1 聪明的算法。我今天学到了一些东西。小提醒：现在我们应该放弃前面的"srand"。 - tsee

3

sed "1,5d" < FILENAME | sort -R | head -2

- zed_0xff

1

有趣。但是“-R”开关在哪里（可能是为了支持“随机化”）？我刚在Linux（RHEL5.4，coreutils 5.97...），Mac OS X（10.5.8）和FreeBSD（6.4）上检查过。 - Jim Dennis

3

Python解决方案 - 仅读取一次文件，占用较少内存

使用以下方式调用getRandomItems(file('myHuge.log'), 5, 2) - 将返回包含2行的列表

from random import randrange

def getRandomItems(f, skipFirst=0, numItems=1):
    for _ in xrange(skipFirst):
        f.next()
    n = 0; r = []
    while True:
        try:
            nxt = [f.next() for _ in range(numItems)]
        except StopIteration: break
        n += 1
        if not randrange(n):
            r = nxt
    return r

如果无法从 f 中获取可通过的第一个项目，则返回空列表。代码唯一的要求是参数 f 必须是迭代器（支持 next() 方法）。因此，我们可以传递与文件不同的内容，比如说，我们想查看分布情况：

>>> s={}
>>> for i in xrange(5000):
...     r = getRandomItems(iter(xrange(50)))[0]
...     s[r] = 1 + s.get(r,0)
... 
>>> for i in s: 
...     print i, '*' * s[i]
... 
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************

- Nas Banov

1

答案使用Python编写。假设您可以将整个文件读入内存。

#using python 2.6
import sys
import os
import itertools
import random

def main(directory, num_files=5, num_entries=5):
    file_paths = os.listdir(directory)

    # get a random sampling of the available paths
    chosen_paths = random.sample(file_paths, num_files)

    for path in chosen_paths:
        chosen_entries = get_random_entries(path, num_entries)
        for entry in chosen_entries:
            # do something with your chosen entries
            print entry

def get_random_entries(file_path, num_entries):
    with open(file_path, 'r') as file:
        # read the lines and slice off the headers
        lines = file.readlines()[5:]

        # group the lines into pairs (i.e. entries)
        entries = list(itertools.izip_longest(*[iter(lines)]*2))

        # return a random sampling of entries
        return random.sample(entries, num_entries)

if __name__ == '__main__':
    #use optparse here to do fancy things with the command line args
    main(sys.argv[1:])

链接：itertools，random，optparse

- tgray

服务器上记录日志的Python版本为 2.4.3。 - AlgoMan

@EnTerr 如果你删除空格和注释，那么代码行数和你的答案一样；而这两者在你的答案中都缺少。 - Jon-Eric

噢，将所有行都加载到内存中是愚蠢的 - 假设有一个50MB大小的文件里有一百万行？ - Nas Banov

@EnTerr，我不会担心50MB或一百万行的问题。当我们接近500MB-1GB时，问题可能会开始出现，因为在任何给定的点上，文件内容可能会在内存中有两个副本（加上其他变量的一些开销）。 - tgray

0

另外还有两种方法： 1- 通过生成器（可能仍需要大量内存）： http://www.usrsb.in/Picking-Random-Items--Take-Two--Hacking-Python-s-Generators-.html 2- 通过聪明的查找（实际上是最佳方法）： http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html 我在这里复制了聪明的Jonathan Kupferman的代码：

#!/usr/bin/python

import os,random

filename="averylargefile"
file = open(filename,'r')

#Get the total file size
file_size = os.stat(filename)[6]

while 1:
      #Seek to a place in the file which is a random distance away
      #Mod by file size so that it wraps around to the beginning
      file.seek((file.tell()+random.randint(0,file_size-1))%file_size)

      #dont use the first readline since it may fall in the middle of a line
      file.readline()
      #this will return the next (complete) line from the file
      line = file.readline()

      #here is your random line in the file
      print line

- gaborous

0

另一个Python选项; 将所有文件的内容读入内存：

import random
import fileinput

def openhook(filename, mode):
    f = open(filename, mode)
    headers = [f.readline() for _ in range(5)]
    return f

num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)

- Matt Anderson

我的Python版本在记录所在的服务器上是2.4.3。 - AlgoMan

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- FMc · Accepted Answer

如果文件足够小，将行对读入内存并从该数据结构中随机选择。如果文件太大，Eugene Y提供了正确的答案：使用蓄水池抽样算法。

以下是该算法的直观解释。

Process the file line by line.
pick = line, with probability 1/N, where N = line number

换句话说，在第1行，我们将以1/1的概率选择第1行。在第2行，我们将更改选择为第2行，并以1/2的概率选择它。在第3行，我们将更改选择为第3行，并以1/3的概率选择它。以此类推。

对于一个直观的证明，想象一个有3行的文件：

        1            Pick line 1.
       / \
     .5  .5
     /     \
    2       1        Switch to line 2?
   / \     / \
 .67 .33 .33 .67
 /     \ /     \
2       3       1    Switch to line 3?

每个结果的概率：

Line 1: .5 * .67     = 1/3
Line 2: .5 * .67     = 1/3
Line 3: .5 * .33 * 2 = 1/3

从那里开始，其余的就是归纳。例如，假设该文件有4行。我们已经相信，截至第3行，到目前为止的每一行（1、2、3）都有同等机会成为我们的当前选择。当我们进入第4行时，它将有1/4的机会被选中--正好应该是这样，因此通过恰当的数量减少了先前3行的概率（1/3 * 3/4 = 1/4）。
这是Perl FAQ答案，适用于您的问题。
use strict; use warnings; # Ignore 5 lines. <> for 1 .. 5; # Use reservoir sampling to select pairs from remaining lines. my (@picks, $n); until (eof){ my @lines; $lines[$_] = <> for 0 .. 1; $n ++; @picks = @lines if rand($n) < 1; } print @picks;