Groovy: Reading a range of lines from a file

Question

14

I have a text file with roughly 2,000,000 lines of data. Iterating over the whole file is easy with the snippet below, but that is not what I need ;-)

def f = new File("input.txt")
f.eachLine() {
    // Some code here
}

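I need to read only a specific range of lines from the file. Is there a way to specify the start and end line, something like this (pseudocode)? I would like to avoid loading all lines into memory with readLines() before selecting the range.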

// Read all lines from 4 to 48
def f = new File("input.txt")
def start = 4
def end = 48
f.eachLine(start, end) {
    // Some code here
}

If this is not possible in Groovy, any Java solution is welcome as well :-)

Cheers, Robert

- Robert Strauch

1

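Is there a positional element that indicates where each line starts? - Aaron Saunders

No, it is just a flat list of IDs like this (separated by CRLF): id_1 id_2 ... id_2000000 - Robert Strauch

Are all the lines the same length? - dogbane

Yes, in fact all lines have the same length (9 characters). - Robert Strauch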

9 Answers

5

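Here is a Groovy solution. Unfortunately, it still reads every line of the file, including the lines after end.

def start = 4
def end = 48

// The int argument is only the number assigned to the first line (1 = count from 1);
// it does not skip any lines. The closure receives (line, lineNo), in that order.
new File("input.txt").eachLine(1) { line, lineNo ->

    if (lineNo >= start && lineNo <= end) {
        // Process the line
    }
}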

- Dónal

4

Groovy can now start from a specific line. Here are two signatures quoted from the File documentation.

Object eachLine(int firstLine, Closure closure) 

Object eachLine(String charset, int firstLine, Closure closure)

- Gangnus

3

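From my reading of the docs, firstLine only determines the number assigned to the first line (so you can count from 1 or from 0); it is not the line at which reading starts. - ajurasz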

3

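I don't believe there is any "magic" way to jump to an arbitrary "line" in a file. Lines are defined only by newline characters, so without actually reading the file there is no way to know where those characters are. I think you have two options:

1. As in Mark Peters' answer, use a BufferedReader to read the file one line at a time until you reach the desired line. This will obviously be slow.
2. Work out which byte (rather than line) the next read needs to start at, and jump straight there with something like RandomAccessFile. Whether you can know the right byte offset efficiently depends on your application. For example, if you read the file sequentially, one section at a time, you only need to record where you left off. If every line is exactly L bytes long (line terminator included), reaching line N, counting from 0, is just a matter of seeking to position N*L. If this is an operation you repeat often, some preprocessing may help: for example, read through the whole file once and record the starting position of each line in an in-memory HashMap. The next time you need line N, look up its position in the HashMap and jump straight there.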
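
For the fixed-length case in option 2, a minimal Java sketch might look like the following. The class and method names (FixedWidthLines, readRange) are made up for illustration, and the sketch assumes a single-byte encoding, 1-based line numbers, and that every line has the same content length and the same terminator length (for example 9 characters plus CRLF).

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class FixedWidthLines {

    /**
     * Reads lines start..end (1-based, inclusive) from a file in which every
     * line has contentLength bytes of text followed by terminatorLength bytes
     * of line terminator. Stops early if the file ends before line end.
     */
    static List<String> readRange(String path, int start, int end,
                                  int contentLength, int terminatorLength) throws IOException {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("need 1 <= start <= end");
        }
        int recordLength = contentLength + terminatorLength;
        List<String> result = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            // Line N begins at byte (N - 1) * recordLength, so earlier lines are never read.
            raf.seek((long) (start - 1) * recordLength);
            byte[] content = new byte[contentLength];
            for (int lineNo = start; lineNo <= end; lineNo++) {
                if (raf.getFilePointer() + contentLength > raf.length()) {
                    break; // the requested range runs past the end of the file
                }
                raf.readFully(content);
                result.add(new String(content, StandardCharsets.US_ASCII));
                raf.skipBytes(terminatorLength); // tolerates a missing final terminator
            }
        }
        return result;
    }
}
```

With the file described in the question, FixedWidthLines.readRange("input.txt", 4, 48, 9, 2) would seek straight to byte 33 and read 45 lines without touching lines 1 to 3.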

- Yevgeniy Brikman

1

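@smartnut007 Before you downvote my answer and promote your own, you might want to reread the question. Straurob is looking for a way to jump to a specific line N in the file without having to read the preceding lines 1 to N. My answer discusses why that is problematic and what the possible workarounds are. Your answer may be a nice use of Groovy syntax, but it has to read the preceding lines 1 to N, so it is the wrong answer. - Yevgeniy Brikman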

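@smartnut007 You are right, it may be a rather "generic" answer, but it really did help me, particularly because I had not thought of skipping N bytes. That is why I accepted this post as the answer. - Robert Strauch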

@straurob Of course, Unicode characters could break the "skip N bytes" approach. - tim_yates
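
To illustrate the point about Unicode: seeking works in bytes, while a "9-character" line is counted in characters, and the two only agree in single-byte encodings. A small Java check along those lines (the class name and sample strings are hypothetical):

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class ByteLengthDemo {

    /** Number of bytes the text occupies on disk when encoded as UTF-8. */
    static int utf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String ascii = "id_000001";         // 9 characters, all ASCII
        String accented = "id_00000\u00e9"; // 9 characters, the last one is U+00E9
        System.out.println(utf8Bytes(ascii));    // prints 9
        System.out.println(utf8Bytes(accented)); // prints 10
    }
}
```

A single non-ASCII character therefore shifts the byte offset of every following line, and an N*L seek would land in the wrong place.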

@tim_yates True. Luckily for me, the encoding of this file is not going to change :-) - Robert Strauch

2

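This should do the trick. Be aware, though, that return only leaves the closure for the current line, so eachLine still reads the lines after end; they are just not printed.

def readRange = {file ->
    def start = 10
    def end = 20
    def fileToRead = new File(file)
    // eachLine passes a 1-based line number as the second closure argument,
    // so no manual lineNo++ is needed (it would shift the range by one).
    fileToRead.eachLine{line, lineNo ->
        if(lineNo > end) {
            return // only ends this closure call; eachLine keeps iterating
        }
        if(lineNo >= start) {
            println line
        }
    }
}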

- Vinay

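This approach worked for a different problem I was solving earlier. I don't know why, but I did not even execute lineNo++ and it magically incremented by itself. - Sundeep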

2

In Groovy, you can use a Category.

class FileHelper {
    static eachLineInRange(File file, IntRange lineRange, Closure closure) {
        file.withReader { r->
            def line
            for(; (line = r.readLine()) != null;) {
                def lineNo = r.lineNumber
                if(lineNo < lineRange.from) continue
                if(lineNo > lineRange.to) break
                closure.call(line, lineNo)
            }
        }
    }
}

def f = '/path/to/file' as File
use(FileHelper) {
    f.eachLineInRange(from..to){line, lineNo ->
        println "$lineNo) $line"
    }
}

Or ExpandoMetaClass:

File.metaClass.eachLineInRange = { IntRange lineRange, Closure closure ->
    delegate.withReader { r->
        def line
        for(; (line = r.readLine()) != null;) {
            def lineNo = r.lineNumber
            if(lineNo < lineRange.from) continue
            if(lineNo > lineRange.to) break
            closure.call(line, lineNo)
        }
    }
}

def f = '/path/to/file' as File
f.eachLineInRange(from..to){line, lineNo ->
    println "$lineNo) $line"
}

With this solution you read the file sequentially, line by line, but the lines are not all kept in memory.

- Jarek Przygódzki

1

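You have to iterate from the beginning to reach your start position, but you can use LineNumberReader (instead of BufferedReader) because it keeps track of the line numbers for you.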

    final int start = 4;
    final int end = 48;

    final LineNumberReader in = new LineNumberReader(new FileReader(filename));
    String line=null;
    while ((line = in.readLine()) != null && in.getLineNumber() <= end) {
        if (in.getLineNumber() >= start) {
            //process line
        }
    }

- dogbane

This makes the code a lot more concise, +1. - Mark Peters

1

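Thanks for all your hints. Based on what you wrote, I cobbled together my own piece of code, which seems to work. It is not elegant, but it serves its purpose :-)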

def f = new RandomAccessFile("D:/input.txt", "r")
def start = 3
def end = 6
def current = start-1
def BYTE_OFFSET = 11
def resultList = []

if ((end*BYTE_OFFSET) <= f.length()) {
    while ((current*BYTE_OFFSET) < (end*BYTE_OFFSET)) {
        f.seek(current*BYTE_OFFSET)
        resultList << f.readLine()
        current++
    }
}

- Robert Strauch

0

Here is another Java solution, using LineIterator and FileUtils from Commons / IO:

public static Collection<String> readFile(final File f,
    final int startOffset,
    final int lines) throws IOException{
    final LineIterator it = FileUtils.lineIterator(f);
    int index = 0;
    final Collection<String> coll = new ArrayList<String>(lines);
    while(index++ < startOffset + lines && it.hasNext()){
        final String line = it.nextLine();
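        // index is 1-based here; '>' skips exactly startOffset lines, so at most
        // 'lines' lines are collected ('>=' returned one line too many)
        if(index > startOffset){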
            coll.add(line);
        }
    }
    it.close();
    return coll;
}

- Sean Patrick Floyd

- Mark Peters · Accepted Answer

Java solution:

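BufferedReader r = new BufferedReader(new FileReader(f));
String line;
// ln is the 1-based number of the line being read; testing ln <= end first
// stops the loop before any line past end is read
for ( int ln = 1; ln <= end && (line = r.readLine()) != null; ln++ ) {
    if ( ln >= start ) {
        //Some code here
    }
}
r.close();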

Gross, isn't it?

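Unfortunately, unless your lines are fixed length, you will not be able to skip to line start efficiently, because each line could be arbitrarily long and therefore all the data has to be read. That does not rule out a nicer solution, though.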
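
For lines of varying length, the preprocessing idea mentioned in another answer (scan the file once and remember where each line starts) could be sketched in Java roughly as follows. LineOffsetIndex and its methods are hypothetical names. The sketch assumes lines end in LF or CRLF, an encoding in which byte 0x0A only ever means newline (ASCII or UTF-8, for instance), and ASCII content, since RandomAccessFile.readLine() maps each byte to one char.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
final class LineOffsetIndex {

    // starts.get(i) is the byte offset at which line i + 1 begins.
    private final List<Long> starts = new ArrayList<>();

    /** One full pass over the file that records the byte offset of every line start. */
    static LineOffsetIndex build(Path file) throws IOException {
        LineOffsetIndex index = new LineOffsetIndex();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long position = 0;
            boolean atLineStart = true;
            int b;
            while ((b = in.read()) != -1) {
                if (atLineStart) {
                    index.starts.add(position);
                    atLineStart = false;
                }
                if (b == '\n') {
                    atLineStart = true; // the next byte, if any, begins a new line
                }
                position++;
            }
        }
        return index;
    }

    int lineCount() {
        return starts.size();
    }

    /** Seeks directly to the given 1-based line and reads just that line. */
    String readLine(Path file, int lineNo) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(starts.get(lineNo - 1)); // throws IndexOutOfBoundsException if out of range
            return raf.readLine();
        }
    }
}
```

The index costs one complete read up front, so it only pays off when the same file is queried repeatedly. For millions of lines a long[] would be more compact than a List<Long>.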

Java 8

It is worth mentioning that streams let us do this efficiently:

int start = 5;
int end = 12;
Path file = Paths.get("/tmp/bigfile.txt");

try (Stream<String> lines = Files.lines(file)) {
    // skip the first start-1 lines, then take lines start..end (1-based, inclusive)
    lines.skip(start - 1).limit(end - start + 1).forEach(System.out::println);
}

Because streams are lazily evaluated, it only reads lines up to and including end (plus whatever internal buffering it chooses to do).