Modifying a large file in Scala

Question

7

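I am trying to modify a large PostScript file in Scala (some are as large as 1 GB). The file is a group of batches, and each batch contains a code that represents the batch number, the number of pages, and so on.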

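I need to:

1. Search the file for the batch codes (which always start on the same line in the file).
2. Count the number of pages until the next batch code.
3. Modify the batch code to include how many pages are in each batch.
4. Save the new file in a different location.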

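My current solution uses two iterators (iterA and iterB), both created from Source.fromFile("file.ps").getLines. The first iterator (iterA) advances in a while loop to the start of a batch code (iterB.next is called each time as well). iterB then keeps searching until the next batch code (or the end of the file), counting the pages it passes along the way. It then updates the batch code at iterA's position, and the process repeats.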

This seems very un-Scala-like, and I still haven't come up with a good way to save these changes to a new file.

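What is a good approach to this problem? Should I ditch iterators entirely? Preferably I would not hold the entire input or output in memory at once.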

Thanks!

- Andrew Conner

3 Answers

1

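If you're not in pursuit of functional Scala enlightenment, I'd recommend a more imperative style using java.util.Scanner#findWithinHorizon. My example is quite naive and iterates over the input twice.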

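import java.io.BufferedWriter
import java.util.Scanner

val scanner = new Scanner(inFile)

val writer = new BufferedWriter(...)

// countPages and writePageCount are helpers you have to supply yourself.
// A recursive method needs an explicit result type, hence `: Unit`.
def loop(): Unit = {
  // you might want to limit the horizon to prevent OutOfMemoryError
  Option(scanner.findWithinHorizon(".*YOUR-BATCH-MARKER", 0)) match {
    case Some(batch) =>
      val pageCount = countPages(batch)
      writePageCount(writer, pageCount)
      writer.write(batch)
      loop()

    case None => // no further match: nothing left to do
  }
}

loop()
scanner.close()
writer.close()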
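
The two helpers above are left undefined. As a rough sketch only, they might look like the following, assuming each page starts with a DSC-style "%%Page:" comment and that the count is written as a "%%BatchPages:" comment line. Both the marker and the output format are my assumptions, not something taken from the question, so adapt them to your files.

```scala
import java.io.{BufferedWriter, StringWriter}

// Assumption: every page in the batch text starts with this marker.
val PageMarker = "%%Page:"

// Counts the lines of one batch chunk that start with the page marker.
def countPages(batch: String): Int =
  batch.split("\n").count(_.startsWith(PageMarker))

// Hypothetical output format: one comment line carrying the page count.
def writePageCount(writer: BufferedWriter, pageCount: Int): Unit = {
  writer.write(s"%%BatchPages: $pageCount")
  writer.newLine()
}

// Small in-memory usage example.
val sample = "%%Page: 1 1\nshowpage\n%%Page: 2 2\nshowpage\n"
val sink = new StringWriter()
val sampleWriter = new BufferedWriter(sink)
writePageCount(sampleWriter, countPages(sample))
sampleWriter.flush()
```

Counting per chunk keeps memory bounded by the size of one batch, provided the horizon passed to findWithinHorizon is limited as the comment in the loop suggests.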

- MxFr

0

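Maybe you can use span and duplicate effectively. Assuming the iterator is positioned at the start of a batch, you take the span up to the next batch, duplicate it so that you can count the pages, write the modified batch line, and then write the pages using the duplicated iterator. Then you process the next batch recursively...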

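def batch(i: Iterator[String]): Unit = {
  if (i.hasNext) {
    assert(i.next() == "batch")
    // everything up to the next "batch" line belongs to the current batch
    val (current, next) = i.span(_ != "batch")
    // duplicate so the batch can be traversed twice: once to count, once to write
    val (forCounting, forWriting) = current.duplicate
    val count = forCounting.count(_ == "p")
    println("batch " + count)
    forWriting.foreach(println)
    batch(next)
  }
}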

Given the following input:

val src = Source.fromString("head\nbatch\np\np\nbatch\np\nbatch\np\np\np\n")

You position the iterator at the start of the first batch and then process the batches:

val (head, next) = src.getLines().span(_ != "batch")
head.foreach(println)
batch(next)

This prints:

head
batch 2
p
p
batch 1
p
batch 3
p
p
p
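
For comparison, here is a loop-based sketch that produces the same output but writes to a java.io.PrintWriter instead of stdout. It swaps span and duplicate for a BufferedIterator and an ArrayBuffer that holds one batch at a time, so it is a different technique from the one above. The name rewriteBatches and the toy "batch" / "p" line format are only for illustration.

```scala
import java.io.{PrintWriter, StringWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Copies the header, then rewrites every batch line with its page count.
def rewriteBatches(lines: Iterator[String], out: PrintWriter): Unit = {
  val it = lines.buffered // `head` lets us peek without consuming
  // Header: everything before the first batch line is copied unchanged.
  while (it.hasNext && it.head != "batch") out.println(it.next())
  while (it.hasNext) {
    it.next() // consume the "batch" line itself
    // Buffer the lines of this batch only, up to the next batch line.
    val pages = ArrayBuffer.empty[String]
    while (it.hasNext && it.head != "batch") pages += it.next()
    out.println("batch " + pages.count(_ == "p"))
    pages.foreach(line => out.println(line))
  }
}

// Usage with the same toy input, written to an in-memory buffer.
val demoSrc = Source.fromString("head\nbatch\np\np\nbatch\np\nbatch\np\np\np\n")
val demoBuffer = new StringWriter()
val demoOut = new PrintWriter(demoBuffer)
rewriteBatches(demoSrc.getLines(), demoOut)
demoOut.close()
```

For a real file you would pass Source.fromFile(...).getLines() and a PrintWriter on the output file. Only one batch is held in memory at a time.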

- huynhjl


- stephenjudkins · Accepted Answer

You can probably use Scala's Stream class to do this. I am assuming you don't mind holding one "batch" in memory at a time.

import scala.annotation.tailrec
import scala.io._

def isBatchLine(line: String): Boolean = ...

def batchLine(size: Int): String = ...

val it = Source.fromFile("in.ps").getLines()
// cannot use it.toStream here because of SI-4835
def inLines = Stream.continually(it).takeWhile(_.hasNext).map(_.next())

// Note: using `def` instead of `val` here means we don't hold
// the entire stream in memory
def batchedLinesFrom(stream: Stream[String]): Stream[String] = {
  // batch: the lines up to (not including) the next batch line
  val (batch, remainder) = stream span { !isBatchLine(_) }
  if (batch.isEmpty && remainder.isEmpty) {
    Stream.empty
  } else {
    // emit the new batch line, then the batch, then recurse past the old batch line
    batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
  }
}

def newLines = batchedLinesFrom(inLines dropWhile isBatchLine)

val ps = new java.io.PrintStream(new java.io.File("out.ps"))

newLines foreach ps.println

ps.close()
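
To make the behaviour concrete, here is a small in-memory run of the same logic. The definitions of isBatchLine and batchLine are placeholders for a toy format where a batch line is exactly "batch"; the real ones depend on your PostScript files. Note that Stream is deprecated in favour of LazyList since Scala 2.13, so expect a deprecation warning there.

```scala
import java.io.{PrintWriter, StringWriter}
import scala.io.Source

// Placeholder helpers for a toy format (assumptions, not the real batch codes).
def isBatchLine(line: String): Boolean = line == "batch"
def batchLine(size: Int): String = "batch " + size

// Same structure as above: emit a new batch line, the batch itself, then recurse.
def batchedLinesFrom(stream: Stream[String]): Stream[String] = {
  val (batch, remainder) = stream.span(line => !isBatchLine(line))
  if (batch.isEmpty && remainder.isEmpty) Stream.empty
  else batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
}

val toyLines = Source.fromString("batch\np\np\nbatch\np\n").getLines()
def toyInLines: Stream[String] =
  Stream.continually(toyLines).takeWhile(_.hasNext).map(_.next())

// Write to an in-memory buffer instead of a file so the result can be inspected.
val toyBuffer = new StringWriter()
val toyOut = new PrintWriter(toyBuffer)
batchedLinesFrom(toyInLines.dropWhile(isBatchLine)).foreach(line => toyOut.println(line))
toyOut.close()
// toyBuffer now holds the lines: batch 2, p, p, batch 1, p
```

Keep in mind that batchLine receives batch.size, which is the number of lines in the batch. If a page spans several lines, you would count only the page-marker lines instead.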