Reading a file line by line in Pyodide

4
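The code below reads the user-selected input file in full. For very large files (> 10 GB) this requires a lot of memory. I need to read the file line by line.
How can I read a file one line at a time in Pyodide?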
<!doctype html>
<html>
  <head>
      <script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
  </head>
  <body>
    <button>Analyze input</button>
    <script type="text/javascript">
      async function main() {
        // Get the file contents into JS
        const [fileHandle] = await showOpenFilePicker();
        const fileData = await fileHandle.getFile();
        const contents = await fileData.text();

        // Create the Python convert toy function
        let pyodide = await loadPyodide();
        let convert = pyodide.runPython(`
from pyodide.ffi import to_js
def convert(contents):
    return to_js(contents.lower())
convert
      `);

        let result = convert(contents);
        console.log(result);

        const blob = new Blob([result], {type : 'application/text'});

        let url = window.URL.createObjectURL(blob);

        var downloadLink = document.createElement("a");
        downloadLink.href = url;
        downloadLink.text = "Download output";
        downloadLink.download = "out.txt";
        document.body.appendChild(downloadLink);

      }
      const button = document.querySelector('button');
      button.addEventListener('click', main);
    </script>
  </body>
</html>

This code comes from an answer to the question "Select and read a file from the user's filesystem".


Based on rth's answer, I used the following code. It still has two problems:

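  • Chunks split some lines into pieces, as shown with the example input file, which has 100 characters per line. The console log (below) shows that chunks do not always end at a newline, so the first and last lines of a chunk can be partial.
  • I cannot write the variable result to the output file that is offered to the user for download (see below; for demonstration purposes it is replaced by the dummy string 'result').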
<!doctype html>
<html>
  <head>
    <script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
  </head>
  <body>
    <button>Analyze input</button>
    <script type="text/javascript">
      async function main() {
          
          // Create the Python convert toy function
          let pyodide = await loadPyodide();
          let convert = pyodide.runPython(`
from pyodide.ffi import to_js
def convert(contents):
    for line in contents.split('\\n'):
        print(len(line))
    return to_js(contents.lower())
convert
      `);
          
          // Get the file contents into JS
          const bytes_func = pyodide.globals.get('bytes');                                               
          
          const [fileHandle] = await showOpenFilePicker();  
          let fh = await fileHandle.getFile()  
          const stream = fh.stream();  
          const reader = stream.getReader();
          // Do a loop until end of file


          while( true ) {
              const { done, value } = await reader.read();
              if( done ) { break; }
              handleChunk( value );
          }
          console.log( "all done" );


          function handleChunk( buf ) {
              console.log( "received a new buffer", buf.byteLength );
              let result = convert(bytes_func(buf).decode('utf-8'));
          }
          
          const blob = new Blob(['result'], {type : 'application/text'});
          
          let url = window.URL.createObjectURL(blob);
          
          var downloadLink = document.createElement("a");
          downloadLink.href = url;
          downloadLink.text = "Download output";
          downloadLink.download = "out.txt";
          document.body.appendChild(downloadLink);
          
      }
      const button = document.querySelector('button');
      button.addEventListener('click', main);
    </script>
  </body>
</html>

Given an input file with 100 characters per line:

perl -le 'for (1..1e5) { print "0" x 100 }' > test_100x1e5.txt

I get this console log output, which shows that the chunks do not break at newlines:

received a new buffer 65536
648pyodide.asm.js:10 100
pyodide.asm.js:10 88
read_write_bytes_func.html:41 received a new buffer 2031616
pyodide.asm.js:10 12
20114pyodide.asm.js:10 100
pyodide.asm.js:10 89
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 11
20763pyodide.asm.js:10 100
pyodide.asm.js:10 77
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 23
20763pyodide.asm.js:10 100
pyodide.asm.js:10 65
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 35
20763pyodide.asm.js:10 100
pyodide.asm.js:10 53
read_write_bytes_func.html:41 received a new buffer 1711392
pyodide.asm.js:10 47
16944pyodide.asm.js:10 100
pyodide.asm.js:10 0
read_write_bytes_func.html:37 all done

If I change this:
const blob = new Blob(['result'], {type : 'application/text'});

to this:

const blob = new Blob([result], {type : 'application/text'});

then I get the error:

Uncaught (in promise) ReferenceError: result is not defined
    at HTMLButtonElement.main (read_write_bytes_func.html:45:34)
2 Answers

2
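If you want to process a single file, another solution is to use the streaming JavaScript API and process each chunk of data in Python.
For a UTF-8 encoded text file, a partial solution could look like this: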
const bytes_func = pyodide.globals.get('bytes');

const [fileHandle] = await showOpenFilePicker();
let fh = await fileHandle.getFile();
const stream = fh.stream();
const reader = stream.getReader();
// Loop until the end of the file
while (true) {
  const { done, value } = await reader.read();
  if (done) break;  // value is undefined once the stream is exhausted
  // Process a single chunk: value is a Uint8Array
  let chunk = bytes_func(value).decode('utf-8');
  // chunk now holds the decoded text of this chunk
}

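Each chunk contains several lines, so it has to be split again to get a line iterator in Python. I am not sure whether a chunk can end in the middle of a line.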
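One possible way to deal with chunks that end in the middle of a line is to keep the unfinished tail of each chunk and prepend it to the next one. The sketch below is only an illustration under assumptions: the class name LineSplitter and its methods feed and flush are hypothetical, and it assumes each chunk arrives on the Python side as a bytes-like object. It uses the standard library's incremental decoder so that a multi-byte UTF-8 character cut at a chunk boundary is not decoded too early.

```python
import codecs


class LineSplitter:
    """Hypothetical helper: turn arbitrarily sized byte chunks into whole lines."""

    def __init__(self, encoding="utf-8"):
        # The incremental decoder holds back the bytes of a multi-byte
        # character that was cut at a chunk boundary until the rest arrives.
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._tail = ""  # unfinished line carried over to the next chunk

    def feed(self, chunk):
        """Return the complete lines made available by this chunk."""
        # Assumption: chunk is bytes-like (bytes, bytearray, memoryview).
        text = self._tail + self._decoder.decode(bytes(chunk))
        lines = text.split("\n")
        self._tail = lines.pop()  # the last piece has no newline yet
        return lines

    def flush(self):
        """Return the final line if the file does not end with a newline."""
        text = self._tail + self._decoder.decode(b"", final=True)
        self._tail = ""
        return [text] if text else []


splitter = LineSplitter()
for chunk in (b"first li", b"ne\nsecond line\nthi", b"rd line"):
    for line in splitter.feed(chunk):
        print(len(line), line)
for line in splitter.flush():
    print(len(line), line)
```

With this approach only one chunk plus one partial line is held in memory at a time. Calling flush once after the last chunk picks up a final line that has no trailing newline.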

1

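The memory available in this environment is currently limited to 2 GB, so you will not be able to read a 10 GB file in full.

If you can process the file as a stream, line by line, you could try mounting the local folder that contains the file using the File System Access API (currently available only in Chrome and Edge).

To mount a local folder in Pyodide, do the following: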

const dirHandle = await showDirectoryPicker();

if ((await dirHandle.queryPermission({ mode: "readwrite" })) !== "granted") {
  if (
    (await dirHandle.requestPermission({ mode: "readwrite" })) !== "granted"
  ) {
    throw Error("Unable to read and write directory");
  }
}

const nativefs = await pyodide.mountNativeFS("/mount_dir", dirHandle);

You can then access it from Pyodide like any ordinary file.

pyodide.runPython(`
  import os
  print(os.listdir('/mount_dir'))
`);

You can open this file path and iterate over the lines as you normally would in Python.
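As a rough sketch of what that iteration could look like, the function below streams one file into another, lowercasing each line as in the question's toy convert function. The function name lowercase_lines and the file names in the usage comment are hypothetical; inside Pyodide the paths would point below the mounted directory. Iterating over a Python file object yields one line at a time, so the whole file is never held in memory.

```python
def lowercase_lines(src_path, dst_path):
    """Copy src_path to dst_path one line at a time, lowercasing each line."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # the file object yields one line per iteration
            dst.write(line.lower())
            count += 1
    return count  # number of lines processed


# Hypothetical usage inside Pyodide, assuming the folder is mounted
# at /mount_dir and contains a file named input.txt:
# lowercase_lines("/mount_dir/input.txt", "/mount_dir/out.txt")
```

Writing the output next to the input avoids building the result as a single string in memory.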

If you make any changes in this folder, you need to run:

await nativefs.syncfs();

See the documentation for more details.


Thanks for the suggestion! How can I let the user select a specific file (rather than only a directory)? At the moment I do it like this: const [fileHandle] = await showOpenFilePicker(); const fileData = await fileHandle.getFile(); const contents = await fileData.text(); - Timur Shtatland
