The code below reads the entire user-selected input file into memory. This requires a lot of memory for very large (> 10 GB) files, so I need to read the file line by line instead.
How can I read a file in Pyodide one line at a time?
<!doctype html>
<html>
<head>
<script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
</head>
<body>
<button>Analyze input</button>
<script type="text/javascript">
async function main() {
// Get the file contents into JS
const [fileHandle] = await showOpenFilePicker();
const fileData = await fileHandle.getFile();
const contents = await fileData.text();
// Create the Python convert toy function
let pyodide = await loadPyodide();
let convert = pyodide.runPython(`
from pyodide.ffi import to_js
def convert(contents):
    return to_js(contents.lower())
convert
`);
let result = convert(contents);
console.log(result);
const blob = new Blob([result], {type : 'application/text'});
let url = window.URL.createObjectURL(blob);
var downloadLink = document.createElement("a");
downloadLink.href = url;
downloadLink.text = "Download output";
downloadLink.download = "out.txt";
document.body.appendChild(downloadLink);
}
const button = document.querySelector('button');
button.addEventListener('click', main);
</script>
</body>
</html>
The code is from this answer to the question "Select and read a file from user's filesystem".
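Based on the answer by rth, I used the code below. It still has 2 issues:
1. The chunks delivered by the stream do not end at newlines, so lines get split across chunks (see the console log below).
2. I need result to be written into the output file, which is available for download to the user (see below, where, for example purposes, it is replaced by the dummy string 'result').
<!doctype html>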
<html>
<head>
<script src="https://cdn.jsdelivr.net/pyodide/v0.22.1/full/pyodide.js"></script>
</head>
<body>
<button>Analyze input</button>
<script type="text/javascript">
async function main() {
// Create the Python convert toy function
let pyodide = await loadPyodide();
let convert = pyodide.runPython(`
from pyodide.ffi import to_js
def convert(contents):
    for line in contents.split('\\n'):
        print(len(line))
    return to_js(contents.lower())
convert
`);
// Get the file contents into JS
const bytes_func = pyodide.globals.get('bytes');
const [fileHandle] = await showOpenFilePicker();
let fh = await fileHandle.getFile()
const stream = fh.stream();
const reader = stream.getReader();
// Do a loop until end of file
while( true ) {
const { done, value } = await reader.read();
if( done ) { break; }
handleChunk( value );
}
console.log( "all done" );
function handleChunk( buf ) {
console.log( "received a new buffer", buf.byteLength );
let result = convert(bytes_func(buf).decode('utf-8'));
}
const blob = new Blob(['result'], {type : 'application/text'});
let url = window.URL.createObjectURL(blob);
var downloadLink = document.createElement("a");
downloadLink.href = url;
downloadLink.text = "Download output";
downloadLink.download = "out.txt";
document.body.appendChild(downloadLink);
}
const button = document.querySelector('button');
button.addEventListener('click', main);
</script>
</body>
</html>
Given this input file with 100 characters per line:
perl -le 'for (1..1e5) { print "0" x 100 }' > test_100x1e5.txt
I am getting this console log output, indicating that lines are broken not at the newline:
received a new buffer 65536
648pyodide.asm.js:10 100
pyodide.asm.js:10 88
read_write_bytes_func.html:41 received a new buffer 2031616
pyodide.asm.js:10 12
20114pyodide.asm.js:10 100
pyodide.asm.js:10 89
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 11
20763pyodide.asm.js:10 100
pyodide.asm.js:10 77
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 23
20763pyodide.asm.js:10 100
pyodide.asm.js:10 65
read_write_bytes_func.html:41 received a new buffer 2097152
pyodide.asm.js:10 35
20763pyodide.asm.js:10 100
pyodide.asm.js:10 53
read_write_bytes_func.html:41 received a new buffer 1711392
pyodide.asm.js:10 47
16944pyodide.asm.js:10 100
pyodide.asm.js:10 0
read_write_bytes_func.html:37 all done
If I change from this:
const blob = new Blob(['result'], {type : 'application/text'});
to that:
const blob = new Blob([result], {type : 'application/text'});
then I get the error:
Uncaught (in promise) ReferenceError: result is not defined
at HTMLButtonElement.main (read_write_bytes_func.html:45:34)
The available memory in this environment is currently limited to 2GB, so you would not be able to read a 10GB file entirely.
If you can process the file as a stream, line by line, you could try mounting the local folder that contains the file using the File System Access API (currently only available in Chrome and Edge).
To mount a local folder in Pyodide,
const dirHandle = await showDirectoryPicker();
if ((await dirHandle.queryPermission({ mode: "readwrite" })) !== "granted") {
if (
(await dirHandle.requestPermission({ mode: "readwrite" })) !== "granted"
) {
throw Error("Unable to read and write directory");
}
}
const nativefs = await pyodide.mountNativeFS("/mount_dir", dirHandle);
then you can access it as a normal file from Pyodide,
pyodide.runPython(`
import os
print(os.listdir('/mount_dir'))
`);
You can then open this file path and iterate on lines as you would usually do in Python.
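As a minimal sketch of that last step, the toy lower-casing conversion from the question could be done one line at a time like this. The function name convert_file and both paths are assumptions for illustration. It also assumes the input file sits in the mounted folder and that the output is written next to it.

```python
def convert_file(src_path, dst_path):
    """Lower-case src_path line by line into dst_path; return the line count."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # Iterating over a file object yields one line at a time,
        # so only a single line is held in memory at once.
        for line in src:
            dst.write(line.lower())
            count += 1
    return count
```

Inside pyodide.runPython this might be called as convert_file('/mount_dir/test_100x1e5.txt', '/mount_dir/out.txt'), where both file names are hypothetical. Because the output goes into the mounted folder, it counts as a change to that folder.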
If you make any changes to this folder you need to run,
await nativefs.syncfs();
See the documentation for more details
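If mounting a folder is not an option, the chunked reading from the question could probably be kept by carrying the incomplete last line of each chunk over to the next one. The sketch below is not part of Pyodide. LineSplitter, feed and finish are hypothetical names, and it assumes each chunk arrives as Python bytes, as with bytes_func(buf) in the question.

```python
import codecs


class LineSplitter:
    """Reassemble complete lines from arbitrarily split byte chunks."""

    def __init__(self, encoding="utf-8"):
        # An incremental decoder copes with a multi-byte character
        # that is cut in half at a chunk boundary.
        self._decoder = codecs.getincrementaldecoder(encoding)()
        self._pending = ""  # trailing text that has no newline yet

    def feed(self, chunk):
        """Return the complete lines (without newline) found so far."""
        text = self._pending + self._decoder.decode(chunk)
        lines = text.split("\n")
        # The last piece is either empty or an unfinished line; keep it.
        self._pending = lines.pop()
        return lines

    def finish(self):
        """Return the final unterminated line, if there is one."""
        text = self._pending + self._decoder.decode(b"", final=True)
        self._pending = ""
        return [text] if text else []
```

Each chunk would be passed to feed, and finish would be called once after the read loop ends. The converted lines still have to be collected in a variable declared outside the chunk handler so that they are in scope when the Blob is created.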