
How to implement the Unix "paste" command in Node.js without loading whole files into memory?

A basic version of Unix paste can be implemented in Python as follows (this example handles only two files, whereas Unix paste accepts any number of files):

def paste(fn1, fn2):
  with open(fn1) as f1:
    with open(fn2) as f2:
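      for l1 in f1:
        # readline() returns "" (never None) at end of file,
        # so a shorter f2 simply contributes an empty second field
        l2 = f2.readline()
        print(l1.rstrip("\n") + "\t" + l2.rstrip("\n"))
      # f1 is exhausted; emit whatever is left of f2 with an empty first field
      for l2 in f2:
        print("\t" + l2.rstrip("\n"))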

import sys
if __name__ == "__main__":
  if len(sys.argv) >= 3:
    paste(sys.argv[1], sys.argv[2])

The task is to implement the same functionality in Node.js. Importantly, because the input files can be huge, the implementation should read them line by line rather than loading an entire file into memory. I want to see how to achieve this with built-in Node functionality, without external packages.

Note that it is easy to implement Unix paste with synchronous I/O, as shown in the Python example, but Node doesn't provide synchronous I/O for line reading. Meanwhile, there are ways to read a single file line by line with asynchronous I/O, but jointly reading two files is harder because the two streams are not synchronized.
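For reference, the single-file case mentioned above can be sketched with the built-in readline module. This is a minimal illustration, not the benchmarked lc-node.js; the helper name countLines is hypothetical, and CommonJS require is assumed here only so the snippet runs as a plain script.

```javascript
const fs = require('node:fs')
const readline = require('node:readline')

// Hypothetical helper: count the lines of one file without loading it into memory.
async function countLines(path) {
  const rl = readline.createInterface({
    input: fs.createReadStream(path), // the file is streamed in chunks
    crlfDelay: Infinity,              // treat \r\n as a single line break
  })
  let n = 0
  // The readline interface is async-iterable and yields one line at a time
  for await (const _line of rl) n++
  return n
}
```

With two such interfaces, each one emits lines at its own pace, which is why pairing line i of one file with line i of another needs explicit coordination.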

So far the only solution I can think of is to implement synchronous line reading on top of the basic read API. Dave Newton pointed out in a comment that the npm n-readlines package implements this approach in 100+ lines. Because n-readlines inspects each byte to find line endings, I suspected it was inefficient and ran a microbenchmark; the results are in the table below. For line reading (not for this task), n-readlines is about 3 times as slow as a Node Readline implementation and roughly an order of magnitude slower than built-in line reading in Python, Perl or mawk.
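To make the synchronous approach concrete, here is a rough sketch of line reading built on fs.readSync. It is not how n-readlines works internally: it splits decoded chunks on "\n" instead of scanning bytes, and it does not strip "\r". The names readLinesSync and pasteSync are hypothetical, and UTF-8 input is assumed.

```javascript
const fs = require('node:fs')
const { StringDecoder } = require('node:string_decoder')

// Hypothetical helper: lazily yield the lines of a file using synchronous chunked reads.
// Memory use is bounded by chunkSize plus the longest line.
function* readLinesSync(path, chunkSize = 1 << 16) {
  const fd = fs.openSync(path, 'r')
  try {
    const buf = Buffer.allocUnsafe(chunkSize)
    const decoder = new StringDecoder('utf8') // keeps multi-byte characters split across chunks intact
    let rest = ''
    let n
    while ((n = fs.readSync(fd, buf, 0, chunkSize, null)) > 0) {
      const parts = (rest + decoder.write(buf.subarray(0, n))).split('\n')
      rest = parts.pop() // the last piece may be an incomplete line
      yield* parts
    }
    rest += decoder.end()
    if (rest.length > 0) yield rest // final line without a trailing newline
  } finally {
    fs.closeSync(fd)
  }
}

// Hypothetical synchronous paste: advance every reader in lockstep,
// using an empty field for files that have already ended.
function pasteSync(paths, write = s => process.stdout.write(s)) {
  const its = paths.map(p => readLinesSync(p))
  while (true) {
    const results = its.map(it => it.next())
    if (results.every(r => r.done)) break
    write(results.map(r => (r.done ? '' : r.value)).join('\t') + '\n')
  }
}
```

Calling pasteSync(['a.txt', 'b.txt']) would print the pasted rows to standard output. Because generators are pulled one step at a time, the files stay synchronized without any async coordination.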

What is the proper way to implement Unix paste? n-readlines uses synchronous APIs. Would a good async solution be cleaner and faster?

Language    Runtime    Version  Elapsed (s)  User (s)  Sys (s)  Code
JavaScript  node       21.5.0          6.30      5.33     0.90  lc-node.js
JavaScript  node       21.5.0         22.34     20.41     2.24  lc-n-readlines.js
JavaScript  bun        1.0.20          4.91      5.30     1.47  lc-node.js
JavaScript  bun        1.0.20         21.16     19.22     3.37  lc-n-readlines.js
JavaScript  k8         1.0             1.49      1.06     0.37  lc-k8.js
C           clang      15.0.0          0.71      0.35     0.35  lc-c.c
python      python     3.11.17         3.48      2.85     0.62  lc-python.py
perl        perl       5.34.3          1.70      1.13     0.57  lc-perl.pl
awk         mawk       1.3.4           2.08      1.27     0.80  lc-awk.awk
awk         apple awk  ?              90.06     87.90     1.12  lc-awk.awk
asked Oct 17 '25 20:10 by user172818


1 Answer

import { open as fsOpenAsync } from 'node:fs/promises'
import { createWriteStream } from 'node:fs'

const filenames = ['a.txt', 'b.txt', 'c.txt']
const outname = 'out.txt'

await paste(filenames, outname)

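/**
 * Read multiple files line by line and write lines concatenated by `\t`
 * @param {string[]} from input file names
 * @param {string} to output file name
 */
async function paste(from, to) {
  // Open every input file named in `from`
  const files = await Promise.all(from.map(fn => fsOpenAsync(fn)))
  const zip = zipAsyncs(files.map(f => f.readLines()[Symbol.asyncIterator]()))
  const writeStream = createWriteStream(to, { flags: 'w' })
  for await (const lines of zip) {
    // write() returns false once its buffer is full; wait for 'drain' so memory stays bounded
    if (!writeStream.write(`${lines.map(e => e ?? '').join('\t')}\n`))
      await new Promise(resolve => writeStream.once('drain', resolve))
  }
  // end() flushes buffered data; its callback runs once everything has been written
  await new Promise(resolve => writeStream.end(resolve))
  await Promise.all(files.map(f => f.close()))
}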

/**
 * Zip multiple async iterators, yielding `undefined` for exhausted ones
 * @template T
 * @param {AsyncIterator<T>[]} its
 * @returns {AsyncGenerator<(T | undefined)[]>}
 */
async function* zipAsyncs(its) {
  while (true) {
    // Advance every iterator by one step, in lockstep
    const results = await Promise.all(its.map(e => e.next()))
    // Stop before yielding once all iterators are exhausted,
    // so no trailing row of empty fields is produced
    if (results.every(r => r.done))
      return
    yield results.map(r => r.done ? undefined : r.value)
  }
}
answered Oct 19 '25 11:10 by Dimava