Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient way to join the nth column of all files in a directory?

Tags:

bash

awk

cut

paste

A for loop is way to slow. The files have 500k lines. I want to join specifically the 4th column of all files. Appending column after column to the right.

The columns in each file are separated by tab.

col1 col2 col3 col4 col5
a 0 0 -1 0.001
b 1 0  2 0.004
c 2 0 3 0

col1 col2 col3 col4 col5
c 2 0 -9 0.004
s 1 0  5 0.002
d 3 0 3 0.4

col1 col2 col3 col4 col5
r 2 1 0 0.4
j 1 1 1 0.2
r 3 1 2 0.1

I want:

file1 file2 file3
-1 -9 0
2 5 1
3 3 2

I tried first converting to .csv:

for file in $(ls) do awk '{$1=$1}1' OFS=',' ${file} > ${file}.csv done

And then doing this:

eval paste -d, $(printf "<(cut -d, -f4 %s) " *.csv)

But I get this error: paste: /dev/fd/19: Too many open files

I have to join 400 files of 500k lines each.

like image 682
Caterina Avatar asked Sep 07 '25 17:09

Caterina


1 Answers

Your OS doesn't allow you to paste that many files in one go. You'll have to break them up into smaller batches. Here's how to simply do one at a time.

for file in *.csv; do
    if [ -e tempfile ]; then
        paste -d, tempfile <(cut -d, -f4 "$file") >tempfile2
        mv tempfile2 tempfile
    else
        cut -d, -f4 "$file" >tempfile
    fi
done
mv tempfile result.csv

As an aside, don't use ls in scripts. You want simply

awk '{$1=$1}1' OFS=',' * > ${file}.csv

... but there is no reason to separately convert each file into CSV. You could fold both operations into one;

rm tempfile
for file in *; do
    case $file in tempfile | tempfile2 | result.csv) continue;; esac
    if [ -e tempfile ]; then
        paste -d, tempfile <(awk '{print $4}' "$file") >tempfile2
        mv tempfile2 tempfile
    else
        awk '{ print $4 }' "$file" >tempfile
    fi
done
mv tempfile result.csv
like image 100
tripleee Avatar answered Sep 10 '25 06:09

tripleee