 

how to transpose fields with awk array? [duplicate]

I have a huge tab-separated file formatted like this

X column1 column2 column3
row1 0 1 2
row2 3 4 5
row3 6 7 8
row4 9 10 11

I would like to transpose it in an efficient way using only bash commands (I could write a Perl script of ten or so lines to do that, but it should be slower to execute than native bash functions). So the output should look like

X row1 row2 row3 row4
column1 0 3 6 9
column2 1 4 7 10
column3 2 5 8 11

I thought of a solution like this

cols=$(head -n 1 input | wc -w)
for (( i=1; i <= cols; i++ ))
do cut -f "$i" input | tr $'\n' $'\t' | sed -e "s/\t$/\n/g" >> output
done

But it's slow and doesn't seem like the most efficient solution. I've seen a solution for vi in this post, but it's still too slow. Any thoughts/suggestions/brilliant ideas? :-)
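The same per-column idea can be written a bit more compactly with paste -s, which joins each column's lines in one step (a sketch with a stand-in input file; it still makes one pass over the file per column, so it's no faster):

```shell
printf 'X\tc1\tc2\nr1\t0\t1\n' > input   # stand-in for the real tab-separated file
cols=$(head -n 1 input | wc -w)
for i in $(seq "$cols"); do
  # paste -s joins each column's lines with tabs in a single step,
  # replacing the tr + sed pair
  cut -f "$i" input | paste -s -
done > output
```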

Federico Giorgi asked May 07 '26

2 Answers

awk '
{ 
    for (i=1; i<=NF; i++)  {
        a[NR,i] = $i
    }
}
NF>p { p = NF }
END {    
    for(j=1; j<=p; j++) {
        str=a[1,j]
        for(i=2; i<=NR; i++){
            str=str" "a[i,j];
        }
        print str
    }
}' file

output

$ more file
0 1 2
3 4 5
6 7 8
9 10 11

$ ./shell.sh
0 3 6 9
1 4 7 10
2 5 8 11

Performance against the Perl solution by Jonathan on a 10,000-line file:

$ head -5 file
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
1 0 1 2

$  wc -l < file
10000

$ time perl test.pl file >/dev/null

real    0m0.480s
user    0m0.442s
sys     0m0.026s

$ time awk -f test.awk file >/dev/null

real    0m0.382s
user    0m0.367s
sys     0m0.011s

$ time perl test.pl file >/dev/null

real    0m0.481s
user    0m0.431s
sys     0m0.022s

$ time awk -f test.awk file >/dev/null

real    0m0.390s
user    0m0.370s
sys     0m0.010s

EDIT by Ed Morton (@ghostdog74 feel free to delete if you disapprove).

Maybe this version with some more explicit variable names will help answer some of the questions below and generally clarify what the script is doing. It also uses tabs as the separator, which the OP originally asked for, so it handles empty fields, and it coincidentally pretties up the output a bit for this particular case.

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
    for (rowNr=1;rowNr<=NF;rowNr++) {
        cell[rowNr,NR] = $rowNr
    }
    maxRows = (NF > maxRows ? NF : maxRows)
    maxCols = NR
}
END {
    for (rowNr=1;rowNr<=maxRows;rowNr++) {
        for (colNr=1;colNr<=maxCols;colNr++) {
            printf "%s%s", cell[rowNr,colNr], (colNr < maxCols ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

The above solutions will work in any awk (except old, broken awk of course - there YMMV).

The above solutions do read the whole file into memory though - if the input files are too large for that then you can do this:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ printf "%s%s", (FNR>1 ? OFS : ""), $ARGIND }
ENDFILE {
    print ""
    if (ARGIND < NF) {
        ARGV[ARGC] = FILENAME
        ARGC++
    }
}
$ awk -f tst.awk file
X       row1    row2    row3    row4
column1 0       3       6       9
column2 1       4       7       10
column3 2       5       8       11

which uses almost no memory, but reads the input file once per field on a line (so a 100-column file is read 100 times), which makes it much slower than the version that reads the whole file into memory. It also assumes the number of fields is the same on every line, and it uses GNU awk for ENDFILE and ARGIND, but any awk can do the same with tests on FNR==1 and END.
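For completeness, here's what that portable fallback might look like (a hypothetical sketch, not code from the answer): an outer shell loop makes one POSIX-awk pass per column, with the same low-memory, many-passes trade-off:

```shell
printf 'a\tb\nc\td\n' > sample                     # tiny tab-separated demo input
cols=$(head -n 1 sample | awk -F'\t' '{print NF}') # number of columns
for i in $(seq "$cols"); do
  # emit column $i of every line as one tab-separated output row
  awk -F'\t' -v col="$i" '{printf "%s%s", (FNR > 1 ? "\t" : ""), $col} END {print ""}' sample
done
```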

ghostdog74 answered May 09 '26


csvtk

-t uses tab as the delimiter and -d specifies a custom delimiter:

$ seq 4|paste - -|csvtk -t transpose
1   3
2   4
$ seq 4|paste -d, - -|csvtk -d, transpose
1,3
2,4

-l (--lazy-quotes) allows bare double quotes inside a field that is not surrounded by double quotes, but I don't think there's any way to disable CSV-style escaping of double quotes in the output:

$ csvtk -t transpose<<<$'aa"bb\t2\n3\t4'
[ERRO] parse error on line 1, column 3: bare " in non-quoted-field
$ csvtk -lt transpose<<<$'aa"bb\t2\n3\t4'
"aa""bb"    3
2   4

Another caveat is that -l doesn't prevent removing double quotes around fields that don't need to be quoted in CSV:

$ csvtk -lt transpose<<<$'"1"\t2\n3\t4'
1   3
2   4

awk

Gawk version which uses arrays of arrays:

tp(){ awk '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' "${1+FS=$1}";}

Plain awk version which uses multidimensional arrays (this took about twice as long to run in my benchmark):

tp(){ awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}' "${1+FS=$1}";}
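A quick usage sketch of the plain awk function (redefined here so the snippet stands alone):

```shell
# Plain-awk tp from above; pass the field separator as an optional argument.
tp(){ awk '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}' "${1+FS=$1}";}

seq 4 | paste -d, - - | tp ,   # 1,2 / 3,4 becomes 1,3 / 2,4
```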

macOS comes with a version of Brian Kernighan's nawk from 2007 which doesn't support arrays of arrays.

To use space as a field separator without collapsing sequences of multiple spaces, use FS='[ ]'.
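A quick check of that behavior: the default FS collapses runs of blanks, while the single-character class [ ] treats every space as a separator and so preserves empty fields:

```shell
printf 'a  b\n' | awk '{print NF}'          # default FS: prints 2
printf 'a  b\n' | awk -F'[ ]' '{print NF}'  # FS='[ ]': prints 3 (middle field is empty)
```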

rs

rs is a BSD utility which also comes with macOS, but it should be available from package managers on other platforms. It is named after the reshape function in APL.

Use sequences of spaces and tabs as column separator:

rs -T

Use tab as column separator:

rs -c -C -T

Use comma as column separator:

rs -c, -C, -T

-c changes the input column separator and -C changes the output column separator. A lone -c or -C sets the separator to tab. -T transposes rows and columns.

Do not use -t instead of -T, because it selects the number of output columns so that the output lines fill the width of the display (which is 80 characters by default but which can be changed with -w).

When an output column separator is specified using -C, an extra column separator character is added to the end of each row, but you can remove it with sed:

$ seq 4|paste -d, - -|rs -c, -C, -T
1,3,
2,4,
$ seq 4|paste -d, - -|rs -c, -C, -T|sed s/.\$//
1,3
2,4

rs -T determines the number of columns based on the number of columns on the first row, so it produces the wrong result when the first line ends with one or more empty columns:

$ rs -c, -C, -T<<<$'1,\n3,4'
1,3,4,

R

The t function transposes a matrix or a dataframe:

Rscript -e 'write.table(t(read.table("stdin",sep=",",quote="",comment.char="")),sep=",",quote=F,col.names=F,row.names=F)'

If you replace Rscript -e with R -e, then it echoes the code being run to STDOUT, and it also produces the error "ignoring SIGPIPE signal" if the R command is followed by a command like head -n1 that exits before it has read the whole STDIN.

quote="" can be removed if the input doesn't contain double quotes or single quotes, and comment.char="" can be removed if the input doesn't contain lines that start with a hash character.

For large input, data.table::fread is usually faster than read.table and data.table::fwrite is faster than write.table:

$ seq 1e6|awk 'ORS=NR%1e3?FS:RS'>a
$ time Rscript --no-init-file -e 'write.table(t(read.table("a")),quote=F,col.names=F,row.names=F)'>/dev/null
real  0m1.061s
user  0m0.983s
sys   0m0.074s
$ time Rscript --no-init-file -e 'write.table(t(data.table::fread("a")),quote=F,col.names=F,row.names=F)'>/dev/null

real  0m0.599s
user  0m0.535s
sys   0m0.048s
$ time Rscript --no-init-file -e 'data.table::fwrite(t(data.table::fread("a")),sep=" ",col.names=F)'>/dev/null
x being coerced from class: matrix to data.table

real  0m0.375s
user  0m0.296s
sys   0m0.073s

jq

tp(){ jq -R .|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]';}

jq -R . prints each input line as a JSON string literal, -s (--slurp) creates an array for the input lines after parsing each line as JSON, and -r (--raw-output) outputs the contents of strings instead of JSON string literals. The / operator is overloaded to split strings.

Ruby

ruby -e'STDIN.map{|x|x.chomp.split(",",-1)}.transpose.each{|x|puts x*","}'

The -1 argument to split disables discarding empty fields at the end:

$ ruby -e'p"a,,".split(",")'
["a"]
$ ruby -e'p"a,,".split(",",-1)'
["a", "", ""]

Function form:

$ tp(){ ruby -e's=ARGV[0];STDIN.map{|x|x.chomp.split(s==" "?/ /:s,-1)}.transpose.each{|x|puts x*s}' -- "${1-$'\t'}";}
$ seq 4|paste -d, - -|tp ,
1,3
2,4

The function above uses s==" "?/ /:s because when the argument to the split function is a single space, it enables awk-like special behavior where strings are split based on contiguous runs of spaces and tabs:

$ ruby -e'p" a  \tb ".split(" ",-1)'
["a", "b", ""]
$ ruby -e'p" a  \tb ".split(/ /,-1)'
["", "a", "", "\tb", ""]

C++

Compile with g++ -O3 -o tp tp.cpp:

#include<iostream>
#include<vector>
#include<sstream>

using namespace std;

int main(int argc,char**argv){
  vector<vector<string> >table;
  vector<string>row;
  string line,field;
  char sep=argc==1?'\t':argv[1][0];

  // read the whole input into a table of rows
  while(getline(cin,line)){
    row.clear();
    stringstream str(line);
    while(getline(str,field,sep))row.push_back(field);
    table.push_back(row);
  }
  if(table.empty())return 0;

  // print column i of every input row as output row i
  int rows=table.size(),cols=table[0].size();
  for(int i=0;i<cols;i++){
    cout<<table[0][i];
    for(int j=1;j<rows;j++)cout<<sep<<table[j][i];
    cout<<'\n';
  }
}

Benchmark

csvtk was the fastest and jq was the slowest:

$ seq 1e6|awk 'ORS=NR%1e3?"\t":"\n"'>test
$ TIMEFORMAT=%R
$ time ./tp<test>/dev/null # first run of C++ program is slower
0.895
$ time ./tp<test>/dev/null
0.520
$ time csvtk -t transpose<test>/dev/null
0.142
$ time rs -c -C -T<test|sed $'s/\t$//'>/dev/null
0.587
$ time gawk -F\\t '{for(i=1;i<=NF;i++)a[i][NR]=$i}END{for(i in a)for(j in a[i])printf"%s"(j==NR?RS:FS),a[i][j]}' test>/dev/null
1.119
$ time awk -F\\t '{for(i=1;i<=NF;i++)a[i,NR]=$i}END{for(i=1;i<=NF;i++)for(j=1;j<=NR;j++)printf"%s"(j==NR?RS:FS),a[i,j]}'<test>/dev/null
1.858
$ time jq -R . test|jq --arg x "${1-$'\t'}" -sr 'map(./$x)|transpose|map(join($x))[]'>/dev/null
3.604
$ time ruby -e'STDIN.map{|x|x.chomp.split("\t",-1)}.transpose.each{|x|puts x*"\t"}'<test>/dev/null
0.492
nisetama answered May 08 '26


