Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to separate column using character length with AWK

Tags:

bash

awk

I have a data, which is not formatted according to proper field separator (which is awk good at). But, what I know is, the data is fixed width.

 NODE     S1           S2           S3           SINT         SEQV    
   1  0.14919     -0.58396E-001-0.71230      0.86149      0.77873     
   2  0.56037E-001 0.23261E-002-0.37154      0.42757      0.40341     
   3  0.52036E-001 0.19762E-001-0.27222      0.32426      0.30939     
   4  0.59765E-001 0.22059E-001-0.24529      0.30505      0.28806     
   5  0.70704E-001-0.51976E-002-0.13862      0.20932      0.18354     
   6  0.11906      0.44607E-001-0.17493      0.29399      0.26474     
   7  0.25540      0.95993E-002-0.43110      0.68650      0.60246     
   8  0.52246E-001-0.47008E-001-0.35167      0.40391      0.36456     
   9  0.32215E-001-0.62291E-001-0.28800      0.32021      0.28497     
  10  0.28072E-001-0.68269E-001-0.28304      0.31111      0.27586     
  11  0.25990E-001-0.78663E-001-0.28626      0.31225      0.27527     
  12  0.26657E-001-0.79217E-001-0.29507      0.32173      0.28400     

The expected output is something like this (the numbers in other columns can have format like in S2, in other parts of the data):

  NODE       S1           S2           S3           SINT         SEQV    
   1    0.14919       -0.58396E-001  -0.71230      0.86149      0.77873     
   2    0.56037E-001   0.23261E-002  -0.37154      0.42757      0.40341     
   3    0.52036E-001   0.19762E-001  -0.27222      0.32426      0.30939     
   4    0.59765E-001   0.22059E-001  -0.24529      0.30505      0.28806     
   5    0.70704E-001  -0.51976E-002  -0.13862      0.20932      0.18354     
   6    0.11906        0.44607E-001  -0.17493      0.29399      0.26474     
   7    0.25540        0.95993E-002  -0.43110      0.68650      0.60246     
   8    0.52246E-001  -0.47008E-001  -0.35167      0.40391      0.36456     
   9    0.32215E-001  -0.62291E-001  -0.28800      0.32021      0.28497     
  10    0.28072E-001  -0.68269E-001  -0.28304      0.31111      0.27586     
  11    0.25990E-001  -0.78663E-001  -0.28626      0.31225      0.27527     
  12    0.26657E-001  -0.79217E-001  -0.29507      0.32173      0.28400     

The main problem is, sometimes the column are separated by space, sometime no space ('-' sign occupies the space), that is, no proper field separator. I found a similar question here, but that data was consistent in terms of field separator. There might be two ways, according to my thinking,

  1. Use character length. For example, starting from the second line, the characters are either 74 or 75 per line.
  2. Use field width: Don't know if it's possible with AWK.

I'm new to AWK, I know it might be very easy with some other tools, but I would like to know if it's possible to separate/extract these columns using awk. I am using awk in terminal in MacOS.

like image 816
massisenergy Avatar asked Oct 24 '25 16:10

massisenergy


2 Answers

You can split fields in a fixed width file in awk by specifying FIELDWIDTHS in your BEGIN block:

$ cat test
NODE     S1           S2           S3           SINT         SEQV
   1  0.14919     -0.58396E-001-0.71230      0.86149      0.77873
   2  0.56037E-001 0.23261E-002-0.37154      0.42757      0.40341
   3  0.52036E-001 0.19762E-001-0.27222      0.32426      0.30939
   4  0.59765E-001 0.22059E-001-0.24529      0.30505      0.28806
   5  0.70704E-001-0.51976E-002-0.13862      0.20932      0.18354
   6  0.11906      0.44607E-001-0.17493      0.29399      0.26474
   7  0.25540      0.95993E-002-0.43110      0.68650      0.60246
   8  0.52246E-001-0.47008E-001-0.35167      0.40391      0.36456
   9  0.32215E-001-0.62291E-001-0.28800      0.32021      0.28497
  10  0.28072E-001-0.68269E-001-0.28304      0.31111      0.27586
  11  0.25990E-001-0.78663E-001-0.28626      0.31225      0.27527
  12  0.26657E-001-0.79217E-001-0.29507      0.32173      0.28400

$ awk 'BEGIN{ FIELDWIDTHS= "5 13 13 13 13"; OFS="|"}{$1=$1}1' test
NODE |    S1       |    S2       |    S3       |    SINT
   1 | 0.14919     |-0.58396E-001|-0.71230     | 0.86149
   2 | 0.56037E-001| 0.23261E-002|-0.37154     | 0.42757
   3 | 0.52036E-001| 0.19762E-001|-0.27222     | 0.32426
   4 | 0.59765E-001| 0.22059E-001|-0.24529     | 0.30505
   5 | 0.70704E-001|-0.51976E-002|-0.13862     | 0.20932
   6 | 0.11906     | 0.44607E-001|-0.17493     | 0.29399
   7 | 0.25540     | 0.95993E-002|-0.43110     | 0.68650
   8 | 0.52246E-001|-0.47008E-001|-0.35167     | 0.40391
   9 | 0.32215E-001|-0.62291E-001|-0.28800     | 0.32021
  10 | 0.28072E-001|-0.68269E-001|-0.28304     | 0.31111
  11 | 0.25990E-001|-0.78663E-001|-0.28626     | 0.31225
  12 | 0.26657E-001|-0.79217E-001|-0.29507     | 0.32173

$ awk 'BEGIN{ FIELDWIDTHS= "5 13 13 13 13"; OFS="\t"}{$1=$1}1' test
NODE        S1              S2              S3              SINT
   1     0.14919        -0.58396E-001   -0.71230         0.86149
   2     0.56037E-001    0.23261E-002   -0.37154         0.42757
   3     0.52036E-001    0.19762E-001   -0.27222         0.32426
   4     0.59765E-001    0.22059E-001   -0.24529         0.30505
   5     0.70704E-001   -0.51976E-002   -0.13862         0.20932
   6     0.11906         0.44607E-001   -0.17493         0.29399
   7     0.25540         0.95993E-002   -0.43110         0.68650
   8     0.52246E-001   -0.47008E-001   -0.35167         0.40391
   9     0.32215E-001   -0.62291E-001   -0.28800         0.32021
  10     0.28072E-001   -0.68269E-001   -0.28304         0.31111
  11     0.25990E-001   -0.78663E-001   -0.28626         0.31225
  12     0.26657E-001   -0.79217E-001   -0.29507         0.32173

Probably have to tweak that a bit for whatever constitutes an actual field here, but that's the gist of it.

like image 112
JNevill Avatar answered Oct 26 '25 06:10

JNevill


You could also parse your number format with regex and insert space before each number, e.g. using GNU awk for gensub():

awk 'NR==1 { print; next } { print gensub(/(-?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?)/," \\1", "g") }' FILE

Output:

NODE     S1           S2           S3           SINT         SEQV    
    1   0.14919      -0.58396E-001 -0.71230       0.86149       0.77873     
    2   0.56037E-001  0.23261E-002 -0.37154       0.42757       0.40341     
    3   0.52036E-001  0.19762E-001 -0.27222       0.32426       0.30939     
    4   0.59765E-001  0.22059E-001 -0.24529       0.30505       0.28806     
    5   0.70704E-001 -0.51976E-002 -0.13862       0.20932       0.18354     
    6   0.11906       0.44607E-001 -0.17493       0.29399       0.26474     
    7   0.25540       0.95993E-002 -0.43110       0.68650       0.60246     
    8   0.52246E-001 -0.47008E-001 -0.35167       0.40391       0.36456     
    9   0.32215E-001 -0.62291E-001 -0.28800       0.32021       0.28497     
   10   0.28072E-001 -0.68269E-001 -0.28304       0.31111       0.27586     
   11   0.25990E-001 -0.78663E-001 -0.28626       0.31225       0.27527     
   12   0.26657E-001 -0.79217E-001 -0.29507       0.32173       0.28400  

or using any awk:

$ awk 'NR>1{ gsub(/-?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?/," &") } 1' file
 NODE     S1           S2           S3           SINT         SEQV
    1   0.14919      -0.58396E-001 -0.71230       0.86149       0.77873
    2   0.56037E-001  0.23261E-002 -0.37154       0.42757       0.40341
    3   0.52036E-001  0.19762E-001 -0.27222       0.32426       0.30939
    4   0.59765E-001  0.22059E-001 -0.24529       0.30505       0.28806
    5   0.70704E-001 -0.51976E-002 -0.13862       0.20932       0.18354
    6   0.11906       0.44607E-001 -0.17493       0.29399       0.26474
    7   0.25540       0.95993E-002 -0.43110       0.68650       0.60246
    8   0.52246E-001 -0.47008E-001 -0.35167       0.40391       0.36456
    9   0.32215E-001 -0.62291E-001 -0.28800       0.32021       0.28497
   10   0.28072E-001 -0.68269E-001 -0.28304       0.31111       0.27586
   11   0.25990E-001 -0.78663E-001 -0.28626       0.31225       0.27527
   12   0.26657E-001 -0.79217E-001 -0.29507       0.32173       0.28400

With this method you don't have to know the number of fields and field widths, therefore it can be more robust against data format changes.

like image 33
Andriy Makukha Avatar answered Oct 26 '25 05:10

Andriy Makukha



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!