How to separate column using character length with AWK

Question

I have a data, which is not formatted according to proper field separator (which is awk good at). But, what I know is, the data is fixed width.

 NODE     S1           S2           S3           SINT         SEQV    
   1  0.14919     -0.58396E-001-0.71230      0.86149      0.77873     
   2  0.56037E-001 0.23261E-002-0.37154      0.42757      0.40341     
   3  0.52036E-001 0.19762E-001-0.27222      0.32426      0.30939     
   4  0.59765E-001 0.22059E-001-0.24529      0.30505      0.28806     
   5  0.70704E-001-0.51976E-002-0.13862      0.20932      0.18354     
   6  0.11906      0.44607E-001-0.17493      0.29399      0.26474     
   7  0.25540      0.95993E-002-0.43110      0.68650      0.60246     
   8  0.52246E-001-0.47008E-001-0.35167      0.40391      0.36456     
   9  0.32215E-001-0.62291E-001-0.28800      0.32021      0.28497     
  10  0.28072E-001-0.68269E-001-0.28304      0.31111      0.27586     
  11  0.25990E-001-0.78663E-001-0.28626      0.31225      0.27527     
  12  0.26657E-001-0.79217E-001-0.29507      0.32173      0.28400

The expected output is something like this (the numbers in other columns can have format like in S2, in other parts of the data):

  NODE       S1           S2           S3           SINT         SEQV    
   1    0.14919       -0.58396E-001  -0.71230      0.86149      0.77873     
   2    0.56037E-001   0.23261E-002  -0.37154      0.42757      0.40341     
   3    0.52036E-001   0.19762E-001  -0.27222      0.32426      0.30939     
   4    0.59765E-001   0.22059E-001  -0.24529      0.30505      0.28806     
   5    0.70704E-001  -0.51976E-002  -0.13862      0.20932      0.18354     
   6    0.11906        0.44607E-001  -0.17493      0.29399      0.26474     
   7    0.25540        0.95993E-002  -0.43110      0.68650      0.60246     
   8    0.52246E-001  -0.47008E-001  -0.35167      0.40391      0.36456     
   9    0.32215E-001  -0.62291E-001  -0.28800      0.32021      0.28497     
  10    0.28072E-001  -0.68269E-001  -0.28304      0.31111      0.27586     
  11    0.25990E-001  -0.78663E-001  -0.28626      0.31225      0.27527     
  12    0.26657E-001  -0.79217E-001  -0.29507      0.32173      0.28400

The main problem is, sometimes the column are separated by space, sometime no space ('-' sign occupies the space), that is, no proper field separator. I found a similar question here, but that data was consistent in terms of field separator. There might be two ways, according to my thinking,

Use character length. For example, starting from the second line, the characters are either 74 or 75 per line.
Use field width: Don't know if it's possible with AWK.

I'm new to AWK, I know it might be very easy with some other tools, but I would like to know if it's possible to separate/extract these columns using awk. I am using awk in terminal in MacOS.

JNevill · Accepted Answer

You can split fields in a fixed width file in awk by specifying FIELDWIDTHS in your BEGIN block:

$ cat test
NODE     S1           S2           S3           SINT         SEQV
   1  0.14919     -0.58396E-001-0.71230      0.86149      0.77873
   2  0.56037E-001 0.23261E-002-0.37154      0.42757      0.40341
   3  0.52036E-001 0.19762E-001-0.27222      0.32426      0.30939
   4  0.59765E-001 0.22059E-001-0.24529      0.30505      0.28806
   5  0.70704E-001-0.51976E-002-0.13862      0.20932      0.18354
   6  0.11906      0.44607E-001-0.17493      0.29399      0.26474
   7  0.25540      0.95993E-002-0.43110      0.68650      0.60246
   8  0.52246E-001-0.47008E-001-0.35167      0.40391      0.36456
   9  0.32215E-001-0.62291E-001-0.28800      0.32021      0.28497
  10  0.28072E-001-0.68269E-001-0.28304      0.31111      0.27586
  11  0.25990E-001-0.78663E-001-0.28626      0.31225      0.27527
  12  0.26657E-001-0.79217E-001-0.29507      0.32173      0.28400

$ awk 'BEGIN{ FIELDWIDTHS= "5 13 13 13 13"; OFS="|"}{$1=$1}1' test
NODE |    S1       |    S2       |    S3       |    SINT
   1 | 0.14919     |-0.58396E-001|-0.71230     | 0.86149
   2 | 0.56037E-001| 0.23261E-002|-0.37154     | 0.42757
   3 | 0.52036E-001| 0.19762E-001|-0.27222     | 0.32426
   4 | 0.59765E-001| 0.22059E-001|-0.24529     | 0.30505
   5 | 0.70704E-001|-0.51976E-002|-0.13862     | 0.20932
   6 | 0.11906     | 0.44607E-001|-0.17493     | 0.29399
   7 | 0.25540     | 0.95993E-002|-0.43110     | 0.68650
   8 | 0.52246E-001|-0.47008E-001|-0.35167     | 0.40391
   9 | 0.32215E-001|-0.62291E-001|-0.28800     | 0.32021
  10 | 0.28072E-001|-0.68269E-001|-0.28304     | 0.31111
  11 | 0.25990E-001|-0.78663E-001|-0.28626     | 0.31225
  12 | 0.26657E-001|-0.79217E-001|-0.29507     | 0.32173

$ awk 'BEGIN{ FIELDWIDTHS= "5 13 13 13 13"; OFS="	"}{$1=$1}1' test
NODE        S1              S2              S3              SINT
   1     0.14919        -0.58396E-001   -0.71230         0.86149
   2     0.56037E-001    0.23261E-002   -0.37154         0.42757
   3     0.52036E-001    0.19762E-001   -0.27222         0.32426
   4     0.59765E-001    0.22059E-001   -0.24529         0.30505
   5     0.70704E-001   -0.51976E-002   -0.13862         0.20932
   6     0.11906         0.44607E-001   -0.17493         0.29399
   7     0.25540         0.95993E-002   -0.43110         0.68650
   8     0.52246E-001   -0.47008E-001   -0.35167         0.40391
   9     0.32215E-001   -0.62291E-001   -0.28800         0.32021
  10     0.28072E-001   -0.68269E-001   -0.28304         0.31111
  11     0.25990E-001   -0.78663E-001   -0.28626         0.31225
  12     0.26657E-001   -0.79217E-001   -0.29507         0.32173

Probably have to tweak that a bit for whatever constitutes an actual field here, but that's the gist of it.

Andriy Makukha · Answer

You could also parse your number format with regex and insert space before each number, e.g. using GNU awk for gensub():

awk 'NR==1 { print; next } { print gensub(/(-?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?)/," \1", "g") }' FILE

Output:

NODE     S1           S2           S3           SINT         SEQV    
    1   0.14919      -0.58396E-001 -0.71230       0.86149       0.77873     
    2   0.56037E-001  0.23261E-002 -0.37154       0.42757       0.40341     
    3   0.52036E-001  0.19762E-001 -0.27222       0.32426       0.30939     
    4   0.59765E-001  0.22059E-001 -0.24529       0.30505       0.28806     
    5   0.70704E-001 -0.51976E-002 -0.13862       0.20932       0.18354     
    6   0.11906       0.44607E-001 -0.17493       0.29399       0.26474     
    7   0.25540       0.95993E-002 -0.43110       0.68650       0.60246     
    8   0.52246E-001 -0.47008E-001 -0.35167       0.40391       0.36456     
    9   0.32215E-001 -0.62291E-001 -0.28800       0.32021       0.28497     
   10   0.28072E-001 -0.68269E-001 -0.28304       0.31111       0.27586     
   11   0.25990E-001 -0.78663E-001 -0.28626       0.31225       0.27527     
   12   0.26657E-001 -0.79217E-001 -0.29507       0.32173       0.28400

or using any awk:

$ awk 'NR>1{ gsub(/-?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?/," &") } 1' file
 NODE     S1           S2           S3           SINT         SEQV
    1   0.14919      -0.58396E-001 -0.71230       0.86149       0.77873
    2   0.56037E-001  0.23261E-002 -0.37154       0.42757       0.40341
    3   0.52036E-001  0.19762E-001 -0.27222       0.32426       0.30939
    4   0.59765E-001  0.22059E-001 -0.24529       0.30505       0.28806
    5   0.70704E-001 -0.51976E-002 -0.13862       0.20932       0.18354
    6   0.11906       0.44607E-001 -0.17493       0.29399       0.26474
    7   0.25540       0.95993E-002 -0.43110       0.68650       0.60246
    8   0.52246E-001 -0.47008E-001 -0.35167       0.40391       0.36456
    9   0.32215E-001 -0.62291E-001 -0.28800       0.32021       0.28497
   10   0.28072E-001 -0.68269E-001 -0.28304       0.31111       0.27586
   11   0.25990E-001 -0.78663E-001 -0.28626       0.31225       0.27527
   12   0.26657E-001 -0.79217E-001 -0.29507       0.32173       0.28400

With this method you don't have to know the number of fields and field widths, therefore it can be more robust against data format changes.

How to separate column using character length with AWK

Tags:

bash

awk

massisenergy

2 Answers

JNevill

Andriy Makukha

Recent Activity

Donate For Us

How to separate column using character length with AWK

Tags:

bash

awk

massisenergy

2 Answers

JNevill

Andriy Makukha

Related questions

Recent Activity

Donate For Us