I have a data, which is not formatted according to proper field separator (which is awk good at). But, what I know is, the data is fixed width.
NODE S1 S2 S3 SINT SEQV
1 0.14919 -0.58396E-001-0.71230 0.86149 0.77873
2 0.56037E-001 0.23261E-002-0.37154 0.42757 0.40341
3 0.52036E-001 0.19762E-001-0.27222 0.32426 0.30939
4 0.59765E-001 0.22059E-001-0.24529 0.30505 0.28806
5 0.70704E-001-0.51976E-002-0.13862 0.20932 0.18354
6 0.11906 0.44607E-001-0.17493 0.29399 0.26474
7 0.25540 0.95993E-002-0.43110 0.68650 0.60246
8 0.52246E-001-0.47008E-001-0.35167 0.40391 0.36456
9 0.32215E-001-0.62291E-001-0.28800 0.32021 0.28497
10 0.28072E-001-0.68269E-001-0.28304 0.31111 0.27586
11 0.25990E-001-0.78663E-001-0.28626 0.31225 0.27527
12 0.26657E-001-0.79217E-001-0.29507 0.32173 0.28400
The expected output is something like this (the numbers in other columns can have format like in S2, in other parts of the data):
NODE S1 S2 S3 SINT SEQV
1 0.14919 -0.58396E-001 -0.71230 0.86149 0.77873
2 0.56037E-001 0.23261E-002 -0.37154 0.42757 0.40341
3 0.52036E-001 0.19762E-001 -0.27222 0.32426 0.30939
4 0.59765E-001 0.22059E-001 -0.24529 0.30505 0.28806
5 0.70704E-001 -0.51976E-002 -0.13862 0.20932 0.18354
6 0.11906 0.44607E-001 -0.17493 0.29399 0.26474
7 0.25540 0.95993E-002 -0.43110 0.68650 0.60246
8 0.52246E-001 -0.47008E-001 -0.35167 0.40391 0.36456
9 0.32215E-001 -0.62291E-001 -0.28800 0.32021 0.28497
10 0.28072E-001 -0.68269E-001 -0.28304 0.31111 0.27586
11 0.25990E-001 -0.78663E-001 -0.28626 0.31225 0.27527
12 0.26657E-001 -0.79217E-001 -0.29507 0.32173 0.28400
The main problem is, sometimes the column are separated by space, sometime no space ('-' sign occupies the space), that is, no proper field separator. I found a similar question here, but that data was consistent in terms of field separator. There might be two ways, according to my thinking,
I'm new to AWK, I know it might be very easy with some other tools, but I would like to know if it's possible to separate/extract these columns using awk. I am using awk in terminal in MacOS.
You can split fields in a fixed width file in awk by specifying FIELDWIDTHS in your BEGIN block:
$ cat test
NODE S1 S2 S3 SINT SEQV
1 0.14919 -0.58396E-001-0.71230 0.86149 0.77873
2 0.56037E-001 0.23261E-002-0.37154 0.42757 0.40341
3 0.52036E-001 0.19762E-001-0.27222 0.32426 0.30939
4 0.59765E-001 0.22059E-001-0.24529 0.30505 0.28806
5 0.70704E-001-0.51976E-002-0.13862 0.20932 0.18354
6 0.11906 0.44607E-001-0.17493 0.29399 0.26474
7 0.25540 0.95993E-002-0.43110 0.68650 0.60246
8 0.52246E-001-0.47008E-001-0.35167 0.40391 0.36456
9 0.32215E-001-0.62291E-001-0.28800 0.32021 0.28497
10 0.28072E-001-0.68269E-001-0.28304 0.31111 0.27586
11 0.25990E-001-0.78663E-001-0.28626 0.31225 0.27527
12 0.26657E-001-0.79217E-001-0.29507 0.32173 0.28400
$ awk 'BEGIN{ FIELDWIDTHS= "5 13 13 13 13"; OFS="|"}{$1=$1}1' test
NODE | S1 | S2 | S3 | SINT
1 | 0.14919 |-0.58396E-001|-0.71230 | 0.86149
2 | 0.56037E-001| 0.23261E-002|-0.37154 | 0.42757
3 | 0.52036E-001| 0.19762E-001|-0.27222 | 0.32426
4 | 0.59765E-001| 0.22059E-001|-0.24529 | 0.30505
5 | 0.70704E-001|-0.51976E-002|-0.13862 | 0.20932
6 | 0.11906 | 0.44607E-001|-0.17493 | 0.29399
7 | 0.25540 | 0.95993E-002|-0.43110 | 0.68650
8 | 0.52246E-001|-0.47008E-001|-0.35167 | 0.40391
9 | 0.32215E-001|-0.62291E-001|-0.28800 | 0.32021
10 | 0.28072E-001|-0.68269E-001|-0.28304 | 0.31111
11 | 0.25990E-001|-0.78663E-001|-0.28626 | 0.31225
12 | 0.26657E-001|-0.79217E-001|-0.29507 | 0.32173
$ awk 'BEGIN{ FIELDWIDTHS= "5 13 13 13 13"; OFS="\t"}{$1=$1}1' test
NODE S1 S2 S3 SINT
1 0.14919 -0.58396E-001 -0.71230 0.86149
2 0.56037E-001 0.23261E-002 -0.37154 0.42757
3 0.52036E-001 0.19762E-001 -0.27222 0.32426
4 0.59765E-001 0.22059E-001 -0.24529 0.30505
5 0.70704E-001 -0.51976E-002 -0.13862 0.20932
6 0.11906 0.44607E-001 -0.17493 0.29399
7 0.25540 0.95993E-002 -0.43110 0.68650
8 0.52246E-001 -0.47008E-001 -0.35167 0.40391
9 0.32215E-001 -0.62291E-001 -0.28800 0.32021
10 0.28072E-001 -0.68269E-001 -0.28304 0.31111
11 0.25990E-001 -0.78663E-001 -0.28626 0.31225
12 0.26657E-001 -0.79217E-001 -0.29507 0.32173
Probably have to tweak that a bit for whatever constitutes an actual field here, but that's the gist of it.
You could also parse your number format with regex and insert space before each number, e.g. using GNU awk for gensub():
awk 'NR==1 { print; next } { print gensub(/(-?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?)/," \\1", "g") }' FILE
Output:
NODE S1 S2 S3 SINT SEQV
1 0.14919 -0.58396E-001 -0.71230 0.86149 0.77873
2 0.56037E-001 0.23261E-002 -0.37154 0.42757 0.40341
3 0.52036E-001 0.19762E-001 -0.27222 0.32426 0.30939
4 0.59765E-001 0.22059E-001 -0.24529 0.30505 0.28806
5 0.70704E-001 -0.51976E-002 -0.13862 0.20932 0.18354
6 0.11906 0.44607E-001 -0.17493 0.29399 0.26474
7 0.25540 0.95993E-002 -0.43110 0.68650 0.60246
8 0.52246E-001 -0.47008E-001 -0.35167 0.40391 0.36456
9 0.32215E-001 -0.62291E-001 -0.28800 0.32021 0.28497
10 0.28072E-001 -0.68269E-001 -0.28304 0.31111 0.27586
11 0.25990E-001 -0.78663E-001 -0.28626 0.31225 0.27527
12 0.26657E-001 -0.79217E-001 -0.29507 0.32173 0.28400
or using any awk:
$ awk 'NR>1{ gsub(/-?[0-9]+(\.[0-9]+)?([Ee][+-]?[0-9]+)?/," &") } 1' file
NODE S1 S2 S3 SINT SEQV
1 0.14919 -0.58396E-001 -0.71230 0.86149 0.77873
2 0.56037E-001 0.23261E-002 -0.37154 0.42757 0.40341
3 0.52036E-001 0.19762E-001 -0.27222 0.32426 0.30939
4 0.59765E-001 0.22059E-001 -0.24529 0.30505 0.28806
5 0.70704E-001 -0.51976E-002 -0.13862 0.20932 0.18354
6 0.11906 0.44607E-001 -0.17493 0.29399 0.26474
7 0.25540 0.95993E-002 -0.43110 0.68650 0.60246
8 0.52246E-001 -0.47008E-001 -0.35167 0.40391 0.36456
9 0.32215E-001 -0.62291E-001 -0.28800 0.32021 0.28497
10 0.28072E-001 -0.68269E-001 -0.28304 0.31111 0.27586
11 0.25990E-001 -0.78663E-001 -0.28626 0.31225 0.27527
12 0.26657E-001 -0.79217E-001 -0.29507 0.32173 0.28400
With this method you don't have to know the number of fields and field widths, therefore it can be more robust against data format changes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With