
Renumbering a column based on occurrence of a string

Tags: sed, awk, seq

I'm fairly new to Linux, so I apologize in advance.

I have a file as such:

1   C   foo   C     bar
2   C   foo   C     bar
3   C   foo   C     bar
4   H   foo   H     bar
5   H   foo   H     bar
6   O   foo   O     bar

And I need to get it to be:

1   C01 foo   C     bar
2   C02 foo   C     bar
3   C03 foo   C     bar
4   H01 foo   H     bar
5   H02 foo   H     bar
6   O01 foo   O     bar

Unfortunately, the spacing between foo and C, as well as the spacing between C and bar, must be maintained.

I have tried doing it piece by piece: I pull out the lines containing each identifier (C, H, and O) and place them in temp files. Then I attempt to number them by occurrence and splice the original file back together.

    #!/bin/bash

    sed -i -e "/ C /w temp1.txt" -e "//d" File.txt
    sed -i -e "/ H /w temp2.txt" -e "//d" File.txt
    sed -i -e "/ O /w temp3.txt" -e "//d" File.txt


    awk -i '{print NR $2}' temp1.txt
    awk -i '{print NR $2}' temp2.txt
    awk -i '{print NR $2}' temp3.txt

    cat temp1.txt >> File.txt
    cat temp2.txt >> File.txt
    cat temp3.txt >> File.txt

However, I am pretty sure my syntax is awful, as I am really only familiar with sed, not awk.

Any help would be greatly appreciated, thank you.

Wagner AG asked Dec 06 '25 23:12


2 Answers

The same counting approach, but preserving the initial field positions:

$ awk '{r=sprintf("%02d",++a[$2]); sub($2"  ",$2r)}1' file

1   C01 foo   C     bar
2   C02 foo   C     bar
3   C03 foo   C     bar
4   H01 foo   H     bar
5   H02 foo   H     bar
6   O01 foo   O     bar

Note that this assumes the first field's values don't overlap with the second field's values, as in the sample shown; otherwise you need a guard so that only the second field is changed. For the second field, that is easily done by prefixing both the match and the replacement with a single space.
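A minimal sketch of that guard, assuming the second field is always preceded by at least one space and followed by at least two (as in the sample). The function name `renumber_guarded` and the demo input are only for illustration:

```shell
# Hypothetical wrapper around the guarded one-liner; reads files or stdin.
renumber_guarded() {
  # Prefix both the pattern and the replacement with a space so that a
  # matching first field is never touched.
  awk '{ r = sprintf("%02d", ++a[$2]); sub(" " $2 "  ", " " $2 r) } 1' "$@"
}

# Demo: the first field is also "C", yet only the second field changes.
printf '%s\n' 'C   C   foo   C     bar' 'C   C   foo   C     bar' | renumber_guarded
# C   C01 foo   C     bar
# C   C02 foo   C     bar
```

As in the original one-liner, `$2` is used as a regular expression here, so this sketch is only safe for plain identifiers such as C, H, or O.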

karakfa answered Dec 08 '25 23:12


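EDIT: Here is a solution with GNU awk that preserves the actual spaces. It only works if your split supports 4 arguments; I found this in the man page and I think it will be helpful. Note that the separators are kept verbatim, so the text after the second field shifts right by two characters, as the output shows.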

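awk '
{
  # Split the line on whitespace; b[] receives the separators (GNU awk only).
  n=split($0,array," ",b)
  # Append a zero-padded, per-value counter to the second field.
  array[2]=sprintf("%s%02d",array[2],++a[array[2]])
  # Rebuild the line: leading whitespace, then each field plus its separator.
  line=b[0]
  for(i=1;i<=n;i++){
    line=(line array[i] b[i])
  }
  print line
}'  Input_file
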
1   C01   foo   C     bar
2   C02   foo   C     bar
3   C03   foo   C     bar
4   H01   foo   H     bar
5   H02   foo   H     bar
6   O01   foo   O     bar

About split in GNU awk man page for 4 arguments:

   split(s, a [, r [, seps] ])
       Split the string s into the array a and the separators array seps on the regular expression r, and return the number of fields. If r is omitted, FS is used instead. The arrays a and seps are cleared first. seps[i] is the field separator matched by r between a[i] and a[i+1]. If r is a single space, then leading whitespace in s goes into the extra array element seps[0] and trailing whitespace goes into the extra array element seps[n], where n is the return value of split(s, a, r, seps). Splitting behaves identically to field splitting, described above.
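If your awk's split does not accept the fourth argument, the same effect can be sketched with match() and substr(), which are available in POSIX awk. This is only an illustration under the assumption that fields are separated by spaces or tabs; the function name `renumber_portable` and the variable names are made up for the example:

```shell
# Hypothetical portable variant: no 4-argument split needed.
renumber_portable() {
  awk '
  NF < 2 { print; next }                  # nothing to renumber on short lines
  {
    # Leading blanks, field 1, and the blanks before field 2.
    match($0, /^[ \t]*[^ \t]+[ \t]+/)
    head = substr($0, 1, RLENGTH)
    rest = substr($0, RLENGTH + 1)        # now starts at field 2
    # Peel field 2 off the front of the remainder.
    match(rest, /^[^ \t]+/)
    key  = substr(rest, 1, RLENGTH)
    tail = substr(rest, RLENGTH + 1)      # separators and later fields, untouched
    printf "%s%s%02d%s\n", head, key, ++count[key], tail
  }' "$@"
}

printf '%s\n' '1   C   foo   C     bar' '2   C   foo   C     bar' | renumber_portable
# 1   C01   foo   C     bar
# 2   C02   foo   C     bar
```

Like the split-based version, this keeps every separator exactly as it was, so the text after the second field moves right by two characters.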



1st solution: Please try the following.

awk '{$2=sprintf("%s%02d",$2,++a[$2])} 1' Input_file

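The output will be as follows. Note that assigning to $2 makes awk rebuild the line with single spaces, so the original spacing is lost.

1 C01 foo C bar
2 C02 foo C bar
3 C03 foo C bar
4 H01 foo H bar
5 H02 foo H bar
6 O01 foo O bar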
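The single-space output comes from how awk treats field assignment: changing any field makes it rebuild the whole record using OFS, which is one space by default. A tiny demonstration (the function name `collapse_demo` is just for illustration):

```shell
# Hypothetical demo: modifying $2 rebuilds $0 with OFS (a single space).
collapse_demo() {
  awk '{ $2 = $2 "01" } 1' "$@"
}

printf 'a   b     c\n' | collapse_demo
# a b01 c
```

That is why the sub() and split() based answers are needed when the original spacing has to survive.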

2nd solution: If you want the new value in both $2 and $4, do the following.

awk '{$2=$4=sprintf("%s%02d",$2,++a[$2])} 1'  Input_file

1 C01 foo C01 bar
2 C02 foo C02 bar
3 C03 foo C03 bar
4 H01 foo H01 bar
5 H02 foo H02 bar
6 O01 foo O01 bar

3rd solution: If you want to append the new value as a new last column instead, do the following.

awk '{$(NF+1)=sprintf("%s%02d",$2,++a[$2])} 1'  Input_file

1 C foo C bar C01
2 C foo C bar C02
3 C foo C bar C03
4 H foo H bar H01
5 H foo H bar H02
6 O foo O bar O01

RavinderSingh13 answered Dec 09 '25 00:12