Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

removing duplicated strings within a column with shell

Tags:

shell

sed

awk

I have a file with two columns separated by tabs as follows:

OG0000000   PF03169,PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF00083,PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I just want to remove duplicate strings within the second column, while not changing anything in the first column, so that my final output looks like this:

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

I tried to start this by using awk.

awk 'BEGIN{RS=ORS=","} !seen[$0]++' file.txt

But my output looks like this, where there are still some duplicates if the duplicated string occurs first.

OG0000000   PF03169,PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,PF07690,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,PF00012,

I realize that the problem is because the first line that awk grabs is everything until the first comma, but I'm still rough with awk commands and couldn't figure out how to fix this without messing up the first column. Thanks in advance!

like image 835
D.Parker Avatar asked Mar 19 '26 16:03

D.Parker


1 Answers

This awk should work for you:

awk -F '[\t,]' '
{
   printf "%s", $1 "\t"
   for (i=2; i<=NF; ++i) {
      if (!seen[$i]++)
         printf "%s,", $i
   }
   print ""
   delete seen
}' file

OG0000000   PF03169,MAC1_004431-T1,
OG0000002   PF07690,PF00083,
OG0000003   MAC1_000127-T1,
OG0000004   PF13246,PF00689,PF00690,
OG0000005   PF00012,PF01061,PF12697,

PS: As per the expected output shown this solution also shows a trailing comma in each line.

like image 124
anubhava Avatar answered Mar 26 '26 04:03

anubhava



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!