
How to remove duplicate comma-separated strings using awk

Tags: bash, csv, awk

I have a CSV file like this (named test2.csv):

lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,Sentence Skills 104,Sentence Skills 104,Elementary Algebra 38,Elementary Algebra 38,

I want to remove the duplicate entries.

The closest I have gotten is the following awk command:

awk '{a[$0]++} END {for (i in a) print RS i}' RS="," test2.csv

It works, but it causes new problems: it takes the values out of order and puts them in rows like this:

,Elementary Algebra 38
,2015-05-07 15:30:43
,Sentence Skills 104
,FirstName
,LastName
,1997-05-20
,83494989

I need to keep the order they are in and keep them on one line. (I can fix the row issue, but I don't know how to fix the order issue.)

Update with Solution:

The answer from anubhava worked great. I asked a follow-up question about removing the time from the date, and Ed Morton helped with that. Here is the full command:

awk 'BEGIN{RS=ORS=","} {sub(/ ..:..:..$/,"")} !seen[$0]++' test2.csv
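To see why the extra sub() collapses the timestamped duplicates, here is a minimal sketch on made-up input (the values are illustrative, not from test2.csv): sub(/ ..:..:..$/,"") strips a trailing " HH:MM:SS" from each record in place before the dedup check runs, so a datetime and its date-only twin hash to the same key.

```shell
# RS=ORS="," treats each comma-separated value as its own record.
# sub(/ ..:..:..$/,"") removes a trailing " HH:MM:SS" from the record,
# so both copies of the datetime become the same date string and the
# second one is suppressed by !seen[$0]++.
printf 'x,2015-05-07 15:30:43,2015-05-07 15:30:43,' \
  | awk 'BEGIN{RS=ORS=","} {sub(/ ..:..:..$/,"")} !seen[$0]++'
# prints: x,2015-05-07,
```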
asked Mar 01 '26 by moore1emu


1 Answer

You can just use this awk:

awk 'BEGIN{RS=ORS=","} !seen[$0]++' test2.csv

Output:

lastname,firstname,83494989,1997-05-20,2015-05-07 15:30:43,Sentence Skills 104,Elementary Algebra 38,
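The trick here is the classic !seen[$0]++ idiom combined with comma record separators: setting RS and ORS to "," makes awk treat every comma-separated value as a record, and seen[$0]++ is 0 (so the bare pattern is true and the record prints) only the first time a value appears. A small demo on toy input (values are illustrative, not from test2.csv):

```shell
# Each comma-separated value is one record; !seen[$0]++ is true only
# on a record's first occurrence, so duplicates are dropped while the
# original order is preserved.
printf 'a,b,a,c,b,' | awk 'BEGIN{RS=ORS=","} !seen[$0]++'
# prints: a,b,c,
```

Note that printf is used instead of echo so no trailing newline becomes a stray final record.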
answered Mar 04 '26 by anubhava


