I would like to aggregate data from one CSV file into another, using command-line tools such as bash, sed, awk etc. Each row in each file begins with a unique ID in column 1, and if a row in the destination matches this ID, the line should be replaced, otherwise it should be appended. The input data is not sorted, but the sort order of the result doesn't matter so files can be sorted as part of the script if it helps.
For example, given the current state of aggregate.csv:
1,olddata
3,olddata
2,olddata
And the file new.csv:
4,newdata
2,changeddata
3,changeddata
I would like aggregate.csv to come out as follows (in any sort order):
1,olddata
2,changeddata
3,changeddata
4,newdata
The lines may contain a large number of columns so replacing one cell at a time is not ideal. The CSVs are guaranteed not to contain quoted newlines, so a line-by-line search and replacing a whole line at a time is a valid approach.
Apologies if this is a duplicate but I can't find another question that exactly uses this CSV merging approach.
I have tried adapting the answers to this question but it required first generating the "template file" of all IDs by parsing both files line-by-line, sorting, removing duplicates, and saving - I hoped that a simpler method was possible.
This question has answers in sed and awk that I also replicated, and managed the regex-replacement part but not a method to append a new line to the file if a match did not exist.
Thanks!
Assumptions:
One awk solution:
awk -F"," ' # input delimiter is comma ","
FNR==NR { a[$1]=$0 ; next } # first file: store contents in array a[], using field #1 as index
{ if ( $1 in a ) # second file: if field #1 is an index in array a[] then
{ print a[$1] # print the contents of said array entry and
delete a[$1] # remove entry from array
}
else # field #1 is not an index in array a[] so
print $0 # print current line
}
END { for ( i in a ) # anything still in array a[] was not in second file so
print a[i] # print these entries
}
' new.csv aggregate.csv # order of input files is important !!!
This geneates:
1,olddata
3,changeddata
2,changeddata
4,newdata
NOTE: While the output appears to be ordered by field #1, this is merely a coincidence based on the ordering of the input data; there's nothing in this awk code that explicitly generated this 'ordered' output.
This will not overwrite the aggregate.csv file. While it's possible to use the GNU awk -i inplace option to overwrite aggregate.csv, this would require a change in processing to keep from overwriting new.csv. The easiest solution will likely be to direct this output to a new file and then OP can decide if said new file should be used to overwrite aggregate.csv.
With any awk:
$ awk -F, '!seen[$1]++' new agg
4,newdata
2,changeddata
3,changeddata
1,olddata
or with GNU sort for -s:
$ sort -ust, -k1,1 new agg
1,olddata
2,changeddata
3,changeddata
4,newdata
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With