Aggregate CSV data based on unique ID using command-line tools [closed]

Question

I would like to aggregate data from one CSV file into another, using command-line tools such as bash, sed, awk etc. Each row in each file begins with a unique ID in column 1, and if a row in the destination matches this ID, the line should be replaced, otherwise it should be appended. The input data is not sorted, but the sort order of the result doesn't matter so files can be sorted as part of the script if it helps.

For example, given the current state of aggregate.csv:

1,olddata
3,olddata
2,olddata

And the file new.csv:

4,newdata
2,changeddata
3,changeddata

I would like aggregate.csv to come out as follows (in any sort order):

1,olddata
2,changeddata
3,changeddata
4,newdata

The lines may contain a large number of columns so replacing one cell at a time is not ideal. The CSVs are guaranteed not to contain quoted newlines, so a line-by-line search and replacing a whole line at a time is a valid approach.

Apologies if this is a duplicate but I can't find another question that exactly uses this CSV merging approach.

I have tried adapting the answers to this question but it required first generating the "template file" of all IDs by parsing both files line-by-line, sorting, removing duplicates, and saving - I hoped that a simpler method was possible.

This question has answers in sed and awk that I also replicated, and managed the regex-replacement part but not a method to append a new line to the file if a match did not exist.

Thanks!

markp-fuso · Accepted Answer

Assumptions:

replace entire lines
total file size is ~8-10 MB (so likely won't have any out-of-memory issues)
field #1 is unique within each file
no sorting requirements for output
input not guaranteed to be sorted

One awk solution:

awk -F"," '                      # input delimiter is comma ","

FNR==NR { a[$1]=$0 ; next }      # first file: store contents in array a[], using field #1 as index

        { if ( $1 in a )         # second file: if field #1 is an index in array a[] then
             { print a[$1]       # print the contents of said array entry and
               delete a[$1]      # remove entry from array
             }
          else                   # field #1 is not an index in array a[] so
             print $0            # print current line
        }

END     { for ( i in a )         # anything still in array a[] was not in second file so
              print a[i]         # print these entries
        }
' new.csv aggregate.csv          # order of input files is important !!!

This geneates:

1,olddata
3,changeddata
2,changeddata
4,newdata

NOTE: While the output appears to be ordered by field #1, this is merely a coincidence based on the ordering of the input data; there's nothing in this awk code that explicitly generated this 'ordered' output.

This will not overwrite the aggregate.csv file. While it's possible to use the GNU awk -i inplace option to overwrite aggregate.csv, this would require a change in processing to keep from overwriting new.csv. The easiest solution will likely be to direct this output to a new file and then OP can decide if said new file should be used to overwrite aggregate.csv.

Ed Morton · Answer

With any awk:

$ awk -F, '!seen[$1]++' new agg
4,newdata
2,changeddata
3,changeddata
1,olddata

or with GNU sort for -s:

$ sort -ust, -k1,1 new agg
1,olddata
2,changeddata
3,changeddata
4,newdata

Aggregate CSV data based on unique ID using command-line tools [closed]

Tags:

bash

join

command-line

sed

awk

Ian Renton

2 Answers

markp-fuso

Ed Morton

Recent Activity

Donate For Us

Aggregate CSV data based on unique ID using command-line tools [closed]

Tags:

bash

join

command-line

sed

awk

Ian Renton

2 Answers

markp-fuso

Ed Morton

Related questions

Recent Activity

Donate For Us