I'm trying to parse a comma-separated (CSV) file. Imagine a list of books and their authors, where the second column is the author. Since the file has millions of rows (around 133M), I cannot do this with Python or Java (I mean, I can, but it takes way too long), so I decided to use bash or zsh (the default shell on the Mac).
I need to count how many books each author has, which means counting the occurrences of each unique value in the second column.
#!/bin/zsh
awk -v FS="," 'NR > 1 {
author_id = $2;
count[$author_id]++;
}
END {
PROCINFO["sorted_in_place"] = 1;
for (author_id, count) in count; {
printf "%d, %d\n", author_id, count;
}
}' "~/list_of_books_per_author.csv" | sort -nrk2,2 | head -n 10
I keep getting this error:
awk: syntax error at source line 7
context is
for >>> (author_id, <<<
awk: illegal statement at source line 7
awk: illegal statement at source line 7
I'm not sure what I'm doing wrong. How do you iterate over an associative array when you want both the key and the value?
Awk is completely its own language, and completely distinct from both Bash and Zsh (which between themselves are also two distinct, incompatible languages, though related). However, if you are learning to use the shell, you will probably also want to learn at least the basics of Awk (and sed), too.
Your attempt to loop over an associative array uses completely the wrong syntax. You want
for (author_id in count) {
    printf "%d, %d\n", count[author_id], author_id;
}
Note also how there should be no semicolon before the opening brace.
I'm guessing you also meant
count[author_id]++;
earlier in the script. ($author_id would use the integer in author_id as the index into the fields; so if author_id is 3, you were doing count[$3]++)
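Putting both fixes together, a minimal sketch of the whole script might look like the following. The sample data, the temporary file, and the function name top_authors are assumptions made up for illustration. The sketch keeps the question's column order (author first, then count), uses %s for the author in case the ID is not numeric, and passes -t, to sort so the count is the second comma-separated field.

```shell
#!/bin/sh
# Hypothetical sample data; the first row is a header, as in the question.
csv=$(mktemp "${TMPDIR:-/tmp}/books.XXXXXX")
cat > "$csv" <<'EOF'
book_id,author_id
1,42
2,7
3,42
4,42
5,7
6,99
EOF

# Count rows per author (column 2), skipping the header line.
top_authors() {
    awk -F, 'NR > 1 {
        count[$2]++                  # index by the field value itself
    }
    END {
        for (author_id in count) {   # loop over keys, then look up each value
            printf "%s, %d\n", author_id, count[author_id]
        }
    }' "$1" | sort -t, -nrk2,2 | head -n 10
}

result=$(top_authors "$csv")
printf '%s\n' "$result"
rm -f "$csv"
```

With this sample input, author 42 comes first with 3 books, followed by 7 with 2 and 99 with 1.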
Depending on how complex your CSV file is, Awk may or may not be adequate. It copes well with completely trivial CSV files, but is less ideal for complex ones with quoted literal commas and/or quoted literal newlines in the CSV data.
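To see the limitation concretely, here is a small sketch with a made-up record (the title and author are hypothetical) whose author field contains a quoted comma. Splitting on every comma gives three fields instead of two, and the second field is only a fragment of the author.

```shell
#!/bin/sh
# Hypothetical record: two CSV fields, the second one contains a quoted comma.
line='"Some Title","Doe, Jane"'

# Naive splitting on "," ignores the quotes entirely.
out=$(printf '%s\n' "$line" | awk -F, '{ print NF; print $2 }')
printf '%s\n' "$out"
```

This prints 3 followed by "Doe (with the leading quote), so a count keyed on $2 would be wrong for rows like this one.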
You are probably doing something wrong if you think Awk is going to be much faster than compiled Java. Python is probably a bit slower, but none of them should be catastrophically slow when it comes to reading one line at a time and spitting out a result. (But if you mean it would take longer for you to write a quick program for this in Java, that's probably true.)
As a further aside, you want ~/"list_of_books_per_author.csv"; at least in Bash, the tilde will not be expanded if it is in double quotes.
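A quick way to check the difference yourself; the variable names are only for illustration, and the file does not need to exist for this to work:

```shell
#!/bin/sh
# File name taken from the question; nothing is read from disk here.
quoted="~/list_of_books_per_author.csv"      # tilde stays a literal "~"
unquoted=~/"list_of_books_per_author.csv"    # tilde expands to the home directory

printf '%s\n' "$quoted"
printf '%s\n' "$unquoted"
```

The first line printed still starts with ~, while the second starts with the path of your home directory.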
You got an answer to the question you asked about your awk error, but regarding "what I'm doing wrong now": it sounds to me like you're interested in speed of execution, so I wouldn't do all that work in awk and then pipe it to sort. I'd sort first and then pipe it to awk if necessary. Consider this alternative approach:
$ cat file
foo,Clive Barker
bar,Danielle Steele
etc,Clive Barker
$ cut -d, -f2 file | sort | uniq -c
2 Clive Barker
1 Danielle Steele
If you want the output rows ordered by count then pipe the above to sort -n or sort -nr as you like:
$ cut -d, -f2 file | sort | uniq -c | sort -n
1 Danielle Steele
2 Clive Barker
$ cut -d, -f2 file | sort | uniq -c | sort -nr
2 Clive Barker
1 Danielle Steele
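The script in the question also skips a header row and keeps only the top 10. Assuming the real file has a header line (the title,author header below is made up for this sample), a sketch of the same pipeline with those two steps added could be:

```shell
#!/bin/sh
# Hypothetical sample file with a header row.
file=$(mktemp "${TMPDIR:-/tmp}/books.XXXXXX")
cat > "$file" <<'EOF'
title,author
foo,Clive Barker
bar,Danielle Steele
etc,Clive Barker
EOF

# tail -n +2 drops the header row; head -n 10 keeps the ten largest counts.
top=$(tail -n +2 "$file" | cut -d, -f2 | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$top"
rm -f "$file"
```

The amount of padding uniq -c puts in front of each count differs between implementations, so don't rely on the exact column width.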
If you want the output columns in a different order, pipe the above to either of these (after piping to sort -n or sort -nr if you like):
$ cut -d, -f2 file | sort | uniq -c |
sed 's/ *\([0-9]*\) \(.*\)/\2, \1/'
Clive Barker, 2
Danielle Steele, 1
$ cut -d, -f2 file | sort | uniq -c |
awk -v OFS=', ' '{n=$1; sub(/ +[0-9]+ /,""); print $0, n}'
Clive Barker, 2
Danielle Steele, 1
I doubt if you really want | sort -nrk2,2 from your script, btw, as that'd sort by the author's second name. You probably meant | sort -t, -nrk2,2.
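A small sketch of the -t, version on "name, count" rows; the third author is made up so that the sorted order differs from the input order:

```shell
#!/bin/sh
# Sample rows in "author, count" form; "Ann Zed" is a hypothetical author.
rows='Clive Barker, 2
Danielle Steele, 1
Ann Zed, 5'

# -t, makes the comma the field separator, so field 2 is the count.
by_count=$(printf '%s\n' "$rows" | sort -t, -nrk2,2)
printf '%s\n' "$by_count"
```

Here the rows come out as Ann Zed, Clive Barker, Danielle Steele, that is, ordered by the number after the comma.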