
Awk: syntax for looping over associative array

Tags: bash, csv, zsh, awk

I'm trying to parse a comma-separated (CSV) file. Imagine a list of books and their authors, where the second column is the author. Since the file has millions of rows (around 133M), I cannot do this with Python or Java (I mean, I can, but it takes way too long), so I decided to use bash or zsh (the default shell on the Mac).

I need to count how many books each author has, which means counting the occurrences of each unique value in the second column.

#!/bin/zsh

awk -v FS="," 'NR > 1 {        
  author_id = $2;              
  count[$author_id]++;         
} 
END {                            
  PROCINFO["sorted_in_place"] = 1;  
  for (author_id, count) in count; {
    printf "%d, %d\n", author_id, count;  
  }
}' "~/list_of_books_per_author.csv" | sort -nrk2,2 | head -n 10

I keep getting this error:

awk: syntax error at source line 7
 context is
      for >>>  (author_id, <<<
awk: illegal statement at source line 7
awk: illegal statement at source line 7

I can't see what I'm doing wrong. How do you iterate over an associative array when you want both the key and the value?

user3049941 asked Mar 02 '26 00:03



2 Answers

Awk is completely its own language, and completely distinct from both Bash and Zsh (which between themselves are also two distinct, incompatible languages, though related). However, if you are learning to use the shell, you will probably also want to learn at least the basics of Awk (and sed), too.

Your attempt to loop over an associative array uses completely the wrong syntax. You want

  for (author_id in count) {
    printf "%d, %d\n", count[author_id], author_id;  
  }

Note also how there should be no semicolon before the opening brace.

I'm guessing you also meant

  count[author_id]++;

earlier in the script. ($author_id uses the value of author_id as a field number, so if author_id is 3, you were effectively doing count[$3]++.)
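Putting both corrections together, a minimal sketch of the whole pipeline might look like this. The function name count_authors is made up for illustration, and the sketch assumes a simple CSV with a header row and no quoted commas. It leaves out the PROCINFO line, since the output is piped through sort anyway, sorts on the first comma-separated field because the count is now printed first, and prints the author with %s so that non-numeric ids survive.

```shell
#!/bin/sh
# count_authors FILE (hypothetical helper name, not from the question):
# print "count, author_id" for the 10 most frequent values found in
# column 2 of a simple comma-separated file, skipping the header row.
count_authors() {
  awk -F, 'NR > 1 {
    count[$2]++                  # index by the field value itself
  }
  END {
    for (author_id in count) {   # the loop variable receives each key
      printf "%d, %s\n", count[author_id], author_id
    }
  }' "$1" | sort -t, -k1,1nr | head -n 10
}
```

It would be called as count_authors ~/"list_of_books_per_author.csv". Inside the loop, the value is always looked up as count[author_id]; Awk has no form of for that hands you key and value at once.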

Depending on how complex your CSV file is, Awk may or may not be adequate. It copes well with completely trivial CSV files, but is less ideal for complex ones with quoted literal commas and/or quoted literal newlines in the CSV data.
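To see the quoted-comma problem concretely, here is a small sketch. The helper name second_field and the sample row are invented for illustration; the point is only that a plain comma separator knows nothing about CSV quoting.

```shell
#!/bin/sh
# second_field (hypothetical helper): print column 2 of each input line,
# splitting on every comma regardless of quoting.
second_field() {
  awk -F, '{ print $2 }'
}

# The author field is quoted and contains a comma, so it is cut in two
# and only the part before the inner comma comes back.
printf '%s\n' 'b1,"Barker, Clive"' | second_field
```

This prints "Barker (with the leading quote) rather than the full name, so counts keyed on that column would be wrong for any such row.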

You are probably doing something wrong if you think Awk is going to be much faster than compiled Java. Python is probably a bit slower, but none of them should be catastrophically slow when it comes to reading one line at a time and spitting out a result. (But if you mean it would take longer for you to write a quick program for this in Java, that's probably true.)

As a further aside, you want ~/"list_of_books_per_author.csv"; at least in Bash, the tilde will not be expanded if it is in double quotes.
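The difference is easy to check for yourself. This is only a sketch using two throwaway variable names (quoted and unquoted); it assumes HOME is set, as it is in any normal login session.

```shell
#!/bin/sh
# Tilde inside double quotes: kept as a literal "~" character.
quoted="~/list_of_books_per_author.csv"

# Tilde outside the quotes, remainder quoted: replaced by the home directory.
unquoted=~/"list_of_books_per_author.csv"

printf '%s\n%s\n' "$quoted" "$unquoted"
```

The first line printed still starts with ~, which Awk would then try to open as a directory literally named "~"; the second starts with your home directory.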

tripleee answered Mar 03 '26 17:03



You got an answer to the question you asked about your awk error. As for "what I'm doing wrong now": it sounds like you care about speed of execution, so I wouldn't do all that work in awk and then pipe it to sort. I'd sort first and then pipe to awk if necessary. Consider this alternative approach:

$ cat file
foo,Clive Barker
bar,Danielle Steele
etc,Clive Barker

$ cut -d, -f2 file | sort | uniq -c
      2 Clive Barker
      1 Danielle Steele

If you want the output rows ordered by count then pipe the above to sort -n or sort -nr as you like:

$ cut -d, -f2 file | sort | uniq -c | sort -n
      1 Danielle Steele
      2 Clive Barker

$ cut -d, -f2 file | sort | uniq -c | sort -nr
      2 Clive Barker
      1 Danielle Steele

If you want the output columns in a different order, pipe the above to either of these (after piping to sort -n or sort -nr if you like):

$ cut -d, -f2 file | sort | uniq -c |
    sed 's/ *\([0-9]*\) \(.*\)/\2, \1/'
Clive Barker, 2
Danielle Steele, 1

$ cut -d, -f2 file | sort | uniq -c |
    awk -v OFS=', ' '{n=$1; sub(/ +[0-9]+ /,""); print $0, n}'
Clive Barker, 2
Danielle Steele, 1

I doubt if you really want | sort -nrk2,2 from your script, btw, as that'd sort by the author's second name. You probably meant | sort -t, -nrk2,2.
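As a rough illustration of that last point, the sketch below sorts "author, count" lines on the second comma-separated field. The helper name top_by_count is invented here, and the two sample lines are taken from the output shown earlier.

```shell
#!/bin/sh
# top_by_count (hypothetical helper): order "author, count" lines by the
# count, highest first. -t, makes the comma the field separator, so the
# key is the text after the comma; -n skips the blank in front of it.
top_by_count() {
  sort -t, -k2,2nr
}

printf '%s\n' 'Danielle Steele, 1' 'Clive Barker, 2' | top_by_count
```

With -t, in place, the space inside the author's name no longer counts as a field boundary, so Clive Barker, 2 comes out first.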

Ed Morton answered Mar 03 '26 17:03



