In awk scripts I see several methods used to concatenate an array of strings with a separator: mainly a trinary (ternary) expression used in some way, or the separator being changed on the fly. Occasionally printf/sprintf is used, but I won't consider that here, as I'd be surprised if the function call were not expensive.
For example, given an awk array a of strings, with integer indices from 1 to max, and separator sep:
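(For reference, the printf/sprintf approach I'm excluding typically looks something like this sketch, which I haven't benchmarked:)

j = ""
for ( i=1; i<=max; i++ )
    j = sprintf( "%s%s%s", j, (i>1 ? sep : ""), a[i] )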
trinary:
j=""
for ( i=1; i<=max; i++ )
j = ( j ? j sep a[i] : a[i] )
modify:
j=s=""
for ( i=1; i<=max; i++ ) {
j = j s a[i]
s = sep
}
I have just come up with this method which I don't recall seeing before (ed: from subsequent comments, this turns out to be similar to a method used by awklib):
mine (for):
j = a[1]
for ( i=1; i<max; )
    j = j sep a[++i]
Naively, I would assume this to be more efficient, as it avoids both the per-iteration conditional of the trinary method and the redundant separator assignment of the modify method.
Another option I just thought of:
mine (while):
j=a[i=max]
while (i>1)
    j = a[--i] sep j
Running a simple benchmark of gawk and mawk on a mostly idle Ubuntu 20.04 laptop (9 trials each of 1,000,000 iterations of joining a 64-element array of 50-char strings) gave me the following times, in seconds:
gawk
| algorithm | min | max | median | mean |
|---|---|---|---|---|
| trinary | 13.863 | 15.214 | 14.148 | 14.330 |
| modify | 11.052 | 11.926 | 11.360 | 11.468 |
| mine (for) | 8.612 | 8.792 | 8.698 | 8.710 |
| mine (while) | 12.267 | 13.263 | 12.591 | 12.706 |
mawk
| algorithm | min | max | median | mean |
|---|---|---|---|---|
| trinary | 12.406 | 13.299 | 12.827 | 12.806 |
| modify | 12.975 | 13.744 | 13.085 | 13.294 |
| mine (for) | 11.641 | 12.397 | 12.233 | 12.151 |
| mine (while) | 8.400 | 9.083 | 8.693 | 8.693 |
busybox awk is much less performant. Using 100,000 iterations:
| algorithm | min | max | median | mean |
|---|---|---|---|---|
| trinary | 11.004 | 12.284 | 11.784 | 11.605 |
| modify | 11.708 | 12.628 | 11.879 | 12.071 |
| mine (for) | 10.695 | 11.642 | 10.776 | 11.021 |
| mine (while) | 9.244 | 10.167 | 9.409 | 9.549 |
original-awk is faster than busybox in this situation. Using 250,000 iterations:
| algorithm | min | max | median | mean |
|---|---|---|---|---|
| trinary | 12.750 | 13.525 | 12.840 | 13.014 |
| modify | 13.105 | 13.778 | 13.942 | 13.572 |
| mine (for) | 11.831 | 12.543 | 12.710 | 12.281 |
| mine (while) | 10.730 | 10.854 | 11.466 | 11.032 |
I guessed my second method might be even faster, since it does not need to dereference max inside the loop, but, intriguingly, although it runs noticeably faster than the other algorithms in most implementations, it runs poorly in gawk, where my first method is the clear winner.
Are there even better methods?
Is there a best idiom for awk join?
My benchmark code was just:
for v in gawk mawk "busybox awk" original-awk; do
    for m in 1 2 3 4; do
        echo ::: $v $m :::
        for n in 1 2 3 4 5 6 7 8 9; do
            time </dev/null $v -f aw$m-${v:0:1}
        done
        echo
    done
done
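The min/max/median/mean in the tables summarise the nine real times from each set of trials; a small sort | awk pipeline along these lines would produce them (my own sketch, assuming the real seconds have been collected one per line in a file such as times.txt; not part of the harness above):

sort -n times.txt | awk '
    { v[NR] = $1; sum += $1 }
    END {
        # v[] is already sorted, so min/max/median come straight from it
        median = (NR % 2) ? v[(NR+1)/2] : (v[NR/2] + v[NR/2+1]) / 2
        printf "min %.3f  max %.3f  median %.3f  mean %.3f\n", v[1], v[NR], median, sum/NR
    }'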
aw3-b: (others are similar)
BEGIN {
    a[ 1] = "a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]a[ 1]"
    a[ 2] = "a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]a[ 2]"
    # ... more lines ...
    a[63] = "a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]a[63]"
    a[64] = "a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]a[64]"
    max=64
    sep=FS
    for(n=1; n<=100000; n++) {
        j=a[1]
        for (i=1;i<max;)
            j = j sep a[++i]
    }
}
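The repetitive a[NN] assignments elided above all have the same shape, so they can be generated rather than typed; a throwaway awk script roughly like this (my own sketch, not part of the benchmark itself) emits all 64 of them:

BEGIN {
    # each value is the 5-character index label repeated 10 times, i.e. 50 chars
    for (i = 1; i <= 64; i++) {
        label = sprintf("a[%2d]", i)
        value = ""
        for (k = 1; k <= 10; k++)
            value = value label
        printf "a[%2d] = \"%s\"\n", i, value
    }
}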
j = a[1]
for ( i=2; i<=max; i++ )
    j = j sep a[i]
is the clearest and most common approach so IMHO it's what should be used even if a tiny performance improvement could be squeezed out of a slightly less clear script.
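If you want that loop packaged for reuse, it's easy to wrap in a function; here's a sketch (the name join and the trailing parameters, which are just the usual awk convention for local variables, are my own choices):

function join(a, max, sep,    i, j) {
    # a holds strings at indices 1..max; i and j are locals
    j = a[1]
    for (i = 2; i <= max; i++)
        j = j sep a[i]
    return j
}

Called as j = join(a, max, OFS), it keeps the clear loop in one place and leaves the calling code free of it.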
The time when you typically see a ternary being used is when printing the contents of the array as:
for ( i=1; i<=max; i++ )
    printf "%s%s", a[i], (i<max ? OFS : ORS)
instead of the slightly more efficient but lengthier alternative, which requires the printf formatting and its associated values to be written twice: once before the loop for the first index and again inside the loop for all of the other indices:
printf "%s", a[1]
for ( i=2; i<=max; i++ )
printf "%s%s", OFS, a[i]
print ""
Just yesterday I was surprised by a huge performance improvement from using a recursive function instead of a loop (see https://stackoverflow.com/a/68389244/1745001), so I thought I'd give that a try here, but it failed miserably:
$ head -50 tst*.awk
==> tstLoop.awk <==
BEGIN {
    max = 64
    for (i=1; i<=max; i++) {
        a[i] = sprintf("%50s",i)
    }
    for (n=1; n<=100000; n++) {
        j = a[1]
        for (i=2; i<=max; i++) {
            j = j OFS a[i]
        }
    }
}

==> tstRec.awk <==
function join(i){
    if( i == 1 ) return a[i]
    return join(i-1) OFS a[i]
}
BEGIN {
    max = 64
    for (i=1; i<=max; i++) {
        a[i] = sprintf("%50s",i)
    }
    for (n=1; n<=100000; n++) {
        j = join(max)
    }
}
$ time awk -f tstLoop.awk
real 0m2.596s
user 0m2.561s
sys 0m0.030s
$ time awk -f tstRec.awk
real 0m3.743s
user 0m3.733s
sys 0m0.000s
I do not know whether its overall performance is any better than the for loop, but it is cleaner for regular record-processing use cases, where the "array" is given implicitly by the records themselves:
awk 'FNR==1 {printf "%s", $1} FNR>1 {printf " %s", $1} END {printf "\n"}'
I understand that this is not what the author really asked for, but it will likely be what many people finding this post are actually looking for.
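If the separator needs to be something other than a single space, the same streaming idea works with a variable (a sketch; sep is just an assumed name here):

awk -v sep=', ' 'FNR>1 {printf "%s", sep} {printf "%s", $1} END {printf "\n"}'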