I was testing a legacy script under a newer version of bash, 4.1.2(1)-release, and got this warning on the console:
join: file 1 is not in sorted order
join: file 2 is not in sorted order
I am quite sure that both of the files are sorted. The files actually merged properly.
Below is the script:
cat $FILE1_PATH'.processed.1' | cut -d'|' -f4,8 | sort | uniq -u > $FILE1_PATH.'processed.2'
cat $FILE2_PATH'.processed.1' | cut -d'|' -f1,8 | sort | uniq -u > $FILE2_PATH.'processed.2'
join -t$'|' -1 1 -2 1 $FILE1_PATH.'processed.2' $FILE2_PATH.'processed.2' > $MERGEFILE_PATH
The job of this script:
FILE1.processed.2 :
21VIANET GP INC|GOV
ABN|ABN1
ABN|ABN2
ABOC|ABOC1
ABOC|ABOC1
ABOC|ABOC2
....
FILE2.processed.2 :
ABN|Banks
ABOC|Pharmaceuticals
GOV|Government Agency
....
OUTPUT:
GOV|21VIANET GP INC|Government Agency
ABN|ABN1|Banks
ABN|ABN2|Banks
ABOC|ABOC1|Pharmaceuticals
ABOC|ABOC2|Pharmaceuticals
....
Running the same script under bash version 3.2.25(1)-release gives no warning. Any idea how to get rid of the warning?
UPDATE: It seems the warning was caused by these lines in the input files:
ADBC|Banks
ADB|Banks
Join expects ADBC to be positioned after ADB, like below :
ADB|Banks
ADBC|Banks
I tried changing my sort command from sort -u to sort -t$'|' -k1 (sort on the first field), but it still does not work.
The suggestion in the join man page is to use sort -k 1b,1 when you're joining on field 1. (It says "when join has no options" but as far as field selection is concerned, your join is equivalent to no options. -1 1 and -2 1 are the defaults.) You can add -t '|' to that and it will match your join perfectly.
-k1 means all fields from 1 to the end. -k1,1 means just field 1. The b is necessary if you have leading whitespace and want to ignore it. sort syntax is weird. And this is after POSIX redesigned it to try to make it sensible. If you ever write a sort command that doesn't look complicated, it's probably not doing what you wanted.
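Applied to the two problem lines from the question, the suggestion might look like the sketch below. The scratch directory, the file names and the X1/X2 values are made up for illustration, and LC_ALL=C is an assumption added here so that sort and join are sure to agree on collation; the real script would sort $FILE1_PATH and $FILE2_PATH the same way.

```shell
# Assumption: force one byte-wise collation for both sort and join.
export LC_ALL=C

# Hypothetical scratch files standing in for the two .processed.2 inputs.
tmp=$(mktemp -d)
printf '%s\n' 'ADBC|X1' 'ADB|X2' > "$tmp/file1"
printf '%s\n' 'ADBC|Banks' 'ADB|Banks' > "$tmp/file2"

# Sort each input on field 1 only, using the same delimiter join will use.
sort -t '|' -k 1b,1 "$tmp/file1" > "$tmp/file1.sorted"
sort -t '|' -k 1b,1 "$tmp/file2" > "$tmp/file2.sorted"

# Join on field 1 of both files (the default, so -1 1 -2 1 is omitted).
merged=$(join -t '|' "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$merged"
```

With both inputs ordered by the join field alone, ADB comes before ADBC in each file, which is the order join expects.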
Add --debug to your sort command to see what it's using as a key. With a sample file containing these lines (note the leading space on the first one):
 ADBC|Banks
ADB|Banks
ADBC|Banks
you can see the effect of various -k options:
$ sort -s -t '|' -k 1 --debug file
sort: using simple byte comparison
 ADBC|Banks
___________
ADBC|Banks
__________
ADB|Banks
_________
$ sort -s -t '|' -k 1,1 --debug file
sort: using simple byte comparison
 ADBC|Banks
_____
ADB|Banks
___
ADBC|Banks
____
$ sort -s -t '|' -k 1b,1 --debug file
sort: using simple byte comparison
ADB|Banks
___
 ADBC|Banks
 ____
ADBC|Banks
____
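If --debug is not available, the difference between -k 1 and -k 1,1 can also be checked with sort -c, which exits non-zero when its input is not ordered by the given key. This is only a sketch: the scratch file is hypothetical and LC_ALL=C is assumed so that the comparison is byte-wise.

```shell
export LC_ALL=C                      # assumption: byte-wise collation
sample=$(mktemp)                     # hypothetical scratch file
printf '%s\n' 'ADBC|Banks' 'ADB|Banks' > "$sample"

# Key runs to end of line: 'C' (0x43) sorts before '|' (0x7C),
# so the ADBC line comes out first.
whole=$(sort -t '|' -k 1 "$sample")

# Key is field 1 only: "ADB" is a prefix of "ADBC", so ADB comes first.
field=$(sort -t '|' -k 1,1 "$sample")

# Check the whole-line result against the field-1 ordering join relies on.
if printf '%s\n' "$whole" | sort -c -t '|' -k 1,1 2>/dev/null; then
  whole_ok=yes
else
  whole_ok=no
fi
printf 'ordered by field 1 after -k 1? %s\n' "$whole_ok"
```

The check fails for the -k 1 output, which is the same disorder join complains about.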
Now you're probably wondering about the -s I threw in there. Without it, there is a default last-resort comparison of the whole line as a string, which applies to lines with equal keys. That's not normally a problem and you probably don't need to use -s. It's just that when using --debug, the last-resort comparison clutters the list so I like to use -s to get rid of it.
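A small sketch of what the last-resort comparison does to lines with equal keys. The two input lines are invented for the example, and LC_ALL=C is assumed for predictable ordering.

```shell
export LC_ALL=C   # assumption: byte-wise collation

# Both lines have the same key (field 1 = ADB) but differ afterwards.
# Default: ties are broken by comparing the whole line.
with_last_resort=$(printf '%s\n' 'ADB|Zinc' 'ADB|Banks' | sort -t '|' -k 1,1)

# With -s: ties keep their input order.
stable=$(printf '%s\n' 'ADB|Zinc' 'ADB|Banks' | sort -s -t '|' -k 1,1)

printf 'default:\n%s\n-s:\n%s\n' "$with_last_resort" "$stable"
```

Either ordering is fine as input to join, since only the join field has to be in order.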