
Join gives warning "file1 is not in sorted order"

Tags: bash, unix

I was testing a legacy script under a newer version of bash, 4.1.2(1)-release, and encountered this warning in the console:

join: file 1 is not in sorted order
join: file 2 is not in sorted order

I am quite sure that both of the files are sorted. The files actually merged properly.

Below is the script:

cat $FILE1_PATH'.processed.1' | cut -d'|' -f4,8 | sort | uniq -u  > $FILE1_PATH.'processed.2'
cat $FILE2_PATH'.processed.1' | cut -d'|' -f1,8 | sort | uniq -u > $FILE2_PATH.'processed.2'
join -t$'|' -1 1 -2 1 $FILE1_PATH.'processed.2' $FILE2_PATH.'processed.2' > $MERGEFILE_PATH

The job of this script:

  1. extract fields 4 and 8 from file 1
  2. extract fields 1 and 8 from file 2
  3. combine the extracted fields, using join key file1.field4 = file2.field1
  4. remove any duplicates.

FILE1.processed.2 :

21VIANET GP INC|GOV
ABN|ABN1
ABN|ABN2
ABOC|ABOC1
ABOC|ABOC1
ABOC|ABOC2
....

FILE2.processed.2 :

ABN|Banks
ABOC|Pharmaceuticals
GOV|Government Agency 
....

OUTPUT:

GOV|21VIANET GP INC|Government Agency
ABN|ABN1|Banks
ABN|ABN2|Banks
ABOC|ABOC1|Pharmaceuticals
ABOC|ABOC2|Pharmaceuticals  
....

Running the same script under bash 3.2.25(1)-release gives no warning. Any idea how to get rid of the warning?

UPDATE: It seems the warning is caused by these lines in the input files:

ADBC|Banks 
ADB|Banks

join expects ADBC to come after ADB, like this:

ADB|Banks
ADBC|Banks

I tried changing my sort command from sort -u to sort -t$'|' -k1 (sort on the first field), but it still does not work.

Rudy asked Oct 19 '25 14:10


1 Answer

The suggestion in the join man page is to use sort -k 1b,1 when you're joining on field 1. (It says "when join has no options" but as far as field selection is concerned, your join is equivalent to no options. -1 1 and -2 1 are the defaults.) You can add -t '|' to that and it will match your join perfectly.
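
Applied to a pipeline like the one in the question, that suggestion might look like the sketch below. The temporary directory, the file names, and the sample rows are made up for illustration, and LC_ALL=C is an assumption chosen so that sort and join both compare bytes. The point is only that sort and join are given the same separator and the same key field.

```shell
#!/bin/sh
# Assumption: force the C locale so sort and join agree on ordering.
export LC_ALL=C

# Hypothetical sample inputs, written in whole-line byte order,
# which is not the order join expects for field 1.
tmpdir=$(mktemp -d)
printf '%s\n' 'ADBC|X1' 'ADB|X2' > "$tmpdir/file1"
printf '%s\n' 'ADBC|Banks' 'ADB|Banks' > "$tmpdir/file2"

# Sort each file on field 1 only, with the same separator join will use.
sort -t '|' -k 1b,1 "$tmpdir/file1" > "$tmpdir/file1.sorted"
sort -t '|' -k 1b,1 "$tmpdir/file2" > "$tmpdir/file2.sorted"

# Join on field 1 of both files.
merged=$(join -t '|' -1 1 -2 1 "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
printf '%s\n' "$merged"

# Remove the scratch files.
rm -rf "$tmpdir"
```

With these two sample files the ADB row should come out before the ADBC row. Lines with equal keys are still ordered by the last-resort whole-line comparison, so identical lines stay adjacent and a following uniq should behave as it did before.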

-k1 means all fields from 1 to the end. -k1,1 means just field 1. The b is necessary if you have leading whitespace and want to ignore it. sort syntax is weird. And this is after POSIX redesigned it to try to make it sensible. If you ever write a sort command that doesn't look complicated, it's probably not doing what you wanted.
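
To see the difference between the two key specifications on the pair of lines from the question, a small experiment along these lines may help. LC_ALL=C is an assumption (plain byte comparison, where '|' sorts after every uppercase letter), and the variable names are only illustrative.

```shell
#!/bin/sh
# Assumption: C locale, so comparison is byte by byte.
export LC_ALL=C

# The two problem lines from the question.
input=$(printf '%s\n' 'ADBC|Banks' 'ADB|Banks')

# -k 1: the key runs from field 1 to the end of the line,
# so "ADBC|Banks" is compared with "ADB|Banks" as whole strings.
whole_key=$(printf '%s\n' "$input" | sort -t '|' -k 1)

# -k 1,1: the key is field 1 alone, so "ADBC" is compared with "ADB".
field_key=$(printf '%s\n' "$input" | sort -t '|' -k 1,1)

printf 'with -k 1:\n%s\n' "$whole_key"
printf 'with -k 1,1:\n%s\n' "$field_key"
```

With -k 1 the fourth byte decides the order: 'C' is compared against '|' and wins, so ADBC lands first. With -k 1,1 the shorter key ADB is a prefix of ADBC and sorts first, which is the order join expects.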

Add --debug to your sort command to see what it's using as a key. With a sample file containing these lines:

ADBC|Banks
ADB|Banks
 ADBC|Banks

you can see the effect of various -k options:

$ sort -s -t '|' -k 1 --debug file
sort: using simple byte comparison
 ADBC|Banks
___________
ADBC|Banks
__________
ADB|Banks
_________
$ sort -s -t '|' -k 1,1 --debug file
sort: using simple byte comparison
 ADBC|Banks
_____
ADB|Banks
___
ADBC|Banks
____
$ sort -s -t '|' -k 1b,1 --debug file
sort: using simple byte comparison
ADB|Banks
___
ADBC|Banks
____
 ADBC|Banks
 ____

Now you're probably wondering about the -s I threw in there. Without it, there is a default last-resort comparison of the whole line as a string, which applies to lines with equal keys. That's not normally a problem and you probably don't need to use -s. It's just that when using --debug, the last-resort comparison clutters the list so I like to use -s to get rid of it.
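
The last-resort comparison can be observed directly with two lines that share a key. This is only a sketch: the rows are borrowed from the question's sample data, LC_ALL=C is an assumption, and -s is an extension found in GNU and BSD sort rather than something POSIX requires.

```shell
#!/bin/sh
# Assumption: C locale, so comparison is byte by byte.
export LC_ALL=C

# Two lines with the same field 1, given in reverse whole-line order.
input=$(printf '%s\n' 'ABN|ABN2' 'ABN|ABN1')

# Without -s: the keys tie, so sort falls back to comparing whole lines.
default_order=$(printf '%s\n' "$input" | sort -t '|' -k 1,1)

# With -s: the keys tie and the input order is preserved.
stable_order=$(printf '%s\n' "$input" | sort -s -t '|' -k 1,1)

printf 'without -s:\n%s\n' "$default_order"
printf 'with -s:\n%s\n' "$stable_order"
```

Either ordering is acceptable to join here, since both lines carry the same key in field 1.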

