If I want to use join on my Ubuntu, I need to first sort both files lexicographically (according to join --help), and only then join them:
tail -n +2 meta/201508_1 | sort -k 1b,1 > meta.txt
tail -n +2 keywords/copy | sort -k 1b,1 > keywords.txt
join meta.txt keywords.txt -1 1 -2 1 -t $'\t'
(I also remove the header from both of them using tail)
But instead of sorting files lexicographically, I would like to sort them numerically: the first column in both files is an ID.
tail -n +2 meta/201508_1 | sort -k1 -n > meta.txt
tail -n +2 keywords/copy.txt | sort -k1 -n > keywords.txt
And then join. But for join these files look unsorted:
join: meta.txt:10: is not sorted: 1023 301000 en
join: keywords.txt:2: is not sorted: 10 keyword1
If I add --nocheck-order to join, it doesn't join properly - it outputs just one line.
How do I join two files on their numerical ID in bash?
Sample (columns are tab-separated):
file 1
id volume lang
1 10 en
2 20 en
5 30 en
6 40 en
10 50 en
file 2
id keyword
4 kw1
2 kw2
10 kw3
1 kw4
3 kw5
desired output
1 kw4 10 en
2 kw2 20 en
10 kw3 50 en
Both of these work. The first one (sort -b is recommended on the Mac)
join <(sed 1d file1 | sort -b) <(sed 1d file2 | sort -b) | sort -n
the Linux man page recommends sort -k 1b,1
join <(sed 1d file1 | sort -k 1b,1) <(sed 1d file2 | sort -k 1b,1) | sort -n
In any case, you need to sort them lexicographically to join them. At the end you can still sort the result numerically.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With