I'm trying to understand the underlying logic of how the result of a non-equi join in data.table is ordered within each level of the on-variable.
Just to make it clear from the start: I have no problem with the order itself, or to order the output in a desired way after the join. However, because I find the output from all other data.table operations highly consistent, I suspect there is a ordering pattern to be revealed in non-equi joins as well.
I will give two examples, where two different 'large' data sets are joined with a smaller. I have tried to describe the most obvious patterns in the output within each join, as well as instances where the pattern differs between the joins of the two data sets.
library(data.table)
# the first 'large' data set
d1 <- data.table(x = c(rep(c("b", "a", "c"), each = 3), c("a", "b")),
                 y = c(rep(c(1, 3, 6), 3), 6, 6),
                 id = 1:11) # to make it easier to track the original order in the output    
#     x y  id
# 1:  b 1   1
# 2:  b 3   2
# 3:  b 6   3
# 4:  a 1   4
# 5:  a 3   5
# 6:  a 6   6
# 7:  c 1   7
# 8:  c 3   8
# 9:  c 6   9
# 10: a 6  10
# 11: b 6  11
# the small data set
d2 <- data.table(id = 1:2, val = c(4, 2))   
#     id val
# 1:   1   4
# 2:   2   2
Non-equi join between the first large data set and the small, on = .(y >= val).
d1[d2, on = .(y >= val)]
#     x y  id  i.id
# 1:  b 4   3     1 # Row 1-5, first match: y >= val[1]; y >= 4
# 2:  a 4   6     1 # The rows within this match have the same order as the original data
# 3:  c 4   9     1 # and runs consecutively from first to last match
# 4:  a 4  10     1
# 5:  b 4  11     1
# 6:  b 2   2     2 # Row 6-13, second match: y >= val[2]; y >= 2 
# 7:  a 2   5     2 # The rows within this match do not have the same order as the original data
# 8:  c 2   8     2 # Rather, they seem to be come in chunks (6-8, 9-11, 12-13) 
                    # First chunk starts with the match with lowest index, y[2] 
# 9:  b 2   3     2  
# 10: a 2   6     2 
# 11: c 2   9     2 
# 12: a 2  10     2
# 13: b 2  11     2
The second 'large' data set:
d3 <- data.table(x = rep(c("a", "b", "c"), each = 3),
                 y = c(6, 1, 3),
                 id = 1:9)
#    x y id
# 1: a 6  1
# 2: a 1  2
# 3: a 3  3
# 4: b 6  4
# 5: b 1  5
# 6: b 3  6
# 7: c 6  7
# 8: c 1  8
# 9: c 3  9
Same non-equi join between the second large data set with the small:
d3[d2, on = .(y >= val)]
#    x y   id i.id
# 1: a 4   1     1 # Row 1-3, first match (y >= 4), similar to output above
# 2: b 4   4     1
# 3: c 4   7     1
# 4: a 2   3     2 # Row 4-9, second match (y >= 2).  
# 5: b 2   6     2 # Again, rows not consecutive.
# 6: c 2   9     2 # However, now the first chunk does not start with the match with lowest index,
                   # y[3] instead of y[1]
# 7: a 2   1     2 # y[1] appears after y[3]
# 8: b 2   4     2 # ditto
# 9: c 2   7     2
Can anyone explain the logic of (1) the order within each level of the on-variable, here especially within the second match, where original order of the data isn't kept in the result. And (2) why does the order between chunks within matches differ when the two different data sets are used?
Non-Equi Join matches the column values from different tables based on an inequality based on the operators like <,>,<=,>=,!= , BETWEEN, etc. Non-Equi Join in SQL retrieves data using any operator or condition except the equality condition.
A non equi join uses comparison operators like != , >, <, >= and <= to query data from two data sources. Like a normal join, it allows you to specify the cardinality and a join type (Inner, Left Outer, Right Outer) to query data from two data sources.
What is Non-Equijoin in Oracle? The nonequijoins is such a join which match column values from different tables based on an inequality (instead of the equal sign like >, <, >=, <= ) expression. The value of the join column in each row in the source table is compared to the corresponding values in the target table.
Thanks for catching this and reporting here on SO, and filing it on Github. This, should be fixed now in the current development version (v1.10.5 at the time of writing).
It should be available sometime soon on CRAN as v1.10.6.
From NEWS entry:
- Order of rows returned in non-equi joins were incorrect in certain scenarios as reported under #1991. This is now fixed. Thanks to @Henrik-P for reporting.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With