Surely this is not intended? Is this something that happens in other parts of dplyr's functionality and should I be concerned? I love the performance and hate data.table syntax. Is there an alternative to dplyr and data.table that is currently safe to use and still high performance?
A <- structure(list(ORDER = c(30305720L, 30334659L, 30379936L,
30406397L, 30407697L, 30431950L),
COST = c("0", "", "11430.52", "20196.279999999999", "0", "10445.99")),
.Names = c("ORDER", "COST"),
row.names = c(NA, 6L),
class = "data.frame")
B <- structure(list(ORDER = c(30334659, 30379936, 30406397, 30407697, 30431950),
AREA = c(0, 2339, 2162, 23040, 475466)),
.Names = c("ORDER", "AREA"),
row.names = c(4L, 8L, 11L, 12L, 15L),
class = c("tbl_df", "tbl", "data.frame"))
Garbage results:
left_join(A, B)
ORDER COST AREA
1 30305720 0 NA
2 30334659 NA
3 30379936 11430.52 NA
4 30406397 20196.279999999999 NA
5 30407697 0 NA
6 30431950 10445.99 NA
Effective results:
merge(A, B, all.x=T, all.y=F)
ORDER COST AREA
1 30305720 0 NA
2 30334659 0
3 30379936 11430.52 2339
4 30406397 20196.279999999999 2162
5 30407697 0 23040
6 30431950 10445.99 475466
I posted something similar the other day. I think what you need to do is to have ORDER as numeric (or possibly the other way around). A has ORDER has integer. But B has ORDER as numeric. At the moment, dplyr asks you to have group-by variables in the same class. I received a comment from an SO user saying that this is something Hadley and his team has been working on now. This issue will be fixed in the future.
A$ORDER <- as.numeric(A$ORDER)
left_join(A,B, by = "ORDER")
ORDER COST AREA
1 30305720 0 NA
2 30334659 0
3 30379936 11430.52 2339
4 30406397 20196.279999999999 2162
5 30407697 0 23040
6 30431950 10445.99 475466
UPDATE After exchanging comments with thelatemail, I decided to add more observations here.
CASE 1: Treat ORDER as numeric
A$ORDER <- as.numeric(A$ORDER)
> left_join(A,B, by = "ORDER")
ORDER COST AREA
1 30305720 0 NA
2 30334659 0
3 30379936 11430.52 2339
4 30406397 20196.279999999999 2162
5 30407697 0 23040
6 30431950 10445.99 475466
> left_join(B,A, by = "ORDER")
Source: local data frame [5 x 3]
ORDER AREA COST
1 30334659 0
2 30379936 2339 11430.52
3 30406397 2162 20196.279999999999
4 30407697 23040 0
5 30431950 475466 10445.99
If you have ORDER as integer in both A and B, that works too.
CASE 2: Treat ORDER as integer and numeric
> left_join(A,B, by = "ORDER")
ORDER COST AREA
1 30305720 0 NA
2 30334659 NA
3 30379936 11430.52 NA
4 30406397 20196.279999999999 NA
5 30407697 0 NA
6 30431950 10445.99 NA
> left_join(B,A, by = "ORDER")
Source: local data frame [5 x 3]
ORDER AREA COST
1 30334659 0
2 30379936 2339 11430.52
3 30406397 2162 20196.279999999999
4 30407697 23040 0
5 30431950 475466 10445.99
As suggested by thelatemail, integer/numeric combination does not work. But numeric/integer combination works.
Given these observations, it is safe to be consistent in group-by variable at the moment. Alternatively, merge() is the way to go. It can handle integer and numeric.
> merge(A,B, by = "ORDER", all = TRUE)
ORDER COST AREA
1 30305720 0 NA
2 30334659 0
3 30379936 11430.52 2339
4 30406397 20196.279999999999 2162
5 30407697 0 23040
6 30431950 10445.99 475466
> merge(B,A, by = "ORDER", all = TRUE)
ORDER AREA COST
1 30305720 NA 0
2 30334659 0
3 30379936 2339 11430.52
4 30406397 2162 20196.279999999999
5 30407697 23040 0
6 30431950 475466 10445.99
UPDATE2 (as of the 8th of November, 2014)
I am using a dev version of dplyr(dplyr_0.3.0.9000), which you can download from Github.
The issue above is now solved.
left_join(A,B, by = "ORDER")
# ORDER COST AREA
#1 30305720 0 NA
#2 30334659 0
#3 30379936 11430.52 2339
#4 30406397 20196.279999999999 2162
#5 30407697 0 23040
#6 30431950 10445.99 475466
From the dplyr documentation:
left_join()returns all rows from
x, and all columns fromxandy. Rows inxwith no match inywill haveNAvalues in the new columns. If there are multiple matches betweenxandy, all combinations of the matches are returned.
semi_join()returns all rows from
xwhere there are matching values iny, keeping just columns fromx.A semi join differs from an inner join because an inner join will return one row of
xfor each matching row ofy, where a semi join will never duplicate rows ofx.
Is semi_join() a valuable option for you?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With