Surely this is not intended?  Is this something that happens in other parts of dplyr's functionality and should I be concerned?  I love the performance and hate data.table syntax.  Is there an alternative to dplyr and data.table that is currently safe to use and still high performance?
A <- structure(list(ORDER = c(30305720L, 30334659L, 30379936L, 
                     30406397L, 30407697L, 30431950L), 
                    COST = c("0", "", "11430.52", "20196.279999999999", "0", "10445.99")), 
               .Names = c("ORDER", "COST"), 
               row.names = c(NA, 6L), 
               class = "data.frame")
B <- structure(list(ORDER = c(30334659, 30379936, 30406397, 30407697, 30431950), 
                    AREA = c(0, 2339, 2162, 23040, 475466)), 
               .Names = c("ORDER", "AREA"), 
               row.names = c(4L, 8L, 11L, 12L, 15L), 
               class = c("tbl_df", "tbl", "data.frame"))
Garbage results:
left_join(A, B)
  ORDER    COST                 AREA
1 30305720                  0   NA
2 30334659                      NA
3 30379936           11430.52   NA
4 30406397 20196.279999999999   NA
5 30407697                  0   NA
6 30431950           10445.99   NA
Effective results:
merge(A, B, all.x=T, all.y=F)
     ORDER               COST   AREA
1 30305720                  0     NA
2 30334659                         0
3 30379936           11430.52   2339
4 30406397 20196.279999999999   2162
5 30407697                  0  23040
6 30431950           10445.99 475466
I posted something similar the other day. I think what you need to do is to have ORDER as numeric (or possibly the other way around). A has ORDER has integer. But B has ORDER as numeric. At the moment, dplyr asks you to have group-by variables in the same class. I received a comment from an SO user saying that this is something Hadley and his team has been working on now. This issue will be fixed in the future.
A$ORDER <- as.numeric(A$ORDER)
left_join(A,B, by = "ORDER")
     ORDER               COST   AREA
1 30305720                  0     NA
2 30334659                         0
3 30379936           11430.52   2339
4 30406397 20196.279999999999   2162
5 30407697                  0  23040
6 30431950           10445.99 475466
UPDATE After exchanging comments with thelatemail, I decided to add more observations here.
CASE 1: Treat ORDER as numeric
A$ORDER <- as.numeric(A$ORDER)
> left_join(A,B, by = "ORDER")
     ORDER               COST   AREA
1 30305720                  0     NA
2 30334659                         0
3 30379936           11430.52   2339
4 30406397 20196.279999999999   2162
5 30407697                  0  23040
6 30431950           10445.99 475466
> left_join(B,A, by = "ORDER")
Source: local data frame [5 x 3]
     ORDER   AREA               COST
1 30334659      0                   
2 30379936   2339           11430.52
3 30406397   2162 20196.279999999999
4 30407697  23040                  0
5 30431950 475466           10445.99
If you have ORDER as integer in both A and B, that works too.
CASE 2: Treat ORDER as integer and numeric
> left_join(A,B, by = "ORDER")
     ORDER               COST AREA
1 30305720                  0   NA
2 30334659                      NA
3 30379936           11430.52   NA
4 30406397 20196.279999999999   NA
5 30407697                  0   NA
6 30431950           10445.99   NA
> left_join(B,A, by = "ORDER")
Source: local data frame [5 x 3]
     ORDER   AREA               COST
1 30334659      0                   
2 30379936   2339           11430.52
3 30406397   2162 20196.279999999999
4 30407697  23040                  0
5 30431950 475466           10445.99
As suggested by thelatemail, integer/numeric combination does not work. But numeric/integer combination works.
Given these observations, it is safe to be consistent in group-by variable at the moment. Alternatively, merge() is the way to go. It can handle integer and numeric.
> merge(A,B, by = "ORDER", all = TRUE)
     ORDER               COST   AREA
1 30305720                  0     NA
2 30334659                         0
3 30379936           11430.52   2339
4 30406397 20196.279999999999   2162
5 30407697                  0  23040
6 30431950           10445.99 475466 
> merge(B,A, by = "ORDER", all = TRUE)
     ORDER   AREA               COST
1 30305720     NA                  0
2 30334659      0                   
3 30379936   2339           11430.52
4 30406397   2162 20196.279999999999
5 30407697  23040                  0
6 30431950 475466           10445.99
UPDATE2 (as of the 8th of November, 2014)
I am using a dev version of dplyr(dplyr_0.3.0.9000), which you can download from Github.
The issue above is now solved.
left_join(A,B, by = "ORDER")
#     ORDER               COST   AREA
#1 30305720                  0     NA
#2 30334659                         0
#3 30379936           11430.52   2339
#4 30406397 20196.279999999999   2162
#5 30407697                  0  23040
#6 30431950           10445.99 475466
From the dplyr documentation:
left_join()returns all rows from
x, and all columns fromxandy. Rows inxwith no match inywill haveNAvalues in the new columns. If there are multiple matches betweenxandy, all combinations of the matches are returned.
semi_join()returns all rows from
xwhere there are matching values iny, keeping just columns fromx.A semi join differs from an inner join because an inner join will return one row of
xfor each matching row ofy, where a semi join will never duplicate rows ofx.
Is semi_join() a valuable option for you?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With