I have two data frames, the first contains 9994 rows and the second contains 60431 rows. I want to merge the two data frames such that the merged data frame contains combined columns of both data frames but only contains 9994 rows.
However, I get more than 9994 rows upon merge. How can I make sure this does not happen?
df1 = readRDS('data1.RDS')
nrow(df1)
# [1] 9994
df2 = readRDS('data2.RDS')
nrow(df2)
# [1] 60431
df = merge(df1,df2,by=c("col1","col2"))
nrow(df)
# [1] 10057
df = merge(df1,df2,by=c("col1","col2"),all.x=TRUE)
nrow(df)
# [1] 10057
nrow(na.omit(df))
# [1] 10057
EDIT : Following akrun's comment. Yes, there were duplicates in the second data frame
nrow(unique(df2[,c("col1","col2")]))
# [1] 60263
nrow(df2)
# [1] 60431
How can I take only one row from a data frame if there are multiple for the same {col1,col2} combination. When I merge, I would like to have only 9994 rows.
The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each.
Use the left_join Function to Merge Two R Data Frames With Different Number of Rows. left_join is another method from the dplyr package. It takes arguments similar to the full_join function, but left_join extracts all rows from the first data frame and all columns from both of them.
To merge two data frames (datasets) horizontally, use the merge() function in the R language. To bind or combine rows in R, use the rbind() function. The rbind() stands for row binding.
The R merge function allows merging two data frames by common columns or by row names. This function allows you to perform different database (SQL) joins, like left join, inner join, right join or full join, among others.
This should work, be sure to sort df2 first so you select the right rows.
df = merge(
  df1,
  df2[!duplicated(df2[, c("col1","col2")], ],
  by=c("col1","col2"),
  all.x=TRUE
)
What happens here: I merge the two data frames by the columns we want to merge by, but I first select only the first occurrence of any combination of col1 and col2 from the second data.frame df2.
duplicated checks if lines are duplicated if called with a data.frame. I select col1 and col2 from df2, so duplicated returns TRUE for rows with the same col1 and col2 but differences in other cols. Then I select only the rows which are not duplicated. 
(Read the [-expressions carefully, and check the function calls from the inside out, to get the intermediate results)
edit: added explanation as suggested in comments
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With