Let's say I have the following dataset:
library(data.table)
dt <- data.table(x = c(1, 2, 4, 5, 2, 3, 4))
> dt
   x
1: 1
2: 2
3: 4
4: 5
5: 2
6: 3
7: 4
I would like to cut off after the 4th row, since the first duplicate (the number 2) occurs in the row after it.
Expected Output:
   x
1: 1
2: 2
3: 4
4: 5
Needless to say, I am not looking for dt[1:4, ,][], as the real dataset is more "complicated".
I experimented with shift() and .I, but it didn't work.
One idea was: dt[x %in% dt$x[1:(.I - 1)], .SD, ][].
Perhaps we can use duplicated():
dt[seq_len(which(duplicated(x))[1]-1)]
#   x
#1: 1
#2: 2
#3: 4
#4: 5
Or, as @lmo suggested:
dt[seq_len(which.max(duplicated(dt))-1)]
Here's another option:
dt[seq_len(anyDuplicated(x)-1L)]
From the help files:
anyDuplicated(): an integer or real vector of length one with value the 1-based index of the first duplicate if any, otherwise 0.
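As a quick check of the return value described in the help text, here is a minimal sketch in base R. The variable names are illustrative only and not part of the original answers.

```r
# Vector from the question: the first repeated value (2) sits at position 5.
with_dup <- c(1, 2, 4, 5, 2, 3, 4)
first_dup <- anyDuplicated(with_dup)
first_dup

# A vector without repeats yields 0, so subtracting 1 gives -1,
# which seq_len() does not accept.
no_dup <- c(1, 2, 3)
none <- anyDuplicated(no_dup)
none
```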
But note that if the column has no duplicates, this approach runs into problems (and so do the other approaches currently posted): anyDuplicated(x) returns 0, so seq_len() is called with -1 and raises an error.
To take care of that, you can modify it to:
dt[if((ix <- anyDuplicated(x)-1L) > 0) seq_len(ix) else seq_len(.N)]
This returns all rows if no duplicate is found; otherwise it returns only the rows up to, but not including, the first duplicate.
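If this is needed in more than one place, the same logic could be wrapped in a small function. The sketch below is only one possible way to do it: rows_before_first_dup() is a hypothetical helper name, and dt2 is an invented example table without duplicates.

```r
library(data.table)

# Hypothetical helper: keep the rows that precede the first duplicate in `col`.
rows_before_first_dup <- function(dt, col) {
  ix <- anyDuplicated(dt[[col]])   # 0 when the column has no duplicate
  if (ix == 0L) return(dt)         # nothing to cut off
  keep <- seq_len(ix - 1L)         # row numbers before the first duplicate
  dt[keep]
}

dt  <- data.table(x = c(1, 2, 4, 5, 2, 3, 4))
dt2 <- data.table(x = c(1, 2, 3))  # assumed example with no duplicates

res_dup  <- rows_before_first_dup(dt, "x")   # keeps rows 1 to 4
res_none <- rows_before_first_dup(dt2, "x")  # keeps all 3 rows
```

Passing the column name as a string keeps the helper usable for any column, at the cost of losing the terse dt[...] one-liner.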