I want to remove duplicates based on column 'User' but only the first instance where it appears.
DF:
User No
A 1
B 1
A 2
A 3
A 4
C 1
B 2
D 1
Result: (A1 and B1 removed)
User No
A 2
A 3
A 4
C 1
B 2
D 1
I've been unsuccessful with using the duplicated function.
Any help would be appreciated! Thanks!
If i understand correctly, this should work
library(dplyr)
dd %>% group_by(User) %>% filter(duplicated(User) | n()==1)
Here is an option using data.table. We convert the 'data.frame' to 'data.table' (setDT(DF)). Grouped by the 'User' column, we select all the rows except the first (tail(.SD, -1)) where .SD is Subset of Data.table. But, this will also remove the row if there is only a single row for a 'User' group. We can avoid that by using an if/else condition stating that if the number of rows are greater than 1 (.N>1), we remove the first row or else return the row (.SD).
library(data.table)
setDT(DF)[, if(.N>1) tail(.SD,-1) else .SD , by = User]
# User No
#1: A 2
#2: A 3
#3: A 4
#4: B 2
#5: C 1
#6: D 1
Or a similar option as in @MrFlick's dplyr code would be using a logical condition with duplicated and .N (number of rows). We create a column 'N' by checking 'User' groups that have a single observation (.N==1), in the next step, we subset the rows that have are either TRUE for N or is duplicated for the 'User'. The duplicated returns TRUE values for duplicate rows leaving the first value as FALSE.
setDT(DF)[DF[, N:=.N==1, by = User][, N|duplicated(User)]][,N:=NULL][]
Or a base R option would be using ave to get the logical index ('indx2') by checking if the length for each 'User' group is 1 or not. We can use this along with the duplicated as described above to subset the dataset.
indx2 <- with(DF, ave(seq_along(User), User, FUN=length)==1)
DF[duplicated(DF$User)|indx2,]
# User No
#3 A 2
#4 A 3
#5 A 4
#6 C 1
#7 B 2
#8 D 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With