I have a list of data.frames that needs a very specific kind of duplicate removal, and the condition is different for each data.frame. For the first list element, I want complete duplicate removal. For the second, I need to find the rows that appear more than twice (freq > 2) and keep only one copy of each. For the third, I need to find the rows that appear more than three times (freq > 3) and keep two copies of each. I am looking for a programmatic, dynamic solution to this data manipulation task. I tried to solve it myself but couldn't get my desired output. Is there an efficient way to do this?
Reproducible data.frame list:
myList <- list(
  bar = data.frame(start.pos=c(9,19,34,54,70,82,136,9,34,70,136,9,82,136),
                   end.pos=c(14,21,39,61,73,87,153,14,39,73,153,14,87,153),
                   pos.score=c(48,6,9,8,4,15,38,48,9,4,38,48,15,38)),
  cat = data.frame(start.pos=c(7,21,21,72,142,7,16,21,45,72,100,114,142,16,72,114),
                   end.pos=c(10,34,34,78,147,10,17,34,51,78,103,124,147,17,78,124),
                   pos.score=c(53,14,14,20,4,53,20,14,11,20,7,32,4,20,20,32)),
  foo = data.frame(start.pos=c(12,12,12,58,58,58,118,12,12,44,58,102,118,12,58,118),
                   end.pos=c(36,36,36,92,92,92,139,36,36,49,92,109,139,36,92,139),
                   pos.score=c(48,48,48,12,12,12,5,48,48,12,12,11,5,48,12,5))
)
Because myList is the output of a custom function, the data.frames can't be taken out of the list. How can I apply this kind of duplicate removal when the input is a list of data.frames?
My desired output is as follows:
expectedList <- list(
  bar = data.frame(start.pos=c(9,19,34,54,70,82,136),
                   end.pos=c(14,21,39,61,73,87,153),
                   pos.score=c(48,6,9,8,4,15,38)),
  cat = data.frame(start.pos=c(7,21,72,142,7,16,45,100,114,142,16,114),
                   end.pos=c(10,34,78,147,10,17,51,103,124,147,17,124),
                   pos.score=c(53,14,20,4,53,20,11,7,32,4,20,32)),
  foo = data.frame(start.pos=c(12,12,44,58,58,118,102,118,118),
                   end.pos=c(36,36,49,92,92,139,109,139,139),
                   pos.score=c(48,48,12,12,12,5,11,5,5))
)
Edit:
In the second data.frame, cat, I want to look up the rows that appear three times and keep each of them only once; if a row appears twice, I don't remove anything.
In the third data.frame, foo, I want to look up the rows that appear more than three times and keep two copies of each. That is why I need a different duplicate removal rule for each data.frame.
How can I get my desired data.frame list? Thanks a lot!
We can use Map to subset the rows of each list element with a row index built from the corresponding threshold in the vector 1:3. Each data.frame in the list is converted to a data.table with setDT(x) and grouped by the columns 'start.pos', 'end.pos' and 'pos.score'. Within each group, .N gives the number of rows, and an if/else picks which positions to keep: the first max(y - 1, 1) rows when the group has more than y rows, otherwise all of them. .I turns those positions into row numbers of the full table. We then extract that index column ($V1) and use it to subset the dataset.
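library(data.table)
# y is the per-element threshold (1, 2, 3); a group with more than y rows keeps max(y - 1, 1) of them
res <- Map(function(x, y) setDT(x)[x[, .I[if (.N > y) seq_len(pmax(y - 1, 1))
                                        else seq_len(.N)], .(start.pos, end.pos, pos.score)]$V1], myList, 1:3)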
sapply(res, nrow)
#bar cat foo
# 7 12 9
sapply(expectedList, nrow)
#bar cat foo
#7 12 9
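For comparison, the same rule can be sketched in base R without data.table. This is only an illustration under stated assumptions: keep_rows() and the toy list are hypothetical names that are not part of the question or the answer above, and the row key is built by pasting all columns together, which assumes the values print unambiguously as text. Unlike the grouped data.table index, this version keeps the surviving rows in their original order.

```r
# Hypothetical helper (not from the original answer).
# threshold: group size above which trimming applies (1, 2, 3 in the question).
keep_rows <- function(df, threshold) {
  # One key per row, so identical rows fall into the same group.
  key <- do.call(paste, c(df, sep = "\r"))
  # Position of each row within its group of identical rows.
  idx <- ave(seq_along(key), key, FUN = seq_along)
  # Size of the group each row belongs to.
  n <- ave(seq_along(key), key, FUN = length)
  # Oversized groups keep threshold - 1 rows, but never fewer than 1.
  limit <- max(threshold - 1, 1)
  df[n <= threshold | idx <= limit, , drop = FALSE]
}

# Small toy input (made up for illustration).
toy <- list(
  a = data.frame(x = c(1, 1, 2, 1), y = c(5, 5, 6, 5)),
  b = data.frame(x = c(1, 1, 1, 2, 2), y = c(5, 5, 5, 6, 6)),
  c = data.frame(x = c(1, 1, 1, 1, 2, 2, 2), y = c(5, 5, 5, 5, 6, 6, 6))
)

res_base <- Map(keep_rows, toy, 1:3)
sapply(res_base, nrow)
#a b c
#2 3 5
```

With the data from the question, Map(keep_rows, myList, 1:3) should apply the same per-element thresholds as the data.table version.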