How can i reformat the data.frame df1 based on the GeneID. Have to group the table based on common GeneID. also i want to strsplit the position
df1 =
GeneID Common Organism Name Position
3 mouse 10090 Acadm Chr5:26082574-26089291(-)
3 human 9606 ACADM Chr5:15028950-15032998(-)
6 mouse 10090 Acat1 Chr5:25999022-26004798(-)
6 human 9606 ACAT1 Chr5:15471699-15477027(-)
7 human 9606 NLN Chr5:26257691-26264308(+)
8 mouse 10090 canct1 Chr5:14910122-14914899(-)
9 mouse 9606 Gm10220 Chr5:25936465-25943267(-)
9 mouse 9606 Gm10354 Chr5:25949797-25954344(-)
9 mouse 9606 Gm1979 Chr5:11594913-11599784(+)
9 human 10090 TRIL Chr7:28953358-28958413(-)
expected out put
Gene.ID M.Gene M.Chr M.start M.end H.Gene H.Chr H.start H.end
3 Acadm 5 26082574 26089291 ACADM 5 15028950 15032998
6 Acat1 5 25999022 26004798 ACAT1 5 15471699 15477027
7 NA NA NA NA NLN 5 26257691 26264308
8 canct1 5 14910122 14914899 NA NA NA NA
9 Gm10220 5 25936465 25943267 TRIL 7 28953358 28958413
9 Gm10354 5 25949797 25954344 TRIL 7 28953358 28958413
9 Gm1979 5 1159491 11599784 TRIL 7 28953358 28958413
9 Gm21149 5 11594913 11599784 TRIL 7 28953358 28958413
We could use the devel version of 'data.table' ie. v1.9.5. Instructions to install are here.
We change the 'data.frame' to 'data.table' (setDT(df1)). Using the tstrsplit, we split the 'Position' by all non-numeric characters ([^0-9]+) to create new columns ('Chr', 'start', 'end').
library(data.table)#v1.9.5+
DT <- setDT(df1)[, c('Chr', 'start', 'end') :=tstrsplit(Position, '[^0-9]+')[-1]]
Create an sequence column ('ind') grouped by 'GeneID' and 'Common'
DT[, ind:=1:.N, .(GeneID, Common)]
dcast from the devel version can take multiple value.var columns and change the 'long' format to 'wide' format. We can replace the NA values by the non-NA value in the dataset.
dcast(DT, GeneID+ind~substr(Common, 1, 1), value.var=names(DT)[c(4,6:8)])[,
lapply(.SD, function(x) x[!is.na(x)]) , GeneID, .SDcols=h_Name:m_end]
# GeneID h_Name m_Name h_Chr m_Chr h_start m_start h_end m_end
#1: 3 ACADM Acadm 5 5 15028950 26082574 15032998 26089291
#2: 6 ACAT1 Acat1 5 5 15471699 25999022 15477027 26004798
#3: 7 NLN NA 5 NA 26257691 NA 26264308 NA
#4: 8 NA canct1 NA 5 NA 14910122 NA 14914899
#5: 9 TRIL Gm10220 7 5 28953358 25936465 28958413 25943267
#6: 9 TRIL Gm10354 7 5 28953358 25949797 28958413 25954344
#7: 9 TRIL Gm1979 7 5 28953358 11594913 28958413 11599784
Another option using lapply
# using split method from akrun's answer
library(data.table)#v1.9.5+
DT <- setDT(df1)[, c('Chr', 'start', 'end') :=tstrsplit(Position, '[^0-9]+')[-1]]
out = setDF(Reduce(function(...) merge(..., by="GeneID", all = T),
lapply(split(DT, DT$Common),
function(x) subset(x, select = -c(Common, Position, Organism)))))
colnames(out) = gsub("x", "H", colnames(out))
colnames(out) = gsub("y", "M", colnames(out))
#> out
# GeneID Name.H Chr.H start.H end.H Name.M Chr.M start.M end.M
#1 3 ACADM 5 15028950 15032998 Acadm 5 26082574 26089291
#2 6 ACAT1 5 15471699 15477027 Acat1 5 25999022 26004798
#3 7 NLN 5 26257691 26264308 <NA> <NA> <NA> <NA>
#4 8 <NA> <NA> <NA> <NA> canct1 5 14910122 14914899
#5 9 TRIL 7 28953358 28958413 Gm10220 5 25936465 25943267
#6 9 TRIL 7 28953358 28958413 Gm10354 5 25949797 25954344
#7 9 TRIL 7 28953358 28958413 Gm1979 5 11594913 11599784
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With