How to format data.frame?

Question

How can i reformat the data.frame df1 based on the GeneID. Have to group the table based on common GeneID. also i want to strsplit the position

df1 =

GeneID  Common Organism Name        Position    
    3   mouse   10090   Acadm   Chr5:26082574-26089291(-)   
    3   human   9606    ACADM   Chr5:15028950-15032998(-)   
    6   mouse   10090   Acat1   Chr5:25999022-26004798(-)   
    6   human   9606    ACAT1   Chr5:15471699-15477027(-)   
    7   human   9606    NLN Chr5:26257691-26264308(+)   
    8   mouse   10090   canct1  Chr5:14910122-14914899(-)   
    9   mouse   9606    Gm10220 Chr5:25936465-25943267(-)   
    9   mouse   9606    Gm10354 Chr5:25949797-25954344(-)   
    9   mouse   9606    Gm1979  Chr5:11594913-11599784(+)   
    9   human   10090   TRIL    Chr7:28953358-28958413(-)

expected out put

Gene.ID M.Gene  M.Chr M.start   M.end      H.Gene H.Chr H.start     H.end
    3   Acadm   5   26082574    26089291    ACADM   5   15028950    15032998
    6   Acat1   5   25999022    26004798    ACAT1   5   15471699    15477027
    7   NA      NA  NA          NA           NLN    5   26257691    26264308
    8   canct1  5   14910122    14914899     NA       NA    NA      NA
    9   Gm10220 5   25936465    25943267    TRIL    7   28953358    28958413
    9   Gm10354 5   25949797    25954344    TRIL    7   28953358    28958413
    9   Gm1979  5   1159491     11599784    TRIL    7   28953358    28958413
    9   Gm21149 5   11594913    11599784    TRIL    7   28953358    28958413

akrun · Accepted Answer

We could use the devel version of 'data.table' ie. v1.9.5. Instructions to install are here.

We change the 'data.frame' to 'data.table' (setDT(df1)). Using the tstrsplit, we split the 'Position' by all non-numeric characters ([^0-9]+) to create new columns ('Chr', 'start', 'end').

library(data.table)#v1.9.5+
DT <- setDT(df1)[, c('Chr', 'start', 'end') :=tstrsplit(Position, '[^0-9]+')[-1]]

Create an sequence column ('ind') grouped by 'GeneID' and 'Common'

DT[, ind:=1:.N, .(GeneID, Common)]

dcast from the devel version can take multiple value.var columns and change the 'long' format to 'wide' format. We can replace the NA values by the non-NA value in the dataset.

dcast(DT, GeneID+ind~substr(Common, 1, 1), value.var=names(DT)[c(4,6:8)])[,
 lapply(.SD, function(x) x[!is.na(x)]) , GeneID, .SDcols=h_Name:m_end]
#    GeneID h_Name  m_Name h_Chr m_Chr  h_start  m_start    h_end    m_end
#1:      3  ACADM   Acadm     5     5 15028950 26082574 15032998 26089291
#2:      6  ACAT1   Acat1     5     5 15471699 25999022 15477027 26004798
#3:      7    NLN      NA     5    NA 26257691       NA 26264308       NA
#4:      8     NA  canct1    NA     5       NA 14910122       NA 14914899
#5:      9   TRIL Gm10220     7     5 28953358 25936465 28958413 25943267
#6:      9   TRIL Gm10354     7     5 28953358 25949797 28958413 25954344
#7:      9   TRIL  Gm1979     7     5 28953358 11594913 28958413 11599784

Veerendra Gadekar · Answer

Another option using lapply

# using split method from akrun's answer
library(data.table)#v1.9.5+
DT <- setDT(df1)[, c('Chr', 'start', 'end') :=tstrsplit(Position, '[^0-9]+')[-1]]

out = setDF(Reduce(function(...) merge(..., by="GeneID", all = T),
            lapply(split(DT, DT$Common), 
            function(x) subset(x, select = -c(Common, Position, Organism)))))  

colnames(out) = gsub("x", "H", colnames(out)) 
colnames(out) = gsub("y", "M", colnames(out))

#> out
#  GeneID Name.H Chr.H  start.H    end.H  Name.M Chr.M  start.M    end.M
#1      3  ACADM     5 15028950 15032998   Acadm     5 26082574 26089291
#2      6  ACAT1     5 15471699 15477027   Acat1     5 25999022 26004798
#3      7    NLN     5 26257691 26264308    <NA>  <NA>     <NA>     <NA>
#4      8   <NA>  <NA>     <NA>     <NA>  canct1     5 14910122 14914899
#5      9   TRIL     7 28953358 28958413 Gm10220     5 25936465 25943267
#6      9   TRIL     7 28953358 28958413 Gm10354     5 25949797 25954344
#7      9   TRIL     7 28953358 28958413  Gm1979     5 11594913 11599784

How to format data.frame?

Tags:

r

bioinformatics

beginner

2 Answers

akrun

Veerendra Gadekar

Recent Activity

Donate For Us

How to format data.frame?

Tags:

r

bioinformatics

beginner

2 Answers

akrun

Veerendra Gadekar

Related questions

Recent Activity

Donate For Us