Efficiently reformat column entries in large data set in R

Tags: r

I have a large (6 million row) table of values that I believe needs to be reformatted before it can be used for comparison to my data set. The table has 3 columns that I care about. The first column contains nucleotide base changes, in the form of C>G, A>C, A>G, etc. I'd like to split these into two separate columns. The second column has the chromosome and base position, formatted as 10:130448, 2:40483, 5:30821291, etc. I would also like to split this into two columns. The third column has the allelic fraction in a number of sample populations, formatted like .02/.03/.20. I'd like to extract the third fraction into a new column.

The problem is that the code I have written is currently extremely slow. It looks like it will take about a day and a half just to run. Is there something I'm missing here? Any suggestions would be appreciated.

My current code does the following: pos, change, and fraction each receive a list of the above values, split using strsplit. I then loop through the entire table, getting the ith element from each of those three lists and filling new columns with the values I want.

Once the database has been formatted, I should be able to easily check a large number of samples by chromosome number, base, reference allele, alternate allele, etc.

pos <- strsplit(total.esp$NCBI.Base, ":")
change <- strsplit(total.esp$Alleles, ">")
fraction <- strsplit(total.esp$'MAFinPercent(EA/AA/All)', "/")
for (i in 1:length(pos)){
    current <- pos[[i]]
    mutation <- change[[i]]
    af <- fraction[[i]]
    total.esp$chrom[i] <- current[1]
    total.esp$base[i] <- current[2]
    total.esp$ref[i] <- mutation[1]
    total.esp$alt[i] <- mutation[2]
    total.esp$af[i] <- af[3]

}

Thanks!

asked Jan 26 '26 17:01 by farail


1 Answer

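Here is a data.table solution. setDT(df1) converts the data.frame to a data.table by reference. We then loop over the columns of .SD (the Subset of Data.table) with lapply, applying tstrsplit to each column with a split pattern that matches any of the three delimiters ('[>:/]'); type.convert=TRUE converts the resulting pieces to numeric where possible. Because each tstrsplit call returns a list of vectors, unlist(..., recursive=FALSE) flattens the result by one level into a single list of columns.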

library(data.table)#v1.9.6+
setDT(df1)[, unlist(lapply(.SD, tstrsplit,
        split='[>:/]', type.convert=TRUE), recursive=FALSE)]
#   Alleles1 Alleles2 NCBI.Base1 NCBI.Base2 MAFinPercent1 MAFinPercent2
#1:        C        G         10     130448          0.02          0.03
#2:        A        C          2      40483          0.05          0.03
#3:        A        G          5   30821291          0.02          0.04
#   MAFinPercent3
#1:          0.20
#2:          0.04
#3:          0.03

NOTE: I assumed that the dataset has only these 3 columns. If there are more columns and you want to split only these 3, specify them with .SDcols (by index, e.g. .SDcols = 1:3, or by name), assign (:=) the output to new columns, and then keep only the columns needed in the output.
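
The NOTE above can be sketched roughly as follows. This is only an illustration under assumptions: the extra id column, the object name df2, and the new column names (ref, alt, chrom, base, af_ea, af_aa, af) are made up for the example, with af standing for the third ("All") fraction the question asks for.

```r
library(data.table)  # v1.9.6+ for tstrsplit(type.convert=)

# Toy data with one extra column (id) that must not be split.
# df2 and id are hypothetical, not part of the original data.
df2 <- data.frame(id = 1:3,
   Alleles = c('C>G', 'A>C', 'A>G'),
   NCBI.Base = c('10:130448', '2:40483', '5:30821291'),
   MAFinPercent = c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'),
   stringsAsFactors = FALSE)
setDT(df2)

# Columns to split, and illustrative names for the 2 + 2 + 3 pieces.
cols <- c("Alleles", "NCBI.Base", "MAFinPercent")
new_cols <- c("ref", "alt", "chrom", "base", "af_ea", "af_aa", "af")

# Split only the columns named in .SDcols and assign the pieces
# by reference, so the table is not copied.
df2[, (new_cols) := unlist(lapply(.SD, tstrsplit,
        split = '[>:/]', type.convert = TRUE), recursive = FALSE),
    .SDcols = cols]

# Drop the original columns and the two fractions that are not needed.
drop_cols <- c(cols, "af_ea", "af_aa")
df2[, (drop_cols) := NULL]
```

With type.convert=TRUE the chrom column comes out as integer for this toy data; if the real table also contains non-numeric chromosome labels, that column would presumably stay character instead.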

data

df1 <- data.frame(Alleles =c('C>G', 'A>C', 'A>G'), 
   NCBI.Base=c('10:130448', '2:40483', '5:30821291'), 
   MAFinPercent= c('.02/.03/.20', '.05/.03/.04', '.02/.04/.03'), 
   stringsAsFactors=FALSE)
answered Jan 28 '26 05:01 by akrun


