
Convert string arrays to data frame in R

Tags: dataframe, r

Suppose I have a character vector (string array) such as:

sa<-c("HLA:HLA00001 A*01:01:01:01 1098 bp",
      "HLA:HLA01244 A*01:01:02 546 bp",
      "HLA:HLA01971 A*01:01:03 895 bp")

What is the best way to convert it to a data frame like this:

  Seq          Type             Length
1 HLA:HLA00001 A*01:01:01:01    1098 bp
2 HLA:HLA01244 A*01:01:02       546 bp
3 HLA:HLA01971 A*01:01:03       895 bp
David Z asked Dec 07 '25 07:12



2 Answers

Using the ‹dplyr› and ‹tidyr› packages, this is trivial:

  1. Put data into a data_frame,
  2. separate columns:
library(dplyr)
library(tidyr)
data_frame(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE)
Source: local data frame [3 x 3]

           Seq          Type Length
         (chr)         (chr)  (int)
1 HLA:HLA00001 A*01:01:01:01   1098
2 HLA:HLA01244    A*01:01:02    546
3 HLA:HLA01971    A*01:01:03    895

This (intentionally) drops the unit from the last column, which is now redundant (as it will always be the same), and converts it to an integer. If you want to keep it, use extra = 'merge' instead.
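As a minimal sketch of the extra = 'merge' variant mentioned above: the snippet below assumes the tibble and tidyr packages are installed, and it uses tibble() because data_frame() has since been deprecated in favour of it. The name with_unit is only illustrative.

```r
library(tibble)
library(tidyr)

sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# extra = "merge" splits at most length(into) times, so the third piece
# keeps everything after the second space, including the unit.
with_unit <- separate(tibble(sa), sa, c("Seq", "Type", "Length"),
                      sep = " ", extra = "merge")

with_unit$Length
```

Because the unit stays attached, Length remains a character column here rather than an integer.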

You can further separate the Type column by the application of another ‹tidyr› function, quite similar to separate, but specifying which parts to match: extract. This function allows you to provide a regular expression (a must-learn tool if you don’t know it already!) that specifies which parts of a text to match. These parts are in parentheses here:

'(A\\*\\d{2}:\\d{2}):(.*)'

This means: extract two groups. The first group contains the string “A*” followed by two digits, “:”, and another two digits. The second group contains all the rest of the text, after a separating “:”. (I hope I’ve captured the specification of HLA alleles correctly; I’ve never worked with this type of data.)

Put together with the code from above:

data_frame(sa) %>%
    separate(sa, c('Seq', 'Type', 'Length'), sep = ' ', extra = 'drop', convert = TRUE) %>%
    extract(Type, c('Group', 'Allele'), regex = '(A\\*\\d{2}:\\d{2}):(.*)')
Source: local data frame [3 x 4]

           Seq   Group Allele Length
         (chr)   (chr)  (chr)  (int)
1 HLA:HLA00001 A*01:01  01:01   1098
2 HLA:HLA01244 A*01:01     02    546
3 HLA:HLA01971 A*01:01     03    895
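
If you want to see what the two capture groups match before handing the pattern to extract, the following base R sketch may help. It is not part of the tidyr pipeline; types and parts are illustrative names, and the values are copied from the Type column above.

```r
types <- c("A*01:01:01:01", "A*01:01:02", "A*01:01:03")
pattern <- "(A\\*\\d{2}:\\d{2}):(.*)"

# regexec() returns the full match plus one entry per capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(types, regexec(pattern, types, perl = TRUE))

# One row per input: column 1 is the full match, columns 2 and 3 the groups.
parts <- do.call(rbind, m)
parts[, 2]  # first group:  "A*01:01" for every row
parts[, 3]  # second group: "01:01" "02" "03"
```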
Konrad Rudolph answered Dec 09 '25 20:12



Use read.table, which will require some extra effort since you have the delimiter within the column that you want to keep together:

df <- read.table(text = sa, col.names = c("Seq", "Type", "Length", "Unit"))
df$Length <- paste(df$Length, df$Unit)
df[,-4]
#            Seq          Type  Length
# 1 HLA:HLA00001 A*01:01:01:01 1098 bp
# 2 HLA:HLA01244    A*01:01:02  546 bp
# 3 HLA:HLA01971    A*01:01:03  895 bp
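
A different base R route, offered only as a sketch and not what the answer above uses, is to avoid the re-pasting step by capturing the three fields with a regular expression via strcapture (available in R 3.4 and later). The pattern assumes the first two fields never contain spaces; df2 and length_bp are illustrative names.

```r
sa <- c("HLA:HLA00001 A*01:01:01:01 1098 bp",
        "HLA:HLA01244 A*01:01:02 546 bp",
        "HLA:HLA01971 A*01:01:03 895 bp")

# Two space-free fields, then everything else (which may contain spaces).
df2 <- strcapture("^(\\S+) (\\S+) (.*)$", sa,
                  proto = data.frame(Seq = character(),
                                     Type = character(),
                                     Length = character(),
                                     stringsAsFactors = FALSE))

# Optional: strip the unit to get a numeric length.
length_bp <- as.integer(sub(" bp$", "", df2$Length))
```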
Psidom answered Dec 09 '25 20:12



