
Sort variable according to multiple regex substrings

I am trying to order a character vector of file names in R. Each file name contains three substrings that I want to order on. The file names are formatted as follows:

MAF001.incMHC.zPGS.S1
MAF002.incMHC.zPGS.S1
MAF003.incMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF002.incMHC.zPGS.S2
MAF003.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF001.noMHC.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.noMHC.zPGS.S2

I want to order this list first on the MAF substring, then the MHC substring, then the S substring, such that the order is:

MAF001.incMHC.zPGS.S1
MAF001.noMHC_incRS148.zPGS.S1
MAF001.noMHC.zPGS.S1
MAF001.incMHC.zPGS.S2
MAF001.noMHC_incRS148.zPGS.S2
MAF001.noMHC.zPGS.S2
MAF002.incMHC.zPGS.S1
MAF002.noMHC_incRS148.zPGS.S1
MAF002.noMHC.zPGS.S1
MAF002.incMHC.zPGS.S2
MAF002.noMHC_incRS148.zPGS.S2
MAF002.noMHC.zPGS.S2
MAF003.incMHC.zPGS.S1
MAF003.noMHC_incRS148.zPGS.S1
MAF003.noMHC.zPGS.S1
MAF003.incMHC.zPGS.S2
MAF003.noMHC_incRS148.zPGS.S2
MAF003.noMHC.zPGS.S2

I have had a play around with gsub after seeing the answer to this question regarding a single substring: R Sort strings according to substring

But I am not sure how to extend this idea to multiple substrings (of mixed character and numerical classes) within a string.

Lynsey asked Sep 15 '25 03:09

1 Answer

Here's a one-liner in base R:

bar <- foo[order(sapply(strsplit(foo, "\\."), function(x) paste(x[1], x[4])))]
head(data.frame(result = bar), 10)

                          result
1          MAF001.incMHC.zPGS.S1
2  MAF001.noMHC_incRS148.zPGS.S1
3           MAF001.noMHC.zPGS.S1
4          MAF001.incMHC.zPGS.S2
5  MAF001.noMHC_incRS148.zPGS.S2
6           MAF001.noMHC.zPGS.S2
7          MAF002.incMHC.zPGS.S1
8  MAF002.noMHC_incRS148.zPGS.S1
9           MAF002.noMHC.zPGS.S1
10         MAF002.incMHC.zPGS.S2

Explanation:

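  • Split each string on the literal . using strsplit: strsplit(foo, "\\.")
  • Extract elements 1 (the MAF part) and 4 (the S part) and combine them into a single sort key: paste(x[1], x[4])
  • Get the ordering of those keys using order; tied keys keep their original position in foo, which is why the three MHC variants come out as incMHC, noMHC_incRS148, noMHC
  • Index foo with that ordering to get the sorted vector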
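
Because the one-liner leaves the MHC part to tie-breaking, its result depends on how foo happens to be ordered on input. The following is a minimal sketch of a variant that sorts on all three parts explicitly. The helper name sort_files and its mhc_levels argument are made up for illustration, and the ranking of the MHC labels is an assumption read off the desired output in the question. It also assumes every name has exactly four dot-separated parts.

```r
# Rebuild the 18 file names from the question (same order as the posted data).
grid <- expand.grid(
  maf = sprintf("MAF%03d", 1:3),
  s   = c("S1", "S2"),
  mhc = c("incMHC", "noMHC_incRS148", "noMHC"),
  stringsAsFactors = FALSE
)
foo <- with(grid, paste(maf, mhc, "zPGS", s, sep = "."))

# Hypothetical helper: sort by MAF number, then S number, then MHC label.
# mhc_levels is an assumed ranking taken from the desired output.
sort_files <- function(x, mhc_levels = c("incMHC", "noMHC_incRS148", "noMHC")) {
  # One row per file name, one column per dot-separated part.
  parts <- do.call(rbind, strsplit(x, ".", fixed = TRUE))
  maf <- as.integer(sub("^MAF", "", parts[, 1]))  # numeric, so 10 sorts after 9
  s   <- as.integer(sub("^S", "", parts[, 4]))
  mhc <- match(parts[, 2], mhc_levels)            # position in the ranking
  x[order(maf, s, mhc)]
}

# Reverse the input to show the result does not depend on the input order.
sorted <- sort_files(rev(foo))
```

Converting the MAF and S parts to integers is a design choice: it keeps the ordering numeric if the numbering ever goes past the zero-padded width, where pasted string keys would sort as text.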

Data (foo):

foo <- c("MAF001.incMHC.zPGS.S1", "MAF002.incMHC.zPGS.S1", "MAF003.incMHC.zPGS.S1", 
"MAF001.incMHC.zPGS.S2", "MAF002.incMHC.zPGS.S2", "MAF003.incMHC.zPGS.S2", 
"MAF001.noMHC_incRS148.zPGS.S1", "MAF002.noMHC_incRS148.zPGS.S1", 
"MAF003.noMHC_incRS148.zPGS.S1", "MAF001.noMHC_incRS148.zPGS.S2", 
"MAF002.noMHC_incRS148.zPGS.S2", "MAF003.noMHC_incRS148.zPGS.S2", 
"MAF001.noMHC.zPGS.S1", "MAF002.noMHC.zPGS.S1", "MAF003.noMHC.zPGS.S1", 
"MAF001.noMHC.zPGS.S2", "MAF002.noMHC.zPGS.S2", "MAF003.noMHC.zPGS.S2"
)
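
For readers who prefer the gsub-style approach mentioned in the question, here is a rough sketch of how it might extend to several substrings: one regular expression with three capture groups, with sub() pulling out each group as its own sort key. The pattern and the variable names are illustrative assumptions based on the file names shown, not something taken from the original answer.

```r
# A few of the file names from the data above, deliberately out of order.
files <- c(
  "MAF002.noMHC.zPGS.S1",
  "MAF001.noMHC.zPGS.S2",
  "MAF001.noMHC_incRS148.zPGS.S1",
  "MAF001.incMHC.zPGS.S2",
  "MAF001.noMHC.zPGS.S1",
  "MAF001.incMHC.zPGS.S1"
)

# One pattern with three capture groups: MAF number, MHC label, S number.
pattern <- "^MAF([0-9]+)\\.([^.]+)\\.zPGS\\.S([0-9]+)$"
stopifnot(all(grepl(pattern, files)))  # fail early on unexpected names

# sub() replaces the whole match with a single captured group.
maf <- as.integer(sub(pattern, "\\1", files))
mhc <- sub(pattern, "\\2", files)
s   <- as.integer(sub(pattern, "\\3", files))

# Assumed ranking of the MHC labels, taken from the desired output.
mhc_rank <- match(mhc, c("incMHC", "noMHC_incRS148", "noMHC"))

# order() accepts several keys; later keys only break ties in earlier ones.
files_sorted <- files[order(maf, s, mhc_rank)]
```

Passing the keys to order() separately, rather than pasting them into one string, lets the numeric and character parts each sort according to their own type.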
pogibas answered Sep 16 '25 15:09
