Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find optimal string content that minimizes the MSE of character count vectors with its reference string

I have the following reference sequence:

ref_seq      <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"

and this seed pattern string:

seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"

There are 10 wildcards (?) in that pattern.

Given this functions:

aa_count_normalized <- function(x) {
  AADict <- c(
    "A", "R", "N", "D", "C", "E", "Q", "G", "H",
    "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
  )
  AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
                 maxsum = 21
  ) / nchar(x)
  AAC
}

aa_count <- function(x) {
  AADict <- c(
    "A", "R", "N", "D", "C", "E", "Q", "G", "H",
    "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
  )
  AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
                 maxsum = 21
  ) 
  AAC
}

I can get :

# we need to normalize refseq_aa content with respect
# to the length of seed_pattern, to accommodate the length
# difference between the two.
> refseq_aa_content <- aa_count_normalized(ref_seq) * nchar(seed_pattern) 
> refseq_aa_content
        A         R         N         D         C         E 
0.0000000 4.7727273 1.3636364 0.0000000 0.6818182 0.0000000 
        Q         G         H         I         L         K 
2.7272727 3.4090909 2.0454545 0.6818182 2.0454545 1.3636364 
        M         F         P         S         T         W 
1.3636364 1.3636364 0.6818182 4.0909091 0.6818182 0.6818182 
        Y         V 
1.3636364 0.6818182  

What I want to do is to replace the wild cards of the seed pattern - while keeping the non-wildcards as it is - with combinations of residue taken from:

 AADict <- c(
        "A", "R", "N", "D", "C", "E", "Q", "G", "H",
        "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
      ) 

such that the mean squared error (MSE) of the final amino acid count of the seed sequence and the normalized reference sequence count is minimized.

With this MSE function:

mse <- function (ref, new_seq) {
  return(mean((ref - new_seq)^2))
}

and with this final seed sequences:

seed_final.1 <- aa_count("FKDHKHIDVKDRHRTRHLAKRQQGGGSSSY")
seed_final.2 <- aa_count("FKDHKHIDVKDRHRTRHLAKRQQQGGGSSY")
seed_final.3 <- aa_count("FKDHKHIDVKDRHRTRHLAKSSSGGGRRQQ") # onyambu's

I get

> mse(refseq_aa_content, seed_final.1 )
[1] 1.501446
> mse(refseq_aa_content, seed_final.2 )
[1] 1.63781
> mse(refseq_aa_content, seed_final.3 )
[1] 1.560537

The seed_final.1 is the optimal exact solution, because it has the lowest MSE. Namely the 10 ?s is to be replaced with:

G Q R S Y 
3 2 1 3 1 (total 10)

How can I create an efficient R code to return FKDHKHIDVKDRHRTRHLAKRQQGGGSSSY as the answer.

like image 276
scamander Avatar asked Sep 07 '25 16:09

scamander


1 Answers

You can model your problem as integer quadratic problem where you want to minimize:

sum(r^2) - 2 sum(z * r)

with constraints:

sum(r) = k

r[i] nonegative integer

where:

  • r[i] how many ith letters of AADict you need to add to seed_pattern
  • z[i] = n(y)/n(x) * x[i] - y[i]
  • x[i] counts of ith letter of AADict in ref_seq
  • y[i] counts of ith letter of AADict in seed_pattern
  • n(x) number of characters in ref_seq
  • n(y) number of charecters in seed_pattern
  • k number of wild card characters in seed_pattern

I din't manage to find mixed-integer quadratic solver in R (free one) so here is heuristic using DEoptimR:

ref_seq <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"
seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"

AADict <- c(
  "A", "R", "N", "D", "C", "E", "Q", "G", "H",
  "I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)

aminoSummary <- function(x){
  
  f <- factor(strsplit(x, split = "")[[1]], levels = AADict)
  
  list(
    l = nchar(x),
    k = sum(is.na(f)),
    z = table(f)
  )
}

x <- aminoSummary(ref_seq)
y <- aminoSummary(seed_pattern)

M <- length(AADict)

res <- DEoptimR::JDEoptim(
  lower = rep(0, M),
  upper = rep(y$k, M) + 1,
  fn = function(r, z, k){
    r <- floor(r)
    sum(r * r) - 2 * sum(z * r)
  },
  constr = function(r, z, k) sum(floor(r)) - k,
  meq = 1,
  z = as.vector(x$z * y$l / x$l - y$z),
  k = y$k
)

rep(AADict, floor(res$par))
[1] "R" "Q" "Q" "G" "G" "G" "S" "S" "S" "Y"
like image 165
det Avatar answered Sep 10 '25 07:09

det