I have the following reference sequence:
ref_seq <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"
and this seed pattern string:
seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"
There are 10 wildcards (?) in that pattern.
Given this functions:
aa_count_normalized <- function(x) {
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
maxsum = 21
) / nchar(x)
AAC
}
aa_count <- function(x) {
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
AAC <- summary(factor(strsplit(x, split = "")[[1]], levels = AADict),
maxsum = 21
)
AAC
}
I can get :
# we need to normalize refseq_aa content with respect
# to the length of seed_pattern, to accommodate the length
# difference between the two.
> refseq_aa_content <- aa_count_normalized(ref_seq) * nchar(seed_pattern)
> refseq_aa_content
A R N D C E
0.0000000 4.7727273 1.3636364 0.0000000 0.6818182 0.0000000
Q G H I L K
2.7272727 3.4090909 2.0454545 0.6818182 2.0454545 1.3636364
M F P S T W
1.3636364 1.3636364 0.6818182 4.0909091 0.6818182 0.6818182
Y V
1.3636364 0.6818182
What I want to do is to replace the wild cards of the seed pattern
- while keeping the non-wildcards as it is - with combinations of residue taken from:
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
such that the mean squared error (MSE) of the final amino acid count of the seed sequence and the normalized reference sequence count is minimized.
With this MSE function:
mse <- function (ref, new_seq) {
return(mean((ref - new_seq)^2))
}
and with this final seed sequences:
seed_final.1 <- aa_count("FKDHKHIDVKDRHRTRHLAKRQQGGGSSSY")
seed_final.2 <- aa_count("FKDHKHIDVKDRHRTRHLAKRQQQGGGSSY")
seed_final.3 <- aa_count("FKDHKHIDVKDRHRTRHLAKSSSGGGRRQQ") # onyambu's
I get
> mse(refseq_aa_content, seed_final.1 )
[1] 1.501446
> mse(refseq_aa_content, seed_final.2 )
[1] 1.63781
> mse(refseq_aa_content, seed_final.3 )
[1] 1.560537
The seed_final.1
is the optimal exact solution, because it has the lowest MSE. Namely the 10 ?
s is to be replaced with:
G Q R S Y
3 2 1 3 1 (total 10)
How can I create an efficient R code to return FKDHKHIDVKDRHRTRHLAKRQQGGGSSSY
as the answer.
You can model your problem as integer quadratic problem where you want to minimize:
sum(r^2) - 2 sum(z * r)
with constraints:
sum(r) = k
r[i] nonegative integer
where:
r[i]
how many ith letters of AADict
you need to add to seed_pattern
z[i] = n(y)/n(x) * x[i] - y[i]
x[i]
counts of ith letter of AADict
in ref_seq
y[i]
counts of ith letter of AADict
in seed_pattern
n(x)
number of characters in ref_seq
n(y)
number of charecters in seed_pattern
k
number of wild card characters in seed_pattern
I din't manage to find mixed-integer quadratic solver in R (free one) so here is heuristic using DEoptimR
:
ref_seq <- "MGHQQLYWSHPRKFGQGSRSCRVTSNRHGLIRKYGLNMSRQSFR"
seed_pattern <- "FKDHKHIDVKDRHRTRHLAK??????????"
AADict <- c(
"A", "R", "N", "D", "C", "E", "Q", "G", "H",
"I", "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"
)
aminoSummary <- function(x){
f <- factor(strsplit(x, split = "")[[1]], levels = AADict)
list(
l = nchar(x),
k = sum(is.na(f)),
z = table(f)
)
}
x <- aminoSummary(ref_seq)
y <- aminoSummary(seed_pattern)
M <- length(AADict)
res <- DEoptimR::JDEoptim(
lower = rep(0, M),
upper = rep(y$k, M) + 1,
fn = function(r, z, k){
r <- floor(r)
sum(r * r) - 2 * sum(z * r)
},
constr = function(r, z, k) sum(floor(r)) - k,
meq = 1,
z = as.vector(x$z * y$l / x$l - y$z),
k = y$k
)
rep(AADict, floor(res$par))
[1] "R" "Q" "Q" "G" "G" "G" "S" "S" "S" "Y"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With