
Sequence coverage using R

I have a protein sequence of 100 amino acids (AA) that can be represented as a data.frame. Each AA has a position, and for now only the position matters:

Protein <- data.frame(AA = 1:100)

Then I have a data.frame of peptides from the protein (after digestion / sequence breakdown), with the initial and final AA positions of each peptide relative to the protein:

df <- data.frame(
Peptides = c("Peptide_A", "Peptide_B", "Peptide_C", "Peptide_D"), 
Initial.AA = c(1, 23, 59, 77), 
Final.AA = c(18, 58, 70, 100)
)

Output:

   Peptides Initial.AA Final.AA
1 Peptide_A          1       18
2 Peptide_B         23       58
3 Peptide_C         59       70
4 Peptide_D         77      100

Inspecting df, it's clear that some AA were not mapped (19:22 and 71:76, a total of 10 unmapped AA).

I would like to have as output the total percentage of mapped AA, which in this example is 90% (90 mapped AA across all the peptides / 100 protein AA).
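As a sanity check on the 90% figure, the arithmetic can be sketched directly from the peptide boundaries. This is only a rough check, not a general solution: it assumes the peptides never overlap, so their lengths can simply be added, and the names peptide_lengths and mapped_fraction are made up for illustration.

```r
# Peptide table from the question.
df <- data.frame(
  Peptides = c("Peptide_A", "Peptide_B", "Peptide_C", "Peptide_D"),
  Initial.AA = c(1, 23, 59, 77),
  Final.AA = c(18, 58, 70, 100)
)

# Residues per peptide, counting both end positions: 18, 36, 12, 24.
peptide_lengths <- df$Final.AA - df$Initial.AA + 1

# Assumption: peptides do not overlap, so the lengths can be summed.
# 100 is the protein length given in the question.
mapped_fraction <- sum(peptide_lengths) / 100
mapped_fraction
#> [1] 0.9
```

If two peptides shared positions, this sum would count those positions twice, so a position-based approach is the safer choice in general.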

All answers are welcome as always, but tidyverse ones are preferred.

T.B. asked Dec 07 '25 07:12


1 Answer

This solution should work even when df$Initial.AA does not start at 1:

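library(dplyr)
library(tidyr)

# Expand each peptide into one row per covered position
df <- 
  rowwise(df) |> 
  mutate(seq = list(seq(Initial.AA, Final.AA, by = 1))) |> 
  unnest(seq)

# Fraction of protein positions found in at least one peptide
1 - sum(!Protein$AA %in% df$seq)/length(Protein$AA)
#> [1] 0.9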

Created on 2024-04-16 with reprex v2.1.0
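For comparison, the same idea can be sketched in base R without overwriting df. This is a minimal sketch rather than a replacement for the tidyverse answer: coverage_fraction is a hypothetical helper name, and the second call uses made-up overlapping peptides only to show that shared positions are not counted twice.

```r
Protein <- data.frame(AA = 1:100)
df <- data.frame(
  Peptides = c("Peptide_A", "Peptide_B", "Peptide_C", "Peptide_D"),
  Initial.AA = c(1, 23, 59, 77),
  Final.AA = c(18, 58, 70, 100)
)

# Hypothetical helper: fraction of protein positions covered by any peptide.
coverage_fraction <- function(positions, initial, final) {
  # Expand every peptide into its positions and pool them in one vector.
  covered <- unlist(Map(seq, initial, final))
  # %in% only asks whether a position appears at all, so duplicates
  # from overlapping peptides do not inflate the result.
  mean(positions %in% covered)
}

coverage <- coverage_fraction(Protein$AA, df$Initial.AA, df$Final.AA)
coverage
#> [1] 0.9

# Made-up overlapping peptides (1:18 and 10:30) cover positions 1:30 once.
overlap <- coverage_fraction(Protein$AA, c(1, 10), c(18, 30))
overlap
#> [1] 0.3
```

Taking the mean of the logical vector is equivalent to the 1 - sum(!...)/length(...) expression above, since both give the share of TRUE values.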

Captain Hat answered Dec 08 '25 21:12


