
Quantifying frequency of codons in a transmembrane sequence - apply function?

I am trying to look at the codon usage within the transmembrane domains of certain proteins.

To do this, I have the sequences for the TM domain, and I want to search these sequences for how often certain codons appear (the frequency).

Ideally I would like to add new columns to an existing dataframe with the counts for each codon per gene. Like this hypothetical data:

Gene_ID          TM_domain_Seq  AAA  CAC  GGA
ENSG00000003989  TGGAGCCTCGCTC  0    0    1
ENSG00000003989  TGGAGCCTCGCTC  0    0    1
ENSG00000003989  TGGAGCCTCGCTC  0    0    1
ENSG00000003989  TGGAGCCTCGCTC  0    0    1
ENSG00000003989  TGGAGCCTCGCTC  0    0    1

I have tried the following: a function that counts how often a particular codon comes up, applied to each TM sequence. The problem I am having is how to add a new column to my data frame for each codon, and how to get the codon counts into those columns.

I have also tried for loops, but they take far too long.

amino_search <- function(seq) {
  
  count <- str_count(seq, pattern = codons)
  return(count)
}

codon_search <- function(TMseq) {
  
 High_cor$Newcol <- unlist(lapply(TMseq, amino_search))
}

Any help would be greatly appreciated. Thank you!

asked Oct 20 '25 at 17:10 by cambio


1 Answer

Create a vector of all 64 possible codons, then count each of them in every sequence with stringr::str_count:

# All 64 possible codons (every 3-letter combination of A, T, G, C)
comb <- expand.grid(replicate(3, c("A", "T", "G", "C"), simplify = FALSE)) |>
  apply(MARGIN = 1, FUN = paste, collapse = "")
  # apply(X = _, 1, FUN = paste, collapse = "")  # equivalent, using the `_` placeholder (R >= 4.2)

df[, comb] <- t(sapply(df$TM_domain_Seq, stringr::str_count, comb))
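
One caveat with the call above: str_count counts non-overlapping matches, so a run such as AAAA is counted as one AAA, not two. If overlapping occurrences matter for your analysis, a minimal base-R sketch using a zero-width lookahead could look like this. Note that count_overlapping is a hypothetical helper name introduced here for illustration; it is not part of stringr, and it swaps str_count for gregexpr.

```r
# Hypothetical helper: count each codon at any offset in ONE sequence,
# including overlapping occurrences, via a zero-width lookahead.
count_overlapping <- function(seq, codons) {
  vapply(codons, function(codon) {
    hits <- gregexpr(paste0("(?=", codon, ")"), seq, perl = TRUE)[[1]]
    # gregexpr() returns -1 when the pattern does not match at all
    if (hits[1] == -1L) 0L else length(hits)
  }, integer(1))
}

# "AAAA" contains AAA at offsets 1 and 2
count_overlapping("AAAA", c("AAA", "GGA"))
```

To use it on a whole column, it should be possible to swap it in for str_count in the sapply call, for example t(sapply(df$TM_domain_Seq, count_overlapping, comb)).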

If you want only in-frame codons, one way to do that is to add a space every three characters:

gsub('(.{3})', '\\1 ', df$TM_domain_Seq[1])
#[1] "TGG AGC CTC GCT C"

df[, comb] <- t(sapply(gsub('(.{3})', '\\1 ', df$TM_domain_Seq), stringr::str_count, comb))
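
If you would rather not go through regular expressions for the in-frame case, the same idea can be sketched in base R by cutting each sequence into complete triplets and tabulating them. This is an alternative to the gsub trick above, not what the answer's code does; count_in_frame and the toy data frame are assumptions made for illustration, with column names taken from the question.

```r
# All 64 codons, built the same way as `comb` above
comb <- apply(expand.grid(replicate(3, c("A", "T", "G", "C"), simplify = FALSE)),
              1, paste, collapse = "")

# Hypothetical helper: count in-frame codons of ONE sequence.
# Trailing bases that do not fill a triplet are dropped; triplets
# not found in `codons` (e.g. containing N) are ignored.
count_in_frame <- function(seq, codons) {
  n <- nchar(seq) %/% 3
  starts <- seq_len(n) * 3 - 2                 # 1, 4, 7, ...
  triplets <- substring(seq, starts, starts + 2)
  tabulate(match(triplets, codons), nbins = length(codons))
}

# Toy data (assumed), mirroring the columns in the question
df <- data.frame(
  Gene_ID = c("ENSG00000003989", "toy_gene"),
  TM_domain_Seq = c("TGGAGCCTCGCTC", "AAACACGGA"),
  stringsAsFactors = FALSE
)

# One row per sequence, one column per codon
counts <- t(vapply(df$TM_domain_Seq, count_in_frame,
                   integer(length(comb)), codons = comb))
dimnames(counts) <- list(NULL, comb)
df <- cbind(df, counts)
```

With this toy data, the first sequence is read as TGG AGC CTC GCT (the final C is dropped), so its GGA count is 0 even though GGA occurs out of frame.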

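Output of the first (any-frame) version; GGA is 1 because it occurs at positions 2-4 of the sequence, which is not in frame: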

# A tibble: 5 × 66
  Gene_ID TM_domain_Seq   AAA   CAC   GGA   TAA   GAA   CAA   ATA   TTA   GTA   CTA   AGA   TGA   CGA   ACA   TCA
  <chr>   <chr>         <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
2 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
3 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
4 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
5 ENSG00… TGGAGCCTCGCTC     0     0     1     0     0     0     0     0     0     0     0     0     0     0     0
# … with 49 more variables: GCA <int>, CCA <int>, AAT <int>, TAT <int>, GAT <int>, CAT <int>, ATT <int>,
#   TTT <int>, GTT <int>, CTT <int>, AGT <int>, TGT <int>, GGT <int>, CGT <int>, ACT <int>, TCT <int>,
#   GCT <int>, CCT <int>, AAG <int>, TAG <int>, GAG <int>, CAG <int>, ATG <int>, TTG <int>, GTG <int>,
#   CTG <int>, AGG <int>, TGG <int>, GGG <int>, CGG <int>, ACG <int>, TCG <int>, GCG <int>, CCG <int>,
#   AAC <int>, TAC <int>, GAC <int>, ATC <int>, TTC <int>, GTC <int>, CTC <int>, AGC <int>, TGC <int>,
#   GGC <int>, CGC <int>, ACC <int>, TCC <int>, GCC <int>, CCC <int>

answered Oct 23 '25 at 05:10 by Maël