I am trying to look at the codon usage within the transmembrane domains of certain proteins.
To do this, I have the sequences for the TM domain, and I want to search these sequences for how often certain codons appear (the frequency).
Ideally, I would like to add new columns to an existing data frame holding the count of each codon per gene, as in this hypothetical data:
| Gene ID | TM_domain_Seq | AAA | CAC | GGA |
|---|---|---|---|---|
| ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
| ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
| ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
| ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
| ENSG00000003989 | TGGAGCCTCGCTC | 0 | 0 | 1 |
I have tried the following: a function that counts how often a particular codon comes up, applied to each TM sequence. What I cannot work out is how to add a new column to my data frame for each codon, and how to get the codon frequencies into those columns.
I have tried for loops, but they take far too long.
amino_search <- function(seq) {
  count <- str_count(seq, pattern = codons)
  return(count)
}
codon_search <- function(TMseq) {
  High_cor$Newcol <- unlist(lapply(TMseq, amino_search))
}
Any help would be greatly appreciated. Thank you!
Create the vector of all 64 possible codons, then use str_count, which counts matches at any position in the sequence regardless of reading frame:
comb <- expand.grid(replicate(3, c("A", "T", "G", "C"), simplify = FALSE)) |>
  apply(MARGIN = 1, FUN = paste, collapse = "")
  # apply(X = _, 1, FUN = paste, collapse = "")  # equivalent, using the `_` placeholder (R >= 4.2)
df[, comb] <- t(sapply(df$TM_domain_Seq, stringr::str_count, comb))
If you want only in-frame codons, one way to do that is to add a space every three characters:
gsub('(.{3})', '\\1 ', df$TM_domain_Seq[1])
#[1] "TGG AGC CTC GCT C"
df[, comb] <- t(sapply(gsub('(.{3})', '\\1 ', df$TM_domain_Seq), stringr::str_count, comb))
Output (from the first version, which counts codons in any frame):
# A tibble: 5 × 66
Gene_ID TM_domain_Seq AAA CAC GGA TAA GAA CAA ATA TTA GTA CTA AGA TGA CGA ACA TCA
<chr> <chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
2 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
3 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
4 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
5 ENSG00… TGGAGCCTCGCTC 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
# … with 49 more variables: GCA <int>, CCA <int>, AAT <int>, TAT <int>, GAT <int>, CAT <int>, ATT <int>,
# TTT <int>, GTT <int>, CTT <int>, AGT <int>, TGT <int>, GGT <int>, CGT <int>, ACT <int>, TCT <int>,
# GCT <int>, CCT <int>, AAG <int>, TAG <int>, GAG <int>, CAG <int>, ATG <int>, TTG <int>, GTG <int>,
# CTG <int>, AGG <int>, TGG <int>, GGG <int>, CGG <int>, ACG <int>, TCG <int>, GCG <int>, CCG <int>,
# AAC <int>, TAC <int>, GAC <int>, ATC <int>, TTC <int>, GTC <int>, CTC <int>, AGC <int>, TGC <int>,
# GGC <int>, CGC <int>, ACC <int>, TCC <int>, GCC <int>, CCC <int>
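For comparison, here is a minimal base R sketch of the in-frame count that avoids the regex step: it cuts each sequence into consecutive triplets with substring and tabulates them against the full codon list. This is an alternative to the gsub/str_count approach above, not the same code. The toy data frame, the second gene ID ("geneB") and the helper name count_in_frame are made up for illustration, and any trailing one or two bases that do not form a complete codon are assumed to be ignorable.

```r
# Hypothetical toy data; "geneB" and its sequence are invented for the example.
df <- data.frame(
  Gene_ID = c("ENSG00000003989", "geneB"),
  TM_domain_Seq = c("TGGAGCCTCGCTC", "AAAAAACAC"),
  stringsAsFactors = FALSE
)

# All 64 possible codons.
bases <- c("A", "T", "G", "C")
codons <- apply(expand.grid(bases, bases, bases), 1, paste, collapse = "")

# Hypothetical helper: count codons in reading frame 1 of a single sequence.
# Returns an integer vector in the same order as `codons`.
count_in_frame <- function(dna, codons) {
  n <- nchar(dna) %/% 3                      # number of complete codons
  if (n == 0) return(integer(length(codons)))
  starts <- seq(1, by = 3, length.out = n)   # 1, 4, 7, ...
  triplets <- substring(dna, starts, starts + 2)
  as.vector(table(factor(triplets, levels = codons)))
}

# One row per sequence, one column per codon.
counts <- t(vapply(df$TM_domain_Seq, count_in_frame,
                   integer(length(codons)), codons = codons,
                   USE.NAMES = FALSE))
colnames(counts) <- codons

result <- cbind(df, as.data.frame(counts))
result[, c("Gene_ID", "GGA", "TGG", "AAA", "CAC")]
```

For TGGAGCCTCGCTC the in-frame codons are TGG, AGC, CTC and GCT, so GGA is 0 here even though the string "GGA" occurs in the sequence across the TGG|AGC boundary. Whether frame 1 is the right frame depends on how the TM domain sequences were extracted.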