Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Defining a function that counts relative frequency of Amino Acids

I'm trying to calculate codon frequency within a given sequence of DNA.

For example:

sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']

for XX in codons:

frequency  = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))

Note that XX2 and XX3 will not always be in the sequence. Some codons may or may not have multiple codons.

Example: Lysine has 2 codons, AAA and AAG

so the frequency of

AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))

How can I do this for EVERY codon in the list? How do I account for multiple codons?

like image 849
Francis Avatar asked Feb 03 '26 13:02

Francis


2 Answers

use defaultdict

from collections import defaultdict

mydict = defaultdict(int)

for aa in mysecuence:
    mydict[aa] +=1

This works for aminoacids (proteins).
For codons, you should iterate on the sequence in 3 positions steps to get the keys of the defaultdict. For example:

>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a


['GAU', 'CAC', 'TUG', 'CCA']

EDIT: If you want to calculate degeneration, you should prepare a dictionary relating each codon (key) with its degenerated codons (value, list of codons). To calculate the frecuency, from the defaultdict you can get the counts for each codon, then for each codon you calculate the sum of the counts of the degenerated codons read from the dictionary of codons indicated above. Then you can calculate the frecuency.

EDIT 2: Here you have a real example:

from collections import defaultdict

#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
       "cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
       "ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
       "ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
       "gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
       "aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")

seq = rna.upper().replace('T', 'U')

#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
               ('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
               ('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
               ('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
               ('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
               ('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
               ('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
               ('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
               ('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
               ('UAA', 'UGA', 'UAG'))

#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
    for codon in codons:
        degen_dict[codon] = codons

#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]

#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
    counts[codon] +=1

#actual calculation of frecuencies
data = {}
for codon in query_codons:
    if codon in  degen_dict:
        totals = sum(counts[deg] for deg in degen_dict[codon])
        frecuency = float(counts[codon]) / totals
    else:
        frecuency = 1.00

    data[codon] = frecuency

#print results
for codon, frecuency in data.iteritems():
    print "%s  -> %.2f" %(codon, frecuency)


#produces:
GUC  -> 0.57
AUA  -> 1.00
ACG  -> 0.50
AAC  -> 1.00
CCU  -> 0.25
UAU  -> 1.00
..........
GCU  -> 0.19
GAU  -> 1.00
UAG  -> 0.33
CUC  -> 0.38
UUA  -> 0.13
UGA  -> 0.33
like image 102
joaquin Avatar answered Feb 06 '26 05:02

joaquin


If your sequence is in the correct reading frame:

>>> import collections
>>> 
>>> code = {     'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
...              'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
...              'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
...              'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
...              'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
...              'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
...              'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
...              'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
...              'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
...              'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
...              'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...              'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
...              'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
...              'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
...              'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
...              'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
...         }
>>> def count_codons(cds):
...     counts = collections.defaultdict(int)
...     for i in range(0,len(cds),3):
...        codon = cds[i:i+3]
...        counts[codon] += 1
...     return counts
... 
>>> def translate(cds, code):
...     return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
... 
>>> seq = 'ATGAAGAAA'
>>> 
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>> 
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>> 

other info:

I think you are asking to find something called codon usage.

There are tools online which allow you to find codon usage. This one also allows for offline use.

http://www.bioinformatics.org/sms2/codon_usage.html

and results (in this 'Fraction' is what you are asking for):

Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid   Codon     Number        /1000     Fraction   .. 

Ala      GCG         0.00         0.00         0.00 
Ala      GCA         0.00         0.00         0.00 
Ala      GCT         0.00         0.00         0.00 
Ala      GCC         0.00         0.00         0.00 

Cys      TGT         0.00         0.00         0.00 
Cys      TGC         0.00         0.00         0.00 

Asp      GAT         0.00         0.00         0.00 
Asp      GAC         0.00         0.00         0.00 

Glu      GAG         0.00         0.00         0.00 
Glu      GAA         0.00         0.00         0.00 

Phe      TTT         0.00         0.00         0.00 
Phe      TTC         0.00         0.00         0.00 

Gly      GGG         0.00         0.00         0.00 
Gly      GGA         0.00         0.00         0.00 
Gly      GGT         0.00         0.00         0.00 
Gly      GGC         0.00         0.00         0.00 

His      CAT         0.00         0.00         0.00 
His      CAC         0.00         0.00         0.00 

Ile      ATA         0.00         0.00         0.00 
Ile      ATT         0.00         0.00         0.00 
Ile      ATC         0.00         0.00         0.00 

Lys      AAG         1.00       333.33         0.50 
Lys      AAA         1.00       333.33         0.50 

Leu      TTG         0.00         0.00         0.00 
Leu      TTA         0.00         0.00         0.00 
Leu      CTG         0.00         0.00         0.00 
Leu      CTA         0.00         0.00         0.00 
Leu      CTT         0.00         0.00         0.00 
Leu      CTC         0.00         0.00         0.00 

Met      ATG         1.00       333.33         1.00 

Asn      AAT         0.00         0.00         0.00 
Asn      AAC         0.00         0.00         0.00 

Pro      CCG         0.00         0.00         0.00 
Pro      CCA         0.00         0.00         0.00 
Pro      CCT         0.00         0.00         0.00 
Pro      CCC         0.00         0.00         0.00 

Gln      CAG         0.00         0.00         0.00 
Gln      CAA         0.00         0.00         0.00 

Arg      AGG         0.00         0.00         0.00 
Arg      AGA         0.00         0.00         0.00 
Arg      CGG         0.00         0.00         0.00 
Arg      CGA         0.00         0.00         0.00 
Arg      CGT         0.00         0.00         0.00 
Arg      CGC         0.00         0.00         0.00 

Ser      AGT         0.00         0.00         0.00 
Ser      AGC         0.00         0.00         0.00 
Ser      TCG         0.00         0.00         0.00 
Ser      TCA         0.00         0.00         0.00 
Ser      TCT         0.00         0.00         0.00 
Ser      TCC         0.00         0.00         0.00 

Thr      ACG         0.00         0.00         0.00 
Thr      ACA         0.00         0.00         0.00 
Thr      ACT         0.00         0.00         0.00 
Thr      ACC         0.00         0.00         0.00 

Val      GTG         0.00         0.00         0.00 
Val      GTA         0.00         0.00         0.00 
Val      GTT         0.00         0.00         0.00 
Val      GTC         0.00         0.00         0.00 

Trp      TGG         0.00         0.00         0.00 

Tyr      TAT         0.00         0.00         0.00 
Tyr      TAC         0.00         0.00         0.00 

End      TGA         0.00         0.00         0.00 
End      TAG         0.00         0.00         0.00 
End      TAA         0.00         0.00         0.00 

cusp is the codon usage tool from EMBOSS which also may be worth taking a look at.

You may want to checkout BioPython for working with biological sequences. I believe they have a codon usage module.

like image 37
dting Avatar answered Feb 06 '26 05:02

dting