I'm trying to calculate codon frequency within a given sequence of DNA.
For example:
sequence = 'ATGAAGAAA'
codons = ['ATG', 'AAG', 'AAA']
for XX in codons:
frequency = codons.count(XX)/(codons.count(XX)+codons.count(XX2)+codons.count(XX3))
Note that XX2 and XX3 will not always be in the sequence. Some codons may or may not have multiple codons.
Example: Lysine has 2 codons, AAA and AAG
so the frequency of
AAA = codons.count('AAA')/(codons.count('AAA') + codons.count('AAG'))
How can I do this for EVERY codon in the list? How do I account for multiple codons?
use defaultdict
from collections import defaultdict
mydict = defaultdict(int)
for aa in mysecuence:
mydict[aa] +=1
This works for aminoacids (proteins).
For codons, you should iterate on the sequence in 3 positions steps to get the keys of the defaultdict. For example:
>>> mysec = "GAUCACTUGCCA"
>>> a = [mysec[i:i+3] for i in range(0,len(mysec), 3)]
>>> print a
['GAU', 'CAC', 'TUG', 'CCA']
EDIT: If you want to calculate degeneration, you should prepare a dictionary relating each codon (key) with its degenerated codons (value, list of codons). To calculate the frecuency, from the defaultdict you can get the counts for each codon, then for each codon you calculate the sum of the counts of the degenerated codons read from the dictionary of codons indicated above. Then you can calculate the frecuency.
EDIT 2: Here you have a real example:
from collections import defaultdict
#the first 600 nucleotides from GenBank: AAHX01097212.1
rna = ("tcccccgcagcttcgggaacgtgcgggctcgggagggaggggcctggcgccgggcgcgcg"
"cctgcgccccaccccgccccaccctggcgggtctcgcgcgcccggcccgcctcctgtcaa"
"ccccagcgcggcggtcaggtggtccccagcccttggccccagcctccagcttcctggtcc"
"ctcgggctctgagtcctgtctccggcagatcgcctttctgattgttctcctgcgcagctg"
"gaggtgtatagcccctagccgagctatggtgcctcagcagatgtgaggaggtagtgggtc"
"aggataaacccgcgcactccataataacgtgccagggctcagtgacttgggtctgcatta")
seq = rna.upper().replace('T', 'U')
#RNA codon table from http://en.wikipedia.org/wiki/Genetic_code
degenerated = (('GCU', 'GCC', 'GCA', 'GCG'),
('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
('AAA', 'AAG'), ('AAU', 'AAC'), ('GAU', 'GAC'),
('UUU', 'UUC'), ('UGU', 'UGC'), ('CCU', 'CCC', 'CCA', 'CCG'),
('CAA', 'CAG'), ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
('GAA', 'GAG'), ('ACU', 'ACC', 'ACA', 'ACG'),
('GGU', 'GGC', 'GGA', 'GGG'), ('CAU', 'CAC'), ('UAU', 'UAC'),
('AUU', 'AUC', 'AUA'), ('GUU', 'GUC', 'GUA', 'GUG'),
('UAA', 'UGA', 'UAG'))
#prepare the dictio of degenerated codons
degen_dict = {}
for codons in degenerated:
for codon in codons:
degen_dict[codon] = codons
#query_codons
max_seq = len(seq)
query_codons = [seq[i:i+3] for i in range(0, max_seq, 3)]
#prepare dictio of counts:
counts = defaultdict(int)
for codon in query_codons:
counts[codon] +=1
#actual calculation of frecuencies
data = {}
for codon in query_codons:
if codon in degen_dict:
totals = sum(counts[deg] for deg in degen_dict[codon])
frecuency = float(counts[codon]) / totals
else:
frecuency = 1.00
data[codon] = frecuency
#print results
for codon, frecuency in data.iteritems():
print "%s -> %.2f" %(codon, frecuency)
#produces:
GUC -> 0.57
AUA -> 1.00
ACG -> 0.50
AAC -> 1.00
CCU -> 0.25
UAU -> 1.00
..........
GCU -> 0.19
GAU -> 1.00
UAG -> 0.33
CUC -> 0.38
UUA -> 0.13
UGA -> 0.33
If your sequence is in the correct reading frame:
>>> import collections
>>>
>>> code = { 'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
... 'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
... 'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
... 'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
... 'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
... 'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
... 'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
... 'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
... 'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
... 'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
... 'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
... 'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
... 'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
... 'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
... 'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
... 'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
... }
>>> def count_codons(cds):
... counts = collections.defaultdict(int)
... for i in range(0,len(cds),3):
... codon = cds[i:i+3]
... counts[codon] += 1
... return counts
...
>>> def translate(cds, code):
... return "".join((code[cds[i:i+3]] for i in range(0, len(cds), 3)))
...
>>> seq = 'ATGAAGAAA'
>>>
>>> codon_counts = count_codons(seq.lower())
>>> trans_seq = translate(seq.lower(), code)
>>>
>>> [(codon, code[codon], float(codon_counts[codon])/trans_seq.count(code[codon])) for codon in codon_counts.keys()]
[('atg', 'M', 1.0), ('aag', 'K', 0.5), ('aaa', 'K', 0.5)]
>>>
other info:
I think you are asking to find something called codon usage.
There are tools online which allow you to find codon usage. This one also allows for offline use.
http://www.bioinformatics.org/sms2/codon_usage.html
and results (in this 'Fraction' is what you are asking for):
Results for 9 residue sequence "sample sequence one" starting "ATGAAGAAA"
AmAcid Codon Number /1000 Fraction ..
Ala GCG 0.00 0.00 0.00
Ala GCA 0.00 0.00 0.00
Ala GCT 0.00 0.00 0.00
Ala GCC 0.00 0.00 0.00
Cys TGT 0.00 0.00 0.00
Cys TGC 0.00 0.00 0.00
Asp GAT 0.00 0.00 0.00
Asp GAC 0.00 0.00 0.00
Glu GAG 0.00 0.00 0.00
Glu GAA 0.00 0.00 0.00
Phe TTT 0.00 0.00 0.00
Phe TTC 0.00 0.00 0.00
Gly GGG 0.00 0.00 0.00
Gly GGA 0.00 0.00 0.00
Gly GGT 0.00 0.00 0.00
Gly GGC 0.00 0.00 0.00
His CAT 0.00 0.00 0.00
His CAC 0.00 0.00 0.00
Ile ATA 0.00 0.00 0.00
Ile ATT 0.00 0.00 0.00
Ile ATC 0.00 0.00 0.00
Lys AAG 1.00 333.33 0.50
Lys AAA 1.00 333.33 0.50
Leu TTG 0.00 0.00 0.00
Leu TTA 0.00 0.00 0.00
Leu CTG 0.00 0.00 0.00
Leu CTA 0.00 0.00 0.00
Leu CTT 0.00 0.00 0.00
Leu CTC 0.00 0.00 0.00
Met ATG 1.00 333.33 1.00
Asn AAT 0.00 0.00 0.00
Asn AAC 0.00 0.00 0.00
Pro CCG 0.00 0.00 0.00
Pro CCA 0.00 0.00 0.00
Pro CCT 0.00 0.00 0.00
Pro CCC 0.00 0.00 0.00
Gln CAG 0.00 0.00 0.00
Gln CAA 0.00 0.00 0.00
Arg AGG 0.00 0.00 0.00
Arg AGA 0.00 0.00 0.00
Arg CGG 0.00 0.00 0.00
Arg CGA 0.00 0.00 0.00
Arg CGT 0.00 0.00 0.00
Arg CGC 0.00 0.00 0.00
Ser AGT 0.00 0.00 0.00
Ser AGC 0.00 0.00 0.00
Ser TCG 0.00 0.00 0.00
Ser TCA 0.00 0.00 0.00
Ser TCT 0.00 0.00 0.00
Ser TCC 0.00 0.00 0.00
Thr ACG 0.00 0.00 0.00
Thr ACA 0.00 0.00 0.00
Thr ACT 0.00 0.00 0.00
Thr ACC 0.00 0.00 0.00
Val GTG 0.00 0.00 0.00
Val GTA 0.00 0.00 0.00
Val GTT 0.00 0.00 0.00
Val GTC 0.00 0.00 0.00
Trp TGG 0.00 0.00 0.00
Tyr TAT 0.00 0.00 0.00
Tyr TAC 0.00 0.00 0.00
End TGA 0.00 0.00 0.00
End TAG 0.00 0.00 0.00
End TAA 0.00 0.00 0.00
cusp is the codon usage tool from EMBOSS which also may be worth taking a look at.
You may want to checkout BioPython for working with biological sequences. I believe they have a codon usage module.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With