I have a list of a few hundred amino acid sequences called aa_seq, it looks like this: ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK',...]. Each sequence is 27 letters long. I have to determine the most used amino acid for every position (1-27) and at what frequency that is.
So far I have:
count_dict = {}
counter = count_dict.values()
aa_list = ['A', 'C', 'D', 'E' ,'F' ,'G' ,'H' ,'I' ,'K' ,'L' , #one-letter code for amino acids
'M' ,'N' ,'P' ,'Q' ,'R' ,'S' ,'T' ,'V' ,'W' ,'Y']
for p in range(0,26): #first round:looks at the first position in each sequence
for s in range(0,len(aa_seq)): #goes through all sequences of the list
for item in aa_list: #and checks for the occurrence of each amino acid letter (=item)
if item in aa_seq[s][p]:
count_dict[item] #if that letter occurs at the respective position, make it a key in the dictionary
counter += 1 #and increase its counter (the value, as definded above) by one
print count_dict
It says KeyError: 'A', and it's pointing to the line count_dict[item]. So the item of the aa_list apparently can't be added as a key this way..? How do I do that? And it also gave an error "'int' object is not iterable" concerning the counter. How else can the counter be increased?
You could use the Counter class
>>> from collections import Counter
>>> l = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
>>> s = [Counter([l[j][i] for j in range(len(l))]).most_common()[0] for i in range(27)]
>>> s
[('A', 1),
('A', 1),
('Y', 1),
('I', 1),
('N', 1),
('Y', 1),
('P', 2),
('M', 4),
('S', 2),
('Q', 1),
('E', 2),
('Q', 1),
('I', 1),
('I', 1),
('A', 1),
('Q', 1),
('A', 1),
('I', 1),
('I', 1),
('Q', 1),
('E', 2),
('C', 1),
('Q', 1),
('A', 1),
('Q', 1),
('I', 1),
('I', 1)]
However i might be way to inefficient if you have large data sets.
Here's a modified, working version of your code. It's not efficient but it should output the correct result.
A few notes :
range(0,26) only has 26 elements : from 0 to 25 (inclusive).defaultdict helps you define 0 for each start value.count_dict[item] += 1from collections import defaultdict
aa_seq = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI',
'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
aa_list = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', # one-letter code for amino acids
'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
for p in range(27): # first round:looks at the first position in each sequence
count_dict = defaultdict(int) # initialize counter with 0 as default value
for s in range(0, len(aa_seq)): # goes through all sequences of the list
# and checks for the occurrence of each amino acid letter (=item)
for item in aa_list:
if item in aa_seq[s][p]:
# if that letter occurs at the respective position, make it a
# key in the dictionary
count_dict[item] += 1
print(max(count_dict.items(), key=lambda x: x[1]))
It outputs :
('R', 1)
('S', 1)
('Y', 1)
('S', 1)
('E', 1)
('P', 1)
('P', 2)
('M', 4)
...
You don't need that many loops, you just need to iterate once over every character of every sequence.
Also, there's no need to reinvent the wheel: Counter and most_common are better alternatives than defaultdict and max.
from collections import Counter
aa_seqs = ['AFYIVHPMFSELINFQNEGHECQCQCG', 'KVHSLPGMSDNGSPAVLPKTEFNKYKI', 'RAQVEDLMSLSPHVENASIPKGSTPIP', 'TSTNNYPMVQEQAILSCIEQTMVADAK']
counters = [Counter() for i in range(27)]
for aa_seq in aa_seqs:
for (i, aa) in enumerate(aa_seq):
counters[i][aa] += 1
most_commons = [counter.most_common()[0] for counter in counters]
print(most_commons)
It outputs :
[('K', 1), ('A', 1), ('Y', 1), ('N', 1), ('N', 1), ('Y', 1), ('P', 2), ('M', 4), ('S', 2), ('Q', 1), ('E', 2), ('G', 1), ('H', 1), ('N', 1), ('L', 1), ('N', 1), ('N', 1), ('I', 1), ('G', 1), ('H', 1), ('E', 2), ('G', 1), ('N', 1), ('K', 1), ('Y', 1), ('K', 1), ('G', 1)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With