Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count amino acids in fasta formated file?

I found the code that parses the fasta frmated file. I need to count how many A, T, G and so on is in each sequence, for example:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL 

in this sequence theres:

M - 2
R - 4
G - 1
L - 3
I - 2
P - 1

The code is very simple:

 def FASTA(filename):
  try:
    f = file(filename)
  except IOError:                     
    print "The file, %s, does not exist" % filename
    return

  order = []
  sequences = {}

  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('\n')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      sequences[name] += line.rstrip('\n').rstrip('*')

  print "%d sequences found" % len(order)
  return order, sequences

x, y = FASTA("drosoph_b.fasta")

but how can I count those amino acids? I dont want to use BioPython, I would like to know how to do this with, for example count...

like image 217
mazix Avatar asked Jan 28 '26 04:01

mazix


2 Answers

As katrielalex points out, collections.Counter is a great fit for the task:

In [1]: from collections import Counter

In [2]: Counter('MRMRGRRLLPIIL')
Out[2]: Counter({'R': 4, 'L': 3, 'M': 2, 'I': 2, 'G': 1, 'P': 1})

You can apply it to the values of sequences dict in your code.

However, I'd recommend against using this code in real life. Libraries such as BioPython do it better. For example, the code you show will generate quite bulky data structures. As FASTA files are sometimes quite large, you can possibly run out of memory. Also, it doesn't handle possible exceptions in the best way possible.

Finally, using libraries just saves you time.

Example code with BioPython:

In [1]: from Bio import SeqIO

In [2]: from collections import Counter

In [3]: for entry in SeqIO.parse('1.fasta', 'fasta'):
   ...:     print Counter(entry.seq)
   ...:     
Counter({'R': 4, 'L': 3, 'I': 2, 'M': 2, 'G': 1, 'P': 1})
like image 159
Lev Levitsky Avatar answered Jan 29 '26 20:01

Lev Levitsky


This can be obtained using very simple bash commands, my answer is below

cat input.fasta #my input file
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
    MRMRGRRLLPIIL

cat input.fasta | grep -v ">" | fold -w1 | sort | uniq -c

Output:

   1 G
   2 I
   3 L
   2 M
   1 P
   4 R

fold -w1 splits at every character, you sort them and count the unique ones

like image 41
bala Avatar answered Jan 29 '26 19:01

bala