How to count amino acids in fasta formated file?

Question

I found the code that parses the fasta frmated file. I need to count how many A, T, G and so on is in each sequence, for example:

>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
MRMRGRRLLPIIL

in this sequence theres:

M - 2
R - 4
G - 1
L - 3
I - 2
P - 1

The code is very simple:

 def FASTA(filename):
  try:
    f = file(filename)
  except IOError:                     
    print "The file, %s, does not exist" % filename
    return

  order = []
  sequences = {}

  for line in f:
    if line.startswith('>'):
      name = line[1:].rstrip('
')
      name = name.replace('_', ' ')
      order.append(name)
      sequences[name] = ''
    else:
      sequences[name] += line.rstrip('
').rstrip('*')

  print "%d sequences found" % len(order)
  return order, sequences

x, y = FASTA("drosoph_b.fasta")

but how can I count those amino acids? I dont want to use BioPython, I would like to know how to do this with, for example count...

Lev Levitsky · Accepted Answer

As katrielalex points out, collections.Counter is a great fit for the task:

In [1]: from collections import Counter

In [2]: Counter('MRMRGRRLLPIIL')
Out[2]: Counter({'R': 4, 'L': 3, 'M': 2, 'I': 2, 'G': 1, 'P': 1})

You can apply it to the values of sequences dict in your code.

However, I'd recommend against using this code in real life. Libraries such as BioPython do it better. For example, the code you show will generate quite bulky data structures. As FASTA files are sometimes quite large, you can possibly run out of memory. Also, it doesn't handle possible exceptions in the best way possible.

Finally, using libraries just saves you time.

Example code with BioPython:

In [1]: from Bio import SeqIO

In [2]: from collections import Counter

In [3]: for entry in SeqIO.parse('1.fasta', 'fasta'):
   ...:     print Counter(entry.seq)
   ...:     
Counter({'R': 4, 'L': 3, 'I': 2, 'M': 2, 'G': 1, 'P': 1})

bala · Answer

This can be obtained using very simple bash commands, my answer is below

cat input.fasta #my input file
>gi|7290019|gb|AAF45486.1| (AE003417) EG:BACR37P7.1 gene product [Drosophila melanogaster]
    MRMRGRRLLPIIL

cat input.fasta | grep -v ">" | fold -w1 | sort | uniq -c

Output:

fold -w1 splits at every character, you sort them and count the unique ones

How to count amino acids in fasta formated file?

Tags:

python

bioinformatics

biopython

fasta

mazix

2 Answers

Lev Levitsky

bala

Recent Activity

Donate For Us

How to count amino acids in fasta formated file?

Tags:

python

bioinformatics

biopython

fasta

mazix

2 Answers

Lev Levitsky

bala

Related questions

Recent Activity

Donate For Us