
Bioinformatics: Find Genes given a Genome String

Biologists use a sequence of the letters A, C, T, and G to model a genome. A gene is a substring of a genome that starts after a triplet ATG and ends before a triplet TAG, TAA, or TGA. Furthermore, the length of a gene string is a multiple of 3, and the gene does not contain any of the triplets ATG, TAG, TAA, or TGA.

Ideally:

Enter a genome string: TTATGTTTTAAGGATGGGGCGTTAGTT #Enter   
TTT
GGGCGT
-----------------
Enter a genome string: TGTGTGTATAT
No Genes Were Found

So far, I have:

def findGene(gene):
    final = ""
    genep = gene.split("ATG")
    for part in genep:
        for chr in part:
            for i in range(0, len(chr)):
                if genePool(chr[i:i + 3]) == 1:
                    break
                else:
                    final += (chr[i+i + 3] + "\n")
    return final

def genePool(part):
    g1 = "ATG"
    g2 = "TAG"
    g3 = "TAA"
    g4 = "TGA"
    if (part.count(g1) != 0) or (part.count(g2) != 0) or (part.count(g3) != 0) or (part.count(g4) != 0):
        return 1

def main():
    geneinput = input("Enter a genome string: ")
    print(findGene(geneinput))

main()
# TTATGTTTTAAGGATGGGGCGTTAGTT

I keep running into errors

To be completely honest, this is really not working for me - I think I have hit a dead end with these lines of code - a new approach may be helpful.

Thanks in advance!

The error that I have been getting -

Enter a genome string: TTATGTTTTAAGGATGGGGCGTTAGTT
Traceback (most recent call last):
  File "D:\Python\Chapter 8\Bioinformatics.py", line 40, in <module>
    main()
  File "D:\Python\Chapter 8\Bioinformatics.py", line 38, in main
    print(findGene(geneinput))
  File "D:\Python\Chapter 8\Bioinformatics.py", line 25, in findGene
    final += (chr[i+i + 3] + "\n")
IndexError: string index out of range

Like I said before, I'm not really sure if I am on the right track to solve the issue with my current code - any new ideas w/ pseudo code is appreciated!

Matt Rumbel asked Nov 22 '25 17:11



1 Answer

This can be done with a regular expression:

import re

pattern = re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')
# findall returns only the captured group, i.e. the text between start and stop
print(pattern.findall('TTATGTTTTAAGGATGGGGCGTTAGTT'))
print(pattern.findall('TGTGTGTATAT'))

Output

['TTT', 'GGGCGT']
[]
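
To get the exact output format shown in the question (one gene per line, or a message when nothing matches), the pattern can be wrapped in a small helper. This is only a minimal sketch: the names GENE_PATTERN and format_genes are made up for illustration, and it assumes the input is already upper-case.

```python
import re

# Same pattern as above; the capturing group is the gene body.
GENE_PATTERN = re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')

def format_genes(genome):
    """Return the genes one per line, or a message if none were found."""
    genes = GENE_PATTERN.findall(genome)
    if not genes:
        return "No Genes Were Found"
    return "\n".join(genes)

# Sample genomes taken from the question.
print(format_genes('TTATGTTTTAAGGATGGGGCGTTAGTT'))
print("-----------------")
print(format_genes('TGTGTGTATAT'))
```

In an interactive program, the hard-coded strings would be replaced with input("Enter a genome string: ").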

Explanation extracted from https://regex101.com/r/yI4tN9/3

"ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)"g
    ATG matches the characters ATG literally (case sensitive)
    1st Capturing group ((?:[ACTG]{3})+?)
        (?:[ACTG]{3})+? Non-capturing group
            Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
            [ACTG]{3} match a single character present in the list below
                Quantifier: {3} Exactly 3 times
                ACTG a single character in the list ACTG literally (case sensitive)
    (?:TAG|TAA|TGA) Non-capturing group
        1st Alternative: TAG
            TAG matches the characters TAG literally (case sensitive)
        2nd Alternative: TAA
            TAA matches the characters TAA literally (case sensitive)
        3rd Alternative: TGA
            TGA matches the characters TGA literally (case sensitive)
    g modifier: global. All matches (don't return on first match)
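
One caveat: the pattern on its own does not check that the captured text is free of ATG, and its first captured codon is never tested against the stop codons (for 'ATGTAGTAA' it returns ['TAG']). If that matters, a plain codon-by-codon scan is an alternative. The sketch below is a different technique from the regular expression, not a rewrite of it, and it assumes the "does not contain" rule applies to in-frame codons only. The name find_genes_by_scanning is hypothetical.

```python
START_CODON = "ATG"
STOP_CODONS = {"TAG", "TAA", "TGA"}

def find_genes_by_scanning(genome):
    """Collect genes by reading codons in frame after each ATG.

    Assumption: only in-frame triplets count as 'contained' in a gene.
    """
    genes = []
    search_from = 0
    while True:
        start = genome.find(START_CODON, search_from)
        if start == -1:
            break
        pos = start + 3
        codons = []
        found_stop = False
        # Read whole codons until a stop codon, another ATG, or the end.
        while pos + 3 <= len(genome):
            codon = genome[pos:pos + 3]
            if codon in STOP_CODONS:
                found_stop = True
                break
            if codon == START_CODON:
                break  # a gene may not contain ATG; abandon this candidate
            codons.append(codon)
            pos += 3
        if found_stop and codons:
            genes.append("".join(codons))
            search_from = pos + 3      # continue after the stop codon
        else:
            search_from = start + 3    # retry from the next possible ATG
    return genes

print(find_genes_by_scanning('TTATGTTTTAAGGATGGGGCGTTAGTT'))
print(find_genes_by_scanning('TGTGTGTATAT'))
```

On the two sample genomes from the question this gives the same genes as the regular expression; the results differ only for inputs with an in-frame ATG or a stop codon immediately after the start.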
mhawke answered Nov 25 '25 07:11
