
Splicing through a line of a text file using Python

I am trying to create genetic signatures. I have a text file full of DNA sequences. I want to read in each line from the text file, then add its 4-mers (runs of 4 consecutive bases) to a dictionary. For example, given the sample sequence

ATGATATATCTATCAT

I want to add ATGA, TGAT, GATA, etc. to a dictionary, with IDs that increment by 1 as each new 4-mer is added.

So the dictionary will hold...

Genetic signature, ID
ATGA, 1
TGAT, 2
GATA, 3

Here is what I have so far...

import sys  

def main ():
    readingFile = open("signatures.txt", "r")
    my_DNA=""

    DNAseq = {} #creates dictionary 

    for char in readingFile:
        my_DNA = my_DNA+char

    for char in my_DNA:             
        index = 0
        DnaID=1
        seq = my_DNA[index:index+4]         

        if (DNAseq.has_key(seq)): #checks if the key is in the dictionary
            index= index +1
        else :
            DNAseq[seq] = DnaID
            index = index+1
            DnaID= DnaID+1

    readingFile.close()

if __name__ == '__main__':
    main()

Here is my output:

ACTC
ACTC
ACTC
ACTC
ACTC
ACTC

This output suggests that it is not iterating through each character in the string. Please help!

brooklynchick asked Apr 05 '13 02:04


2 Answers

You need to move the initialization of index and DnaID before the loop; otherwise both are reset on every iteration:

index = 0
DnaID=1
for char in my_DNA:             
    #... rest of loop here
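
For reference, here is a minimal sketch of the question's loop with only that change applied. The function name assign_ids is made up for illustration, and the membership test uses `in` instead of has_key() so that the sketch also runs on Python 3. The slicing is otherwise unchanged, so the short slices at the end of the string are still recorded.

```python
def assign_ids(my_DNA):
    """Map each distinct slice my_DNA[index:index+4] to an incrementing ID."""
    DNAseq = {}
    index = 0   # initialised once, before the loop
    DnaID = 1   # next ID to hand out
    for char in my_DNA:
        seq = my_DNA[index:index + 4]
        if seq not in DNAseq:   # `in` replaces has_key(), which Python 3 removed
            DNAseq[seq] = DnaID
            DnaID += 1
        index += 1              # advance the window on every iteration
    return DNAseq


ids = assign_ids("ATGATATATCTATCAT")
print(ids["ATGA"], ids["TGAT"], ids["GATA"])
```

Because index and DnaID now survive from one iteration to the next, each new slice gets the next free ID and a repeated slice keeps the ID it was given first.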

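Once you make that change you will have this output (each slice is shown next to the current value of DnaID; a repeated slice, such as the second ATAT, does not advance the ID):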

ATGA 1
TGAT 2
GATA 3
ATAT 4
TATA 5
ATAT 6
TATC 6
ATCT 7
TCTA 8
CTAT 9
TATC 10
ATCA 10
TCAT 11
CAT 12
AT 13
T 14

To avoid the last 3 items, which are shorter than 4 characters, you can change the loop bounds:

for i in range(len(my_DNA)-3):
    #... rest of loop here

This skips the last 3 starting positions, so only full-length 4-mers remain, making the output:

ATGA 1
TGAT 2
GATA 3
ATAT 4
TATA 5
ATAT 6
TATC 6
ATCT 7
TCTA 8
CTAT 9
TATC 10
ATCA 10
TCAT 11
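
Putting both changes together, a minimal sketch could look like the following. The names kmer_ids, ids and k are assumptions chosen for this illustration, not part of the original code, and the loop variable is used directly as the slice start instead of a separate index counter.

```python
def kmer_ids(sequence, k=4):
    """Assign an incrementing ID to each distinct k-mer, in order of first appearance."""
    ids = {}
    # Stop k-1 positions before the end so every slice is exactly k characters long.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer not in ids:
            ids[kmer] = len(ids) + 1   # next unused ID
    return ids


signature_ids = kmer_ids("ATGATATATCTATCAT")
# Print the 4-mers in ID order, one "4-mer,ID" pair per line.
for kmer, kmer_id in sorted(signature_ids.items(), key=lambda item: item[1]):
    print(kmer + "," + str(kmer_id))
```

Using len(ids) + 1 as the next ID removes the need for a separate DnaID counter, since the dictionary only grows when a new 4-mer is seen.
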

CraigTeegarden answered Nov 14 '22 23:11


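This slides a 4-character window over the contents of the file and stores every full-length 4-mer in a dictionary. Note that it counts how often each 4-mer occurs rather than assigning incrementing IDs, and that it reads the file as one string, newlines included.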

from collections import defaultdict

readingFile = open("signatures.txt", "r").read()
DNAseq      = defaultdict(int)
window      = 4

for i in xrange(len(readingFile)):
    current_4mer = readingFile[i:i+window]
    if len(current_4mer) == window:
        DNAseq[current_4mer] += 1

print DNAseq
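
If the goal is the incrementing IDs from the question, and each line of the file holds its own sequence, the same sliding window can be applied line by line. The sketch below is only one way to do that: the function name signatures_from_lines and the sample lines are invented for illustration, and it assumes that windows should not span a line break.

```python
def signatures_from_lines(lines, k=4):
    """Assign incrementing IDs to the distinct k-mers found in an iterable of lines."""
    ids = {}
    for line in lines:
        sequence = line.strip()   # drop the newline so no window spans two lines
        for i in range(len(sequence) - k + 1):
            # setdefault only stores the new ID when the k-mer has not been seen yet.
            ids.setdefault(sequence[i:i + k], len(ids) + 1)
    return ids


# Hypothetical sample data standing in for the lines of signatures.txt.
sample_lines = ["ATGATA\n", "GATATC\n"]
result = signatures_from_lines(sample_lines)
print(result)
```

An open file object can be passed in place of the list, for example inside `with open("signatures.txt") as handle:`, because iterating over a file yields its lines one at a time.
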

user2008141 answered Nov 14 '22 22:11