Read lines from HUGE text files in groups of 4

I have been facing a problem with Python for a few days. I am a bioinformatician with no real programming background, and I am working with huge text files (approx. 25 GB) that I have to process.

I have to read the text file line by line, in groups of 4 lines at a time: the first 4 lines have to be read and processed, then the next group of 4 lines, and so on.

Obviously I cannot use readlines() because it would overload my memory, and I need each of the 4 lines for some string matching.

I thought about using a for loop with range():

openfile = open(path, 'r')

for elem in range(0, len(openfile), 4):
    line1 = openfile.readline()
    line2 = openfile.readline()
    line3 = openfile.readline()
    line4 = openfile.readline()
    # (process lines...)

Unfortunately this is not possible, because a file opened in read mode has no length and cannot be indexed like a list or a dictionary.

Can anybody please help me loop over this properly?

Thanks in advance

asked Dec 04 '25 by WarioBrega

2 Answers

This has low memory overhead. It counts on the fact that a file is an iterator that reads by line.

def grouped(iterator, size):
    # zip the iterator with itself `size` times, so each tuple holds `size` consecutive lines
    return zip(*[iterator] * size)

Use it like this:

for line1, line2, line3, line4 in grouped(your_open_file, size=4):
    do_stuff_with_lines()

Note: this code assumes that the file does not end with a partial group; if it does, the trailing lines that don't fill a complete group are silently dropped.
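If your files might end with an incomplete record, a variant based on itertools.islice could handle that (a sketch, not part of the original answer; grouped_with_tail is a hypothetical name and "reads.txt" is a placeholder path):

from itertools import islice

def grouped_with_tail(iterator, size=4):
    # Yield tuples of up to `size` lines; a final shorter tuple is yielded as-is.
    while True:
        group = tuple(islice(iterator, size))
        if not group:
            return  # iterator exhausted
        yield group

with open("reads.txt") as handle:
    for group in grouped_with_tail(handle, 4):
        if len(group) < 4:
            print("partial record at end of file:", group)
            continue
        line1, line2, line3, line4 = group
        # ... process the four lines here ...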

answered Dec 07 '25 by Steven Rumbalski


You're reading a FASTQ file, right? You're most probably reinventing the wheel: you could just use Biopython, which has tools for dealing with common biology file formats. For instance, see this tutorial on working with FASTQ files; it looks basically like this:

from Bio import SeqIO
for record in SeqIO.parse("SRR020192.fastq", "fastq"):
    pass  # do something with record, using record.seq, record.id, etc.

More on biopython SeqRecord objects here.
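To make that concrete, here is a small sketch (not part of the original answer) of the SeqRecord attributes most relevant for FASTQ data, using the same example file as above:

from Bio import SeqIO

for record in SeqIO.parse("SRR020192.fastq", "fastq"):
    print(record.id)                                   # read identifier from the title line
    print(record.seq)                                  # the sequence as a Seq object
    print(record.letter_annotations["phred_quality"])  # per-base Phred quality scores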

Here is another Biopython FASTQ-processing tutorial, including a variant that does this faster using a lower-level parser, like this:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

with open("untrimmed.fastq") as handle:
    for title, seq, qual in FastqGeneralIterator(handle):
        pass  # do things with the title, seq, and qual strings

There's also the HTSeq package, with more deep-sequencing-specific tools, which I actually use more often.
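For reference, a minimal sketch of looping over a FASTQ file with HTSeq might look like this (assuming HTSeq's FastqReader interface; the filename is a placeholder):

import HTSeq

for read in HTSeq.FastqReader("untrimmed.fastq"):
    # each `read` is one FASTQ record, i.e. the usual group of 4 lines
    print(read.name, read.seq)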

By the way, if you don't know about Biostar already, you could take a look - it's a StackExchange-format site specifically for bioinformatics.

answered Dec 07 '25 by weronika


