Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Get Line pointers of all lines in a large file

Tags:

python

I have a large file (6-60 GB) that I cannot load entirely into memory. I can read it line by line:

with open(...) as f:
    for line in f:
        # Do something with 'line'

But sometimes when I've read say the nth line, I need information from the n+2th line as well. How do I read this n+2th line while my 'line' object is pointing to the nth line? I will still need to process the later line normally.

I'm worried if I use f.readlines(10), since I don't know the size of my look-ahead (it could be 99).

I thought of one way, which would be having line pointers for each line with seek and tell in a list, but I'm worried about storage space again.

I'm looking for speed over all else in reading this file. Any suggestions?

like image 346
azazelspeaks Avatar asked Dec 05 '25 22:12

azazelspeaks


2 Answers

Your idea of having line references is good, and is quite efficient on storage: an integer refers to each line. However, jumping around the file is not particularly time-effective.

Instead, I recommend that you have a read-ahead buffer. If you're on line n and need data from line n+2, then read ahead those two lines and keep them in memory. Finish processing line n. When you're ready for the next line of input, you already have it in your buffer.

The read priority is (a) buffer; (b) fetch next line from memory.

Is that clear enough to let you progress?

like image 110
Prune Avatar answered Dec 08 '25 11:12

Prune


as a corollary to Prune's answer a queue could do well to implement either a read-ahead buffer or a look behind, heck two queues both kept at a constant length could provide a nice middle view of what's around the current line.

Basically read the line, push on queue. when stack reaches a certain size pop from queue and process, then push on a second queue.

when that second queue reach a given size just pop form it and forget that line.

whenever you need to look around the current line you can look ahead or look behind just by accessing values in either queues.

All this can actually be done with a simple list actually https://docs.python.org/2/tutorial/datastructures.html

to use a list as a read-ahead queue _l.insert(0, line) to insert a line _l.pop() to remove the line and process it

or

_l.append(line) to insert a line _l.pop(0) to remove the line and process it

do make sure the list _l reach the desired size before calling pop now that works only if the lines you need to keep in memory are close to the line you are trying to process.

like image 42
Newtopian Avatar answered Dec 08 '25 12:12

Newtopian