Is the list performing badly in Python?

Tags:

python

I was trying to read data from a huge file and write it back, but I realised that the main cost comes from assigning data to a list rather than from reading or writing the data to/from the file....

    import csv
    import datetime

    csvReader = csv.reader(open('data.csv'))  # 'data.csv' stands in for the huge input file
    rows = [None] * 1446311
    begin = datetime.datetime.now()
    for i in range(1446311):
        row = csvReader.next()
        rows[i] = row
    print datetime.datetime.now() - begin

The code above takes 18 seconds, but only 5 seconds if I comment out the assignment (rows[i] = row). I built the list in advance (i.e. reserved the memory), so why is it still so slow? Is there anything I could do to make it faster? I tried [row for row in csvReader] but it performs worse...

regards, John

asked Sep 18 '25 by John

1 Answer

I get similar results, but not quite so dramatic as yours. (Note the use of the timeit module for timing code execution, and note that I've factored out the list creation since it's common to both test cases.)

import csv
from timeit import Timer

def write_csv(f, n):
    """Write n records to the file named f."""
    w = csv.writer(open(f, 'wb'))
    for i in xrange(n):
        w.writerow((i, "squared", "equals", i**2))

def test1(rows, f, n):
    """Read the CSV file f and store every row in the pre-allocated list."""
    for i, r in enumerate(csv.reader(open(f))):
        rows[i] = r

def test2(rows, f, n):
    """Read the CSV file f but throw each row away."""
    for i, r in enumerate(csv.reader(open(f))):
        pass

def test(t):
    """Time one run of test<t>, with a fresh pre-allocated list of N rows."""
    return (Timer('test%d(rows, F, N)' % t,
                  'from __main__ import test%d, F, N; rows = [None] * N' % t)
            .timeit(number=1))

>>> N = 1446311
>>> F = "test.csv"
>>> write_csv(F, N)
>>> test(1)
2.2321770191192627
>>> test(2)
1.7048690319061279

Here's my guess as to what is going on. In both tests, the CSV reader reads a record from the file and creates a data structure in memory representing that record.

In test2, where the record is not stored, the data structure is deleted more or less immediately: on the next iteration of the loop the loop variable r is rebound, so the reference count of the previous record drops to zero and its memory is reclaimed. This makes the memory used for the previous record available for reuse: it is already in the process's virtual memory tables, and probably still in the CPU cache, so reusing it is (relatively) fast.

In test1, where the record is stored, each record has to live in a freshly allocated region of memory, which may need to be obtained from the operating system and brought into the cache, so it's (relatively) slow.

So the time is not taken up by list assignment, but by memory allocation.
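
One rough way to see this reuse directly (a CPython-specific sketch, not behaviour the language guarantees; it uses the test.csv file generated above and the fact that in CPython id() is the object's address) is to count how many distinct addresses the row lists occupy when you throw rows away versus when you keep them:

def distinct_addresses(f, keep):
    """Count distinct id() values of the row lists read from f."""
    kept = []
    seen = set()
    for row in csv.reader(open(f)):
        seen.add(id(row))
        if keep:
            kept.append(row)   # keeping the row forces a fresh allocation next time
    return len(seen)

print distinct_addresses("test.csv", keep=False)  # rows discarded: addresses get recycled, typically a handful
print distinct_addresses("test.csv", keep=True)   # rows kept: one distinct address per row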


Here are another couple of tests that illustrate what's going on, without the complicating factor of the csv module. In test3 we create a new 100-element list for each row and store it. In test4 we create a new 100-element list for each row, but we don't store it; we throw it away so that the memory can be reused the next time round the loop.

def test3(rows, f, n):
    """Create a new 100-element list for each row and keep it."""
    for i in xrange(n):
        rows[i] = [i] * 100

def test4(rows, f, n):
    """Create a new 100-element list for each row but throw it away."""
    for i in xrange(n):
        temp = [i] * 100   # last reference is dropped on the next iteration
        rows[i] = None

>>> test(3)
9.2103338241577148
>>> test(4)
1.5666921138763428

So I think the lesson is that if you do not need to store all the rows in memory at the same time, don't do that! If you can, read them one at a time, process them one at a time, and then forget about them so that Python can deallocate them, as in the sketch below.
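
For example, if the goal is just to transform each row of a big CSV and write the result out again, something like this streams the rows through without ever holding them all in memory (a sketch in the same Python 2 style as the code above; process_row and the output filename are placeholders for whatever processing and destination you actually need):

import csv

def process_row(row):
    # Placeholder: replace with the real per-row processing.
    return row

def stream_csv(src, dst):
    """Read src one row at a time, process each row, and write it straight to dst."""
    reader = csv.reader(open(src, 'rb'))
    writer = csv.writer(open(dst, 'wb'))
    for row in reader:               # only the current row is alive at any time
        writer.writerow(process_row(row))

stream_csv("test.csv", "processed.csv")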

answered Sep 21 '25 by Gareth Rees