
How to read a few lines in a large CSV file with pandas?

Tags: python, pandas, csv

I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.

I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file

In pandas, I am trying to use skiprows to select only the rows that I need.

# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)

rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)

I would expect this code to return a small dataframe pretty fast. Instead, it starts consuming memory over several minutes until the system becomes unresponsive. My guess is that it reads the whole file first and only discards the rows in rows2skip afterwards.

Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?

Asked by ontheway on Jan 26 '26 08:01

1 Answer

Try reading the file in chunks:

train = pd.read_csv('file.csv', iterator=True, chunksize=150000)
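A chunked reader keeps memory bounded: each iteration yields one DataFrame of at most `chunksize` rows, and the chunk index continues the file's global row numbering, so you can filter each chunk against the wanted positions and concatenate the survivors. A minimal sketch, using an in-memory CSV as a stand-in for the large `train.csv` and a hypothetical `rows2keep` like the one in the question:

```python
import io

import pandas as pd

# Hypothetical 0-based data-row positions to keep (stands in for the
# question's rows2keep array).
rows2keep = {3, 7, 11}

# Small in-memory CSV standing in for the huge train.csv.
csv_data = io.StringIO("value\n" + "\n".join(str(i * 10) for i in range(20)))

# Each chunk's index continues the global row numbering (0-4, 5-9, ...),
# so .isin() selects exactly the wanted global rows.
pieces = [
    chunk[chunk.index.isin(rows2keep)]
    for chunk in pd.read_csv(csv_data, chunksize=5)
]
signal = pd.concat(pieces)
```

Peak memory is one chunk plus the (tiny) kept rows, regardless of file size.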

If you only want to read the first n rows:

train = pd.read_csv(..., nrows=n)

If you only want to read rows n to n+100 (note that nrows is a count, and a bare skiprows=n would also skip the header line; passing a range starting at 1 keeps the header):

train = pd.read_csv(..., skiprows=range(1, n + 1), nrows=100)
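For the original question (a handful of scattered rows), skiprows also accepts a callable, which avoids building a ~600M-element generator entirely: pandas calls it once per line number and skips the line when it returns True. Membership tests should go through a set for O(1) lookups. A sketch, again with an in-memory CSV and a hypothetical rows2keep:

```python
import io

import pandas as pd

# Hypothetical 0-based data-row positions to keep.
rows2keep = {3, 7, 11}

# skiprows numbers physical lines, so line 0 is the header; keep it,
# and shift each data-row position down by one line.
keep_lines = {0} | {r + 1 for r in rows2keep}

csv_data = io.StringIO("value\n" + "\n".join(str(i * 10) for i in range(20)))

# The callable gets each line number; returning True skips that line.
signal = pd.read_csv(csv_data, skiprows=lambda i: i not in keep_lines)
```

This reads the file once, top to bottom, and never holds more than the kept rows in memory.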
Answered by Hrithik Puri on Jan 27 '26 22:01
