
How to read a few lines in a large CSV file with pandas?

Tags: python, pandas, csv

I have a CSV file that doesn't fit into my system's memory. Using Pandas, I want to read a small number of rows scattered all over the file.

I think that I can accomplish this without pandas following the steps here: How to read specific lines of a large csv file

In pandas, I am trying to use skiprows to select only the rows that I need.

# FILESIZE is the number of lines in the CSV file (~600M)
# rows2keep is an np.array with the line numbers that I want to read (~20)

rows2skip = (row for row in range(0,FILESIZE) if row not in rows2keep)
signal = pd.read_csv('train.csv', skiprows=rows2skip)

I would expect this code to return a small dataframe pretty fast. Instead, it starts consuming memory over several minutes until the system becomes unresponsive. My guess is that it reads the whole file first and only discards the rows in rows2skip afterwards.

Why is this implementation so inefficient? How can I efficiently create a dataframe with only the lines specified in rows2keep?

Asked by ontheway on Jan 26 '26 08:01

1 Answer

Try reading the file in chunks:

train = pd.read_csv('file.csv', iterator=True, chunksize=150000)
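A chunked reader keeps memory bounded: each iteration yields one DataFrame of at most `chunksize` rows, and the chunk index continues the file's global row numbering, so you can filter each chunk against the wanted positions and concatenate the survivors. A minimal sketch, using an in-memory CSV as a stand-in for the large `train.csv` and a hypothetical `rows2keep` like the one in the question:

```python
import io

import pandas as pd

# Hypothetical 0-based data-row positions to keep (stands in for the
# question's rows2keep array).
rows2keep = {3, 7, 11}

# Small in-memory CSV standing in for the huge train.csv.
csv_data = io.StringIO("value\n" + "\n".join(str(i * 10) for i in range(20)))

# Each chunk's index continues the global row numbering (0-4, 5-9, ...),
# so .isin() selects exactly the wanted global rows.
pieces = [
    chunk[chunk.index.isin(rows2keep)]
    for chunk in pd.read_csv(csv_data, chunksize=5)
]
signal = pd.concat(pieces)
```

Peak memory is one chunk plus the (tiny) kept rows, regardless of file size.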

If you only want to read the first n rows:

train = pd.read_csv(..., nrows=n)

If you only want to read rows n to n+100 (note that nrows is a count, and a bare skiprows=n would also skip the header line; passing a range starting at 1 keeps the header):

train = pd.read_csv(..., skiprows=range(1, n + 1), nrows=100)
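For the original question (a handful of scattered rows), skiprows also accepts a callable, which avoids building a ~600M-element generator entirely: pandas calls it once per line number and skips the line when it returns True. Membership tests should go through a set for O(1) lookups. A sketch, again with an in-memory CSV and a hypothetical rows2keep:

```python
import io

import pandas as pd

# Hypothetical 0-based data-row positions to keep.
rows2keep = {3, 7, 11}

# skiprows numbers physical lines, so line 0 is the header; keep it,
# and shift each data-row position down by one line.
keep_lines = {0} | {r + 1 for r in rows2keep}

csv_data = io.StringIO("value\n" + "\n".join(str(i * 10) for i in range(20)))

# The callable gets each line number; returning True skips that line.
signal = pd.read_csv(csv_data, skiprows=lambda i: i not in keep_lines)
```

This reads the file once, top to bottom, and never holds more than the kept rows in memory.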
Answered by Hrithik Puri on Jan 27 '26 22:01
