
Count the number of newlines outside double quotes in a 30GB+ CSV file in Python

I have a big CSV file (30GB+) in which I am counting the number of newlines by parsing the file block by block, using the function below:

def yieldblocks(file, size=1024*1024):
    # Read the file in fixed-size blocks (1 MB by default) until EOF.
    while True:
        blocks = file.read(size)
        if not blocks:
            break
        yield blocks

and calling it like this:

sum(bl.count("\n") for bl in yieldblocks(txtfile))

With this I am able to count the newlines in slightly under an hour (I am surprised that is the best I could get, too).

My problem is that I need to skip the newlines that fall within double quotes, since some columns have multiline content.

I tried the following, but it does not seem to work and the process exits without producing a result:

sum(.5 if re.search('^[^"]*"(?=[^"]*(?:"[^"]*")*[^"]*$).*$', bl) else 1 for bl in yieldblocks(txtfile))

The regular expression is meant to find an odd number of double-quote characters in a line, and it works on a small file.
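For reference, one likely issue with matching per block is that each block is an arbitrary 1 MB slice rather than a line, so the ^/$ anchors and the quote parity do not line up with actual rows, and the quote state is not carried from one block to the next. Below is a minimal sketch of carrying that state across blocks instead (the function name and block size are placeholders, and it assumes quotes inside fields are escaped by doubling them, as in RFC 4180):

def count_unquoted_newlines(path, size=1024*1024):
    # A sketch, not tested on a 30GB file: count newlines that fall outside
    # double quotes by tracking quote parity across block boundaries.
    newlines = 0
    in_quotes = False
    with open(path) as f:
        while True:
            block = f.read(size)
            if not block:
                break
            # Splitting on '"' gives segments that alternate between
            # outside-quotes and inside-quotes relative to the current state.
            parts = block.split('"')
            for part in parts:
                if not in_quotes:
                    newlines += part.count('\n')
                in_quotes = not in_quotes
            # The last segment is not followed by a quote, so undo one toggle.
            in_quotes = not in_quotes
    return newlines

This keeps the block-by-block reading pattern above while ignoring newlines inside quoted fields.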

I am on a machine with 2 GB of RAM and a 32-bit OS.

Note: I tried the csv module, but it is slower than counting by blocks, and I was hoping I could somehow get this approach to work.
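For comparison, the csv-module version being measured against would look roughly like this (the file name is a placeholder); it handles quoted newlines correctly, but it parses every field, which is presumably where the extra time goes:

import csv

# Count logical rows; csv.reader keeps quoted newlines inside a single row.
with open('file.csv', newline='') as f:  # on Python 2, use open('file.csv', 'rb')
    row_count = sum(1 for _ in csv.reader(f))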

Asked by QVSJ on Dec 03 '25


1 Answer

This may work well for you. pandas.read_csv is generally very fast, but I haven't tried it chunk-wise.

import pandas as pd

# Read the CSV in chunks of 10,000 rows; the parser respects quoted newlines,
# so len(chunk) counts logical rows rather than raw newline characters.
reader = pd.read_csv('file.csv', sep=',', chunksize=10000, low_memory=True)
line_count = sum(len(chunk) for chunk in reader)

There is a little more info (not much) in the documentation.
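If the chunked parse is still too slow, one thing that may help (an untested suggestion) is restricting the parse to a single column with the standard usecols parameter; row boundaries are still determined by the CSV quoting rules, so the count should be unchanged:

import pandas as pd

# Parse only the first column; rows are still split on unquoted newlines,
# so summing the chunk lengths still gives the logical row count.
reader = pd.read_csv('file.csv', usecols=[0], chunksize=100000)
line_count = sum(len(chunk) for chunk in reader)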

Answered by Matthias Fripp on Dec 04 '25