
Make pandas.read_csv() ignore junk at the start of the csv files?

Tags: python, pandas, csv

I've got some junk at the start of my csv file that prevents me from selecting the first column of my dataframe by name.

Example:

In[1]: df = pd.read_csv('file:inputdata.csv', usecols=[0], nrows=1)

In[2]: df
Out[2]:
        TAB
0  10-LV_Non

In[3]: df['TAB']
Out[3]: <snip> KeyError: 'TAB'

I found the junk by reading the file with open():

In[4]: with open('inputdata.csv', 'rb') as f:
           print(f.read(7))
Out[4]: b'\xef\xbb\xbfTAB,'

EDIT: '\xef\xbb\xbf' is three bytes of junk. 'TAB' is the name of the first column.

Is there a way to make pandas.read_csv() ignore junk like this (if present) at the start of the csv file?

NB The csv files are exported from a proprietary system, so I can't control their format.

UPDATE: Here's my solution, based on Mike Müller's answer:

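import re
import pandas as pd

# Binary mode, so that seeking back by exactly one byte is well defined
with open('inputdata.csv', 'rb') as f:
    # Skip past any leading bytes that aren't text (stop at end of file)
    byte = f.read(1)
    while byte and re.match(rb'[a-zA-Z0-9_]', byte) is None:
        byte = f.read(1)
    # Seek back one byte so the first header character is not lost
    if byte:
        f.seek(-1, 1)
    # Read the file
    df = pd.read_csv(f, usecols=['TAB'])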

Li-Wen Yip asked Dec 10 '25 00:12


2 Answers

It's unclear to me exactly what format the "junk" takes, but there are a couple of options.


pandas.read_csv takes a filepath_or_buffer

filepath_or_buffer : string or file handle / StringIO

It follows that if you open a File object, read past the junk, then pass the File object to read_csv, it should be OK.
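As a rough sketch of that idea, assuming the junk is always the three-byte UTF-8 byte order mark shown in the question, a small helper can advance a binary handle past it when present. The helper name skip_utf8_bom, the OTHER column, and the in-memory sample data are made up for illustration.

```python
import codecs
import io


def skip_utf8_bom(handle):
    """Advance a binary file handle past a leading UTF-8 BOM, if there is one."""
    start = handle.tell()
    if handle.read(len(codecs.BOM_UTF8)) != codecs.BOM_UTF8:
        handle.seek(start)  # no BOM found: rewind to where we started
    return handle


# In-memory stand-ins for the exported file, with and without the junk bytes.
with_bom = io.BytesIO(b'\xef\xbb\xbfTAB,OTHER\n10-LV_Non,1\n')
without_bom = io.BytesIO(b'TAB,OTHER\n10-LV_Non,1\n')

first_line_with = skip_utf8_bom(with_bom).readline()
first_line_without = skip_utf8_bom(without_bom).readline()
print(first_line_with)
print(first_line_without)
```

With a real file, the handle returned by skip_utf8_bom(open('inputdata.csv', 'rb')) could then be passed to pd.read_csv in place of a path.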


The skiprows argument skips rows:

skiprows : list-like or integer, default None

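Thus you can skip the junk's row(s), provided the junk sits on lines of its own. Junk that shares a line with the header, as in the question, cannot be skipped this way without losing the header too.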
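A minimal sketch of skiprows, assuming a hypothetical export whose junk fills two whole lines ahead of the real header. The sample text and the OTHER column are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical export: two junk lines, then the real header and one data row.
raw = "exported by some system\n----\nTAB,OTHER\n10-LV_Non,1\n"

# skiprows=2 drops the first two lines, so "TAB,OTHER" is parsed as the header.
df = pd.read_csv(io.StringIO(raw), skiprows=2)
print(df['TAB'].iloc[0])
```

skiprows also accepts a list of row indices, e.g. skiprows=[0, 1], when the junk lines are not contiguous.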

Ami Tavory answered Dec 12 '25 12:12


Something like this could work:

import pandas as pd

with open('inputdata.csv', 'rb') as f:
    # Rewind only if the file does not start with the three junk bytes
    if f.read(3) != b'\xef\xbb\xbf':
        f.seek(0)
    df = pd.read_csv(f, usecols=[0], nrows=1)

Just read the first three bytes. If they are good, i.e. not equal to the bytes you don't want, go back to the beginning of the file with seek(0); otherwise keep reading from position 3, which skips the offending bytes but leaves the TAB header in place.
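The three bytes b'\xef\xbb\xbf' are the UTF-8 byte order mark, so another option, assuming the rest of the file is valid UTF-8, might be to let the decoder drop them. Python's 'utf-8-sig' codec strips a leading BOM when one is present and otherwise behaves like plain UTF-8. The in-memory data and the OTHER column below are made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-ins for the exported file, with and without the junk bytes.
raw_with_bom = b'\xef\xbb\xbfTAB,OTHER\n10-LV_Non,1\n'
raw_plain = b'TAB,OTHER\n10-LV_Non,1\n'

# 'utf-8-sig' decodes as UTF-8 and discards a leading BOM if there is one.
df_bom = pd.read_csv(io.BytesIO(raw_with_bom), encoding='utf-8-sig', usecols=['TAB'])
df_plain = pd.read_csv(io.BytesIO(raw_plain), encoding='utf-8-sig', usecols=['TAB'])

print(df_bom['TAB'].iloc[0])
print(df_plain['TAB'].iloc[0])
```

The same encoding argument can be given together with a file path, e.g. pd.read_csv('inputdata.csv', encoding='utf-8-sig', usecols=['TAB']).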

Mike Müller answered Dec 12 '25 14:12