
ipython pandas TypeError: read_csv() got an unexpected keyword argument 'delim-whitespace'

While working through the ipython.org notebook "INTRODUCTION TO PYTHON FOR DATA MINING", the following code:

data = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data-original",
               delim_whitespace = True, header=None,
               names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration',
                        'model', 'origin', 'car_name'])

yields the following error:

 TypeError: read_csv() got an unexpected keyword argument 'delim-whitespace'

Unfortunately the dataset file itself is not really CSV, and I don't know why they used read_csv() to load it.

A line of the data looks like this:

 14.0   8.   454.0      220.0      4354.       9.0   70.  1.    "chevrolet impala"

The environment is Python 2.7 on Debian stable with ipython 0.13. After searching here, I realize it's most likely a version problem, as the argument 'delim-whitespace' may only exist in a later version of the pandas library than the one available through the APT package manager.

I tried several workarounds, without success.

  • First, I tried to upgrade pandas by building from the latest source, but I found I would end up with a cascade of other dependency builds whose versions need upgrading, which could end up breaking the environment. E.g., I had to install Cython, then it reported that the version in the APT package manager was again too old, so I would have to rebuild Cython plus other libs/modules, and so on.

  • Then after looking at the API a bit, I tried using other arguments: using delimiter = ' ' in the call to read_csv() caused it to break up the strings inside quotes into several columns,

    ValueError: Expecting 9 columns, got 13 in row 0
    
  • I tried using the read_csv() argument quotechar='"' , as documented in the API but again it was not recognized (unexpected keyword argument)

  • Finally I tried using a different way to load the file,

    data = DataFrame()
    
    data.from_csv(url)
    

    I got,

    Out[18]: 
    <class 'pandas.core.frame.DataFrame'>
    Index: 405 entries, 15.0   8.   350.0      165.0      3693.      11.5   70.  1."buick skylark 320" to 31.0   4.   119.0      82.00      2720.      19.4   82.  1.   "chevy s-10"
    Empty DataFrame
    
    In [19]: print(data.shape)
    (0, 9)
    
  • alternatively, w/ sep argument to from_csv(),

    In [20]: data.from_csv(url,sep=' ')
    

    yields the error,

    ValueError: Expecting 31 columns, got 35 in row 1
    In [21]: print(data.shape)
    (0, 9)
    
  • Also alternatively, with the same negative result:

    In [32]: data = DataFrame( columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration','model', 'origin', 'car_name'])
    
    In [33]: data.from_csv(url,sep=', \t')
    Out[33]: 
    <class 'pandas.core.frame.DataFrame'>
    Index: 405 entries, 15.0   8.   350.0      165.0      3693.      11.5   70.  1."buick skylark 320" to 31.0   4.   119.0      82.00      2720.      19.4   82.  1.   "chevy s-10"
    Empty DataFrame
    
    In [34]: data.head()
    Out[34]: 
    Empty DataFrame
    
  • I tried using ipython3 instead, but it cannot find/load matplotlib, as there is no matplotlib package for python3 on my system.

Any help with this problem would be greatly appreciated.

importError asked Oct 19 '25 06:10


2 Answers

Oddly, the delim_whitespace parameter appears in the Pandas documentation in the method summary but not the parameters list. Try replacing it with delimiter = r'\s+', which is equivalent to what I assume the authors meant.

CSV does refer to comma-separated values, but it's often used to refer to general delimited-text formats. TSV (tab-separated values) is another variant; in this case it's basically whitespace-separated values.
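As a minimal sketch of the suggestion above, the following parses two sample rows copied from the question with a regular-expression separator. It assumes a reasonably recent pandas, where `sep` and `delimiter` are interchangeable names for the same read_csv() argument; the in-memory `sample` string stands in for the real file and is only illustrative.

```python
from io import StringIO

import pandas as pd

# Two rows in the same whitespace-separated layout as the question's data.
# The quoted car name contains a space, which must stay in one column.
sample = (
    '14.0   8.   454.0      220.0      4354.       9.0   70.  1.    "chevrolet impala"\n'
    '31.0   4.   119.0      82.00      2720.      19.4   82.  1.   "chevy s-10"\n'
)

columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
           'acceleration', 'model', 'origin', 'car_name']

# r'\s+' treats any run of spaces or tabs as a single separator.
df = pd.read_csv(StringIO(sample), sep=r'\s+', header=None, names=columns)

print(df.shape)
print(df['car_name'].tolist())
```

If the quoted names end up split across columns, the installed pandas version is probably handling the separator differently, and that is worth checking before anything else.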

Steve Howard answered Oct 20 '25 21:10


Your code uses delim_whitespace but the error message says delim-whitespace. The former exists, the latter does not.

If the data file contains

 14.0   8.   454.0      220.0      4354.       9.0   70.  1.    "chevrolet impala"

and you define data with

data = pd.read_csv('data', delim_whitespace = True, header=None, names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model', 'origin', 'car_name'])

then the DataFrame does get parsed successfully:

   mpg  cylinders  displacement  horsepower  weight  acceleration  model  \
0   14          8           454         220    4354             9     70   

   origin          car_name  
0       1  chevrolet impala  

So you just have to change the hyphen to an underscore.
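To see why the spelling matters, here is a small sketch that reproduces the TypeError. A hyphen cannot be typed in a literal keyword argument, so the example passes the hyphenated name through a dict, which is an assumption made only to trigger the error; how the original notebook produced it is not known. The `sample` string is illustrative.

```python
from io import StringIO

import pandas as pd

sample = '14.0   8.   454.0      220.0      4354.       9.0   70.  1.    "chevrolet impala"\n'

# Pass the misspelled (hyphenated) name via ** unpacking; read_csv() has no
# parameter with that name, so the call is rejected before any parsing happens.
try:
    pd.read_csv(StringIO(sample), header=None, **{'delim-whitespace': True})
except TypeError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)
```

The message names the exact keyword that was rejected, so comparing it character by character with the code is a quick way to spot this kind of typo.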


Note that when you specify delim_whitespace=True, the pure Python parser is used. In this case I don't think that is necessary. Using delimiter=r'\s+' as Steve Howard suggests would probably perform better. (The source code says, "The C engine is faster while the python engine is currently more feature-complete", but I think the only feature that the python engine has that the C engine does not is skipfooter.)

unutbu answered Oct 20 '25 20:10


