
Who is Scott? - ValueError in Seaborn pairplot: Could not convert string to float: 'scott'

Problem

I get the following error when trying to add the Education attribute from the Loan Prediction dataset to a pairplot using seaborn:

ValueError                                Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
    450     try:
--> 451         bw = float(bw)
    452     except:

ValueError: could not convert string to float: 'scott'

I have looked through the raw data, but I could not find 'scott' anywhere, so my question is where does this come from and how can I fix it?

I also get a runtime error: "RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density." I'm not sure whether this is caused by the first error or is a separate issue altogether. If anybody could shed any light on this, I would be grateful.

Dataset

I am using the Loan Prediction Dataset found here. The attributes are as follows:

    Loan_ID     Gender  Married     Dependents  Education     Self_Employed     ApplicantIncome     CoapplicantIncome   LoanAmount  Loan_Amount_Term    Credit_History  Property_Area   Loan_Status
0   LP001002    Male    No          0           Graduate      No                5849                0.0                 NaN         360.0               1.0             Urban           Y
1   LP001003    Male    Yes         1           Graduate      No                4583                1508.0              128.0       360.0               1.0             Rural           N
2   LP001005    Male    Yes         0           Graduate      Yes               3000                0.0                 66.0        360.0               1.0             Urban           Y
3   LP001006    Male    Yes         0           Not Graduate  No                2583                2358.0              120.0       360.0               1.0             Urban           Y
4   LP001008    Male    No          0           Graduate      No                6000                0.0                 141.0       360.0               1.0             Urban           Y

Code

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# I'm using an IPython notebook
%matplotlib inline

train_data = pd.read_csv("train_ctrUa4K.csv")

bad_credit = train_data[train_data["Credit_History"] == 0]
bad_credit["Education"] = bad_credit["Education"].map({"Graduate":1,"Not Graduate":0})
sns.pairplot(bad_credit,vars=["ApplicantIncome","Education","LoanAmount"],hue="Loan_Status")

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
    450     try:
--> 451         bw = float(bw)
    452     except:

ValueError: could not convert string to float: 'scott'

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
<ipython-input-25-0cd48ab0d803> in <module>
      2 bad_credit = train_data[train_data["Credit_History"] == 0]
      3 bad_credit["Education"] = bad_credit["Education"].map({"Graduate":1,"Not Graduate":0})
----> 4 sns.pairplot(bad_credit,vars=["ApplicantIncome","Education","LoanAmount"],hue="Loan_Status")

~/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)
   2119             diag_kws.setdefault("shade", True)
   2120             diag_kws["legend"] = False
-> 2121             grid.map_diag(kdeplot, **diag_kws)
   2122 
   2123     # Maybe plot on the off-diagonals

~/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py in map_diag(self, func, **kwargs)
   1488                     data_k = utils.remove_na(data_k)
   1489 
-> 1490                 func(data_k, label=label_k, color=color, **kwargs)
   1491 
   1492             self._clean_axis(ax)

~/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py in kdeplot(data, data2, shade, vertical, kernel, bw, gridsize, cut, clip, legend, cumulative, shade_lowest, cbar, cbar_ax, cbar_kws, ax, **kwargs)
    703         ax = _univariate_kdeplot(data, shade, vertical, kernel, bw,
    704                                  gridsize, cut, clip, legend, ax,
--> 705                                  cumulative=cumulative, **kwargs)
    706 
    707     return ax

~/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py in _univariate_kdeplot(data, shade, vertical, kernel, bw, gridsize, cut, clip, legend, ax, cumulative, **kwargs)
    293         x, y = _statsmodels_univariate_kde(data, kernel, bw,
    294                                            gridsize, cut, clip,
--> 295                                            cumulative=cumulative)
    296     else:
    297         # Fall back to scipy if missing statsmodels

~/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py in _statsmodels_univariate_kde(data, kernel, bw, gridsize, cut, clip, cumulative)
    365     fft = kernel == "gau"
    366     kde = smnp.KDEUnivariate(data)
--> 367     kde.fit(kernel, bw, fft, gridsize=gridsize, cut=cut, clip=clip)
    368     if cumulative:
    369         grid, y = kde.support, kde.cdf

~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py in fit(self, kernel, bw, fft, weights, gridsize, adjust, cut, clip)
    138             density, grid, bw = kdensityfft(endog, kernel=kernel, bw=bw,
    139                     adjust=adjust, weights=weights, gridsize=gridsize,
--> 140                     clip=clip, cut=cut)
    141         else:
    142             density, grid, bw = kdensity(endog, kernel=kernel, bw=bw,

~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
    451         bw = float(bw)
    452     except:
--> 453         bw = bandwidths.select_bandwidth(X, bw, kern) # will cross-val fit this pattern?
    454     bw *= adjust
    455 

~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/bandwidths.py in select_bandwidth(x, bw, kernel)
    172         # eventually this can fall back on another selection criterion.
    173         err = "Selected KDE bandwidth is 0. Cannot estiamte density."
--> 174         raise RuntimeError(err)
    175     else:
    176         return bandwidth

RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density.


asked Oct 15 '25 04:10 by 10778403


1 Answer

"scott" is the name of a method for choosing the bandwidth when computing a kernel density estimate (KDE). It is named after D. W. Scott (1).

I cannot look at your data, but my guess is that something is odd about one of the variables at a certain hue level, which prevents the bandwidth from being calculated properly.
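To make that guess concrete, here is a minimal sketch of how a rule-of-thumb bandwidth can come out as exactly 0. It assumes the rule has the form factor * A * n^(-1/5), with A = min(standard deviation, IQR / 1.349); that is my understanding of a Scott-style normal reference rule, not a copy of the library's code. The helper names and the sample columns are hypothetical.

```python
import statistics


def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a list of numbers."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def rule_of_thumb_bandwidth(values, factor=1.059):
    """Assumed Scott-style rule: factor * A * n**(-1/5), A = min(std, IQR/1.349)."""
    n = len(values)
    std = statistics.stdev(values)
    iqr = percentile(values, 75) - percentile(values, 25)
    spread = min(std, iqr / 1.349)
    return factor * spread * n ** (-0.2)


# Hypothetical 0/1 column in which most rows share the same value,
# as could happen with a mapped Education column inside one hue level.
mostly_ones = [1] * 9 + [0]
# Hypothetical continuous column for comparison.
continuous = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

print(rule_of_thumb_bandwidth(mostly_ones))  # IQR is 0, so the bandwidth is 0
print(rule_of_thumb_bandwidth(continuous))   # positive bandwidth
```

If that assumption holds, a 0/1 column with a heavily dominant value has an interquartile range of 0, so any rule that only rescales this spread by a constant would also yield 0 for that group.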

You could use diag_kws to pass arguments to sns.kdeplot(), which pairplot uses to plot the univariate distributions on the diagonal.

For example:

sns.pairplot(..., diag_kws={'bw':'silverman'})

This would force sns.kdeplot() to use the "silverman" method to choose the bandwidth, which might work better than the Scott method in your case.

(1) D. W. Scott, "Multivariate Density Estimation: Theory, Practice, and Visualization", John Wiley & Sons, New York, Chichester, 1992.

EDIT

To try to pinpoint the culprit, you would have to use PairGrid instead of pairplot(). PairGrid allows you to use a custom function to plot the diagonal. If you include a print statement in that function, you can see the data that would be passed to sns.kdeplot(). Execution should stop at the point where the data is "incorrect", and you might then be able to figure out what to do about it.

For example:

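import seaborn as sns

def test_func(*data, **kwargs):
    # Show exactly what PairGrid hands to the diagonal plotting function
    print("data received:", data)
    print("hue name + other params:", kwargs)
    # Then plot as pairplot would; a failure here points at the data just printed
    sns.kdeplot(*data, **kwargs)

iris = sns.load_dataset('iris')
g = sns.PairGrid(iris, hue="species")
g = g.map_diag(test_func)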

For each variable (column), and for each level of the hue variable, you get output that will look like this:

data received: (array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9, 5.4, 4.8, 4.8,
       4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5. ,
       5. , 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5. , 5.5, 4.9, 4.4,
       5.1, 5. , 4.5, 4.4, 5. , 5.1, 4.8, 5.1, 4.6, 5.3, 5. ]),)
hue name + other params: {'label': 'setosa', 'color': (0.12156862745098039, 0.4666666666666667, 0.7058823529411765)}
data received: (array([7. , 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5. , 5.9, 6. ,
       6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6,
       6.8, 6.7, 6. , 5.7, 5.5, 5.5, 5.8, 6. , 5.4, 6. , 6.7, 6.3, 5.6,
       5.5, 5.5, 6.1, 5.8, 5. , 5.6, 5.7, 5.7, 6.2, 5.1, 5.7]),)
hue name + other params: {'label': 'versicolor', 'color': (1.0, 0.4980392156862745, 0.054901960784313725)}
(...)
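Rather than reading the printed arrays by eye, the custom function could summarize each group before plotting. The following is a minimal sketch, assuming the failure comes from a group whose values have zero spread (zero interquartile range or zero standard deviation). summarize_group is a hypothetical helper, not part of seaborn, and the sample values are made up.

```python
import statistics


def summarize_group(values, label=None):
    """Summarize one column/hue-level group before it reaches the KDE.

    Needs at least two values, because quartiles and the sample standard
    deviation are undefined for a single point.
    """
    values = [float(v) for v in values]
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "label": label,
        "n": len(values),
        "n_unique": len(set(values)),
        "iqr": iqr,
        # Assumption: zero spread is what drives the bandwidth to 0
        "zero_spread": iqr == 0 or statistics.stdev(values) == 0,
    }


# Hypothetical groups: a 0/1 column dominated by one value, and a continuous one
print(summarize_group([1, 1, 1, 0, 1, 1, 1, 1], label="N"))
print(summarize_group([5.1, 4.9, 4.7, 4.6, 5.0, 5.4], label="setosa"))
```

Inside a function like test_func, this could be called as summarize_group(data[0], kwargs.get("label")) just before sns.kdeplot(), so that the last summary printed before the exception identifies the group to inspect.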
answered Oct 17 '25 19:10 by Diziet Asahi


