I get the following error when trying to add the Education attribute from the Loan Prediction dataset to a pairplot using seaborn:
ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
450 try:
--> 451 bw = float(bw)
452 except:
ValueError: could not convert string to float: 'scott'
I have looked through the raw data, but I could not find 'scott' anywhere, so my question is where does this come from and how can I fix it?
I also get a runtime error: "RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density." I'm not sure whether this is caused by the first error or is a separate issue altogether. If anybody could shed some light on this, I would be grateful.
I am using the Loan Prediction Dataset found here. The attributes are as follows:
    Loan_ID  Gender  Married  Dependents     Education  Self_Employed  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History  Property_Area  Loan_Status
0  LP001002    Male       No           0      Graduate             No             5849                0.0         NaN             360.0             1.0          Urban            Y
1  LP001003    Male      Yes           1      Graduate             No             4583             1508.0       128.0             360.0             1.0          Rural            N
2  LP001005    Male      Yes           0      Graduate            Yes             3000                0.0        66.0             360.0             1.0          Urban            Y
3  LP001006    Male      Yes           0  Not Graduate             No             2583             2358.0       120.0             360.0             1.0          Urban            Y
4  LP001008    Male       No           0      Graduate             No             6000                0.0       141.0             360.0             1.0          Urban            Y
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# I'm using an IPython notebook
%matplotlib inline
train_data = pd.read_csv("train_ctrUa4K.csv")
bad_credit = train_data[train_data["Credit_History"] == 0]
bad_credit["Education"] = bad_credit["Education"].map({"Graduate":1,"Not Graduate":0})
sns.pairplot(bad_credit,vars=["ApplicantIncome","Education","LoanAmount"],hue="Loan_Status")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
450 try:
--> 451 bw = float(bw)
452 except:
ValueError: could not convert string to float: 'scott'
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last)
<ipython-input-25-0cd48ab0d803> in <module>
2 bad_credit = train_data[train_data["Credit_History"] == 0]
3 bad_credit["Education"] = bad_credit["Education"].map({"Graduate":1,"Not Graduate":0})
----> 4 sns.pairplot(bad_credit,vars=["ApplicantIncome","Education","LoanAmount"],hue="Loan_Status")
~/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py in pairplot(data, hue, hue_order, palette, vars, x_vars, y_vars, kind, diag_kind, markers, height, aspect, corner, dropna, plot_kws, diag_kws, grid_kws, size)
2119 diag_kws.setdefault("shade", True)
2120 diag_kws["legend"] = False
-> 2121 grid.map_diag(kdeplot, **diag_kws)
2122
2123 # Maybe plot on the off-diagonals
~/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py in map_diag(self, func, **kwargs)
1488 data_k = utils.remove_na(data_k)
1489
-> 1490 func(data_k, label=label_k, color=color, **kwargs)
1491
1492 self._clean_axis(ax)
~/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py in kdeplot(data, data2, shade, vertical, kernel, bw, gridsize, cut, clip, legend, cumulative, shade_lowest, cbar, cbar_ax, cbar_kws, ax, **kwargs)
703 ax = _univariate_kdeplot(data, shade, vertical, kernel, bw,
704 gridsize, cut, clip, legend, ax,
--> 705 cumulative=cumulative, **kwargs)
706
707 return ax
~/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py in _univariate_kdeplot(data, shade, vertical, kernel, bw, gridsize, cut, clip, legend, ax, cumulative, **kwargs)
293 x, y = _statsmodels_univariate_kde(data, kernel, bw,
294 gridsize, cut, clip,
--> 295 cumulative=cumulative)
296 else:
297 # Fall back to scipy if missing statsmodels
~/anaconda3/lib/python3.7/site-packages/seaborn/distributions.py in _statsmodels_univariate_kde(data, kernel, bw, gridsize, cut, clip, cumulative)
365 fft = kernel == "gau"
366 kde = smnp.KDEUnivariate(data)
--> 367 kde.fit(kernel, bw, fft, gridsize=gridsize, cut=cut, clip=clip)
368 if cumulative:
369 grid, y = kde.support, kde.cdf
~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py in fit(self, kernel, bw, fft, weights, gridsize, adjust, cut, clip)
138 density, grid, bw = kdensityfft(endog, kernel=kernel, bw=bw,
139 adjust=adjust, weights=weights, gridsize=gridsize,
--> 140 clip=clip, cut=cut)
141 else:
142 density, grid, bw = kdensity(endog, kernel=kernel, bw=bw,
~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py in kdensityfft(X, kernel, bw, weights, gridsize, adjust, clip, cut, retgrid)
451 bw = float(bw)
452 except:
--> 453 bw = bandwidths.select_bandwidth(X, bw, kern) # will cross-val fit this pattern?
454 bw *= adjust
455
~/anaconda3/lib/python3.7/site-packages/statsmodels/nonparametric/bandwidths.py in select_bandwidth(x, bw, kernel)
172 # eventually this can fall back on another selection criterion.
173 err = "Selected KDE bandwidth is 0. Cannot estiamte density."
--> 174 raise RuntimeError(err)
175 else:
176 return bandwidth
RuntimeError: Selected KDE bandwidth is 0. Cannot estiamte density.
"scott" is the name of a method for choosing the bandwidth of a kernel density estimate (KDE). It is named after D.W. Scott (1). It does not come from your data: as your traceback shows, statsmodels first tries float(bw), which fails for the string 'scott', and then falls back to selecting the bandwidth with that method. The ValueError is therefore harmless; the RuntimeError raised during the fallback is the real problem.
I cannot look at your data, but my guess is that something is unusual about one of the variables at a certain hue level, which prevents seaborn from calculating a proper bandwidth.
You could use diag_kws to pass arguments to sns.kdeplot(), which pairplot uses to plot the univariate distributions on the diagonal.
For example:
sns.pairplot(..., diag_kws={'bw':'silverman'})
would force sns.kdeplot() to use the "silverman" method to choose the bandwidth, which might work better than the Scott method in your case.
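To illustrate how a bandwidth of 0 can arise, here is a minimal sketch of a rule-of-thumb bandwidth in plain Python. It assumes the common textbook form factor * min(std, IQR / 1.349) * n^(-1/5), with a factor of 1.059 for Scott and 0.9 for Silverman. I believe statsmodels follows this form, but treat the constants as an assumption. The function name and the education list are made up for the illustration; this is not your real column.

```python
import statistics


def rule_of_thumb_bandwidth(values, factor):
    """Assumed rule of thumb: factor * min(std, IQR / 1.349) * n ** (-1/5)."""
    n = len(values)
    std = statistics.stdev(values)  # sample standard deviation
    # Quartiles with linear interpolation between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = min(std, (q3 - q1) / 1.349)
    return factor * spread * n ** (-0.2)


# Hypothetical 0/1 column in which more than 75% of the rows share one value
education = [1] * 17 + [0] * 3

scott_bw = rule_of_thumb_bandwidth(education, factor=1.059)      # assumed Scott factor
silverman_bw = rule_of_thumb_bandwidth(education, factor=0.9)    # assumed Silverman factor
print(scott_bw, silverman_bw)
```

Both values come out as 0 here: more than three quarters of the sample share one value, so the interquartile range is 0 even though the standard deviation is not. If that is what happens with your Education column, switching from 'scott' to 'silverman' may not be enough on its own.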
(1) D.W. Scott, “Multivariate Density Estimation: Theory, Practice, and Visualization”, John Wiley & Sons, New York, Chichester, 1992.
EDIT
To try to pinpoint the culprit, you would have to use PairGrid instead of pairplot(). PairGrid allows you to use a custom function to plot the diagonal. If you include a print statement in that function, you can see the data that would be passed to sns.kdeplot(). Execution should stop at the point where the data is "incorrect", and you might be able to figure out what to do from there.
For example:
import seaborn as sns

def test_func(*data, **kwargs):
    # Show what PairGrid hands to the diagonal plotting function
    print("data received:", data)
    print("hue name + other params:", kwargs)
    sns.kdeplot(*data, **kwargs)

iris = sns.load_dataset('iris')
g = sns.PairGrid(iris, hue="species")
g = g.map_diag(test_func)
For each variable (column) and for each level of hue, you get output that looks like this:
data received: (array([5.1, 4.9, 4.7, 4.6, 5. , 5.4, 4.6, 5. , 4.4, 4.9, 5.4, 4.8, 4.8,
4.3, 5.8, 5.7, 5.4, 5.1, 5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5. ,
5. , 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5. , 5.5, 4.9, 4.4,
5.1, 5. , 4.5, 4.4, 5. , 5.1, 4.8, 5.1, 4.6, 5.3, 5. ]),)
hue name + other params: {'label': 'setosa', 'color': (0.12156862745098039, 0.4666666666666667, 0.7058823529411765)}
data received: (array([7. , 6.4, 6.9, 5.5, 6.5, 5.7, 6.3, 4.9, 6.6, 5.2, 5. , 5.9, 6. ,
6.1, 5.6, 6.7, 5.6, 5.8, 6.2, 5.6, 5.9, 6.1, 6.3, 6.1, 6.4, 6.6,
6.8, 6.7, 6. , 5.7, 5.5, 5.5, 5.8, 6. , 5.4, 6. , 6.7, 6.3, 5.6,
5.5, 5.5, 6.1, 5.8, 5. , 5.6, 5.7, 5.7, 6.2, 5.1, 5.7]),)
hue name + other params: {'label': 'versicolor', 'color': (1.0, 0.4980392156862745, 0.054901960784313725)}
(...)
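If the printed arrays are too long to read by eye, a quicker check is to compute the spread of each variable per hue level directly. The sketch below is plain Python and assumes that a zero interquartile range is what leads to the zero bandwidth. The function name zero_iqr_groups and the toy rows are hypothetical, not taken from the real dataset.

```python
import statistics
from collections import defaultdict


def zero_iqr_groups(rows, hue, variables):
    """Return (hue_level, variable) pairs with too little spread for a KDE."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for var in variables:
            value = row[var]
            # Skip missing values (None or NaN; NaN is not equal to itself)
            if value is not None and value == value:
                grouped[row[hue]][var].append(value)

    flagged = []
    for level, columns in grouped.items():
        for var, values in columns.items():
            if len(values) < 2:
                flagged.append((level, var))
                continue
            q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
            if q3 == q1:  # interquartile range is zero
                flagged.append((level, var))
    return flagged


# Hypothetical toy rows standing in for the filtered data
rows = [
    {"Loan_Status": "N", "Education": 1, "ApplicantIncome": 4583},
    {"Loan_Status": "N", "Education": 1, "ApplicantIncome": 3000},
    {"Loan_Status": "N", "Education": 1, "ApplicantIncome": 2583},
    {"Loan_Status": "N", "Education": 1, "ApplicantIncome": 6000},
    {"Loan_Status": "N", "Education": 0, "ApplicantIncome": 1500},
    {"Loan_Status": "Y", "Education": 1, "ApplicantIncome": 5849},
    {"Loan_Status": "Y", "Education": 0, "ApplicantIncome": 2600},
    {"Loan_Status": "Y", "Education": 1, "ApplicantIncome": 3200},
    {"Loan_Status": "Y", "Education": 0, "ApplicantIncome": 4100},
]

suspects = zero_iqr_groups(rows, "Loan_Status", ["ApplicantIncome", "Education"])
print(suspects)
```

With a pandas DataFrame, bad_credit.to_dict("records") should give rows in this shape. Any pair that is flagged is a candidate for the group on which sns.kdeplot() fails.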