
Difficulties using seaborn barplot with categorical data

Tags: python, seaborn

I've been encountering a recurrent problem with using seaborn's "categorical" plotting functions to actually plot rates of categorical data.

I crafted a simple example here that I could have sworn used to work with seaborn. I managed to find a workaround using dummy variables, but this isn't always convenient. Does anyone know why my "Version 2" use case for barplot doesn't work?

import pandas as pd
from pandas import DataFrame
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Generate some example data of labels and associated values
outcomes = ['A' for _ in range(50)] + \
           ['B' for _ in range(20)] + \
           ['C' for _ in range(5)] 
trial = range(len(outcomes))

df = DataFrame({'Trial': trial, 'Outcome': outcomes})

plt.close('all')

# Version 1: This works but is a non-ideal workaround

# Generate separate boolean columns for each outcome
df2 = pd.get_dummies(df.Outcome).astype(bool)

plt.figure()
sns.barplot(data=df2, estimator=lambda x: 100 * np.mean(x))
plt.title('Outcomes V1')
plt.ylabel('Percent Trials')
plt.ylim([0,100])
plt.show()

# Version 2: This doesn't work and results in the following error
# unsupported operand type(s) for /: 'str' and 'int' 
plt.figure()
sns.barplot(x='Outcome', data=df, estimator=lambda x: 100 * np.mean(x))
plt.title('Outcomes V2')
plt.ylabel('Percent Trials')
plt.ylim([0,100])
plt.show()

Here's what I'm expecting the plot to look like.

Brandon Itkowitz asked Jan 24 '26 18:01



1 Answer

Adding the y parameter makes the call run without the error:

sns.barplot(x='Outcome', y='Trial', data=df, estimator=lambda x: 100 * np.mean(x))

However, in your case it makes more sense to plot with sns.countplot, since you want to treat trial 10 as one occurrence, not as the number ten:

sns.countplot(x='Outcome', data=df)

Or, if you want percentages, you could do something like:

sns.barplot(x='Outcome', y='Trial', data=df, estimator=lambda x: len(x) / len(df) * 100)  
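
As an alternative sketch, assuming you are free to aggregate before plotting, you could compute the percentages with pandas and give seaborn an explicit numeric y column. The column name Percent and the headless Agg backend line are assumptions added for illustration; they are not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Same example data as in the question
outcomes = ['A'] * 50 + ['B'] * 20 + ['C'] * 5
df = pd.DataFrame({'Trial': range(len(outcomes)), 'Outcome': outcomes})

# Share of trials per outcome, expressed as a percentage.
# 'Percent' is a hypothetical column name chosen for this sketch.
percent = (
    df['Outcome']
    .value_counts(normalize=True)
    .mul(100)
    .rename_axis('Outcome')
    .reset_index(name='Percent')
)

# Plot the pre-aggregated values: one bar per outcome
ax = sns.barplot(x='Outcome', y='Percent', data=percent)
ax.set_title('Outcomes (pre-aggregated)')
ax.set_ylabel('Percent Trials')
ax.set_ylim(0, 100)
```

Because each outcome has a single pre-computed value, no estimator is needed. In an interactive session you would drop the Agg line and call plt.show() as in the question.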

Explanation

With a wide form data frame (such as df2), you can pass only the data frame to the data parameter, and Seaborn will automatically plot each numeric column along the x-axis.

With a long-form data frame (such as df), you need to pass arguments to both the x and y parameters.
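
To make the wide-form versus long-form distinction concrete, here is a minimal pandas-only sketch that reshapes the dummy-variable frame from Version 1 into long form. The names wide, long, and Occurred are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Same example data as in the question
outcomes = ['A'] * 50 + ['B'] * 20 + ['C'] * 5
df = pd.DataFrame({'Trial': range(len(outcomes)), 'Outcome': outcomes})

# Wide form: one boolean column per outcome (what Version 1 plots)
wide = pd.get_dummies(df['Outcome']).astype(bool)

# Long form: one row per (trial, outcome) pair with a boolean indicator
long = wide.melt(var_name='Outcome', value_name='Occurred')

# The per-outcome mean of the indicator is the fraction of trials
percent = long.groupby('Outcome')['Occurred'].mean() * 100
```

With the long-form frame, both axes can be named explicitly, for example sns.barplot(x='Outcome', y='Occurred', data=long, estimator=lambda x: 100 * np.mean(x)), which should reproduce the Version 1 bar heights.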

From the sns.barplot docstring (emphasis added):

Input data can be passed in a variety of formats, including:

  • Vectors of data represented as lists, numpy arrays, or pandas Series objects passed directly to the x, y, and/or hue parameters.
  • A "long-form" DataFrame, in which case the x, y, and hue variables will determine how the data are plotted.
  • A "wide-form" DataFrame, such that each numeric column will be plotted.
  • Anything accepted by plt.boxplot (e.g. a 2d array or list of vectors)
joelostblom answered Jan 26 '26 10:01



