problem: I'm grouping results in my DataFrame, computing value_counts(normalize=True) per group,
and trying to plot the result in a bar plot.
The bar plot should show relative frequencies. In some groups, some values don't occur. In that case, the corresponding value_counts entry
is not 0; it doesn't exist at all. The bar plot therefore never sees the missing 0, and the resulting bar is too high.
example: Here is a minimal example that illustrates the problem. Let's say the DataFrame contains observations from experiments. Each time an experiment is performed, a series of observations is collected. The result of the experiment is the set of relative frequencies of the observations collected for it.
import pandas as pd

df = pd.DataFrame()
df["id"] = [1]*3 + [2]*3 + [3]*3
df["experiment"] = ["a"]*6 + ["b"]*3
df["observation"] = ["positive"]*3 + ["positive"]*2 + ["negative"]*1 + ["positive"]*2 + ["negative"]*1
Here, experiment a has been run twice (ids 1 and 2) and experiment b just once (id 3).
I need to group by id and experiment, then average the results per experiment.
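# Relative frequency of each observation within every (id, experiment) group
plot_frame = df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True)
# Name the values "percentage"; this works whether pandas calls the Series "observation" or "proportion"
plot_frame = plot_frame.rename("percentage").to_frame()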
In the resulting plot_frame, you can already see the problem. The run with id 1 has seen only positive observations. The relative frequency of "negative" should be 0. Instead, the row doesn't exist. If I plot this, the corresponding bar is too high; the blue bars (experiment a) should add up to one:
import matplotlib.pyplot as plt
import seaborn as sns

sns.barplot(data=plot_frame.reset_index(),
            x="observation",
            hue="experiment",
            y="percentage")
plt.show()
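To make the effect concrete, here is a small sketch that reproduces the bar heights numerically instead of plotting them. It assumes the bar plot shows the mean of the rows that exist for each (experiment, observation) pair, which is seaborn's default estimator; the names counts, long_frame, bar_heights and total_a are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1] * 3 + [2] * 3 + [3] * 3,
    "experiment": ["a"] * 6 + ["b"] * 3,
    "observation": ["positive"] * 3 + ["positive"] * 2 + ["negative"]
                   + ["positive"] * 2 + ["negative"],
})

counts = df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True)

# The combination (id 1, experiment a, "negative") never occurs, so it has no row at all
has_row = (1, "a", "negative") in counts.index

# Mimic the bar plot: average whatever rows exist per (experiment, observation)
long_frame = counts.rename("percentage").reset_index()
bar_heights = long_frame.groupby(["experiment", "observation"])["percentage"].mean()

# Sum of the two bars for experiment a; it would be 1 if the zero row were present
total_a = bar_heights.loc["a"].sum()
print(has_row, total_a)
```

For experiment a, "negative" is averaged over a single row (1/3) while "positive" is averaged over two rows (5/6), so the two bars sum to 7/6 rather than 1.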
You can add the missing rows, filled with 0, by chaining unstack and stack and passing fill_value=0 to unstack. Try this:
df.groupby(["id", "experiment"])["observation"].value_counts(normalize=True).unstack(fill_value=0).stack()
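As a sketch of how this fits into the question's workflow, the snippet below applies the unstack/stack round trip to the example data and then averages per experiment. The column name "percentage" comes from the question; freq and mean_by_experiment are illustrative names, and the final groupby is only an assumed way to check the numbers a bar plot would show.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1] * 3 + [2] * 3 + [3] * 3,
    "experiment": ["a"] * 6 + ["b"] * 3,
    "observation": ["positive"] * 3 + ["positive"] * 2 + ["negative"]
                   + ["positive"] * 2 + ["negative"],
})

freq = (
    df.groupby(["id", "experiment"])["observation"]
      .value_counts(normalize=True)
      .unstack(fill_value=0)   # wide: one column per observation value, gaps become 0
      .stack()                 # back to long: one row per (id, experiment, observation)
      .rename("percentage")    # the round trip drops the Series name, so set it here
)

# Long-format frame with columns id, experiment, observation, percentage
plot_frame = freq.reset_index()

# Per-experiment averages now include the explicit zero for id 1
mean_by_experiment = plot_frame.groupby(["experiment", "observation"])["percentage"].mean()
print(mean_by_experiment)
```

With the zero row in place, every (id, experiment) group contributes one row per observation value, so the averaged frequencies for each experiment add up to 1. The resulting plot_frame can be passed to sns.barplot as in the question, without the extra reset_index().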