I have a pandas DataFrame:
import pandas as pd
data = pd.read_csv(r'C:\data-path\demographics.csv', sep=',')
print(data)
PersonID  Married  No. of Children  Sex
1         yes      0                male
2         no       0                female
3         no       1                male
4         yes      1                male
5         no       1                female
6         no       2                female
7         no       1                male
8         no       2                male
9         no       2                male
10        no       1                male
11        no       0                female
Now I try to create a mosaic plot from it, using statsmodels.graphics.mosaicplot:
from statsmodels.graphics.mosaicplot import mosaic
mosaic(data, ['Married', 'No. of Children'])
...which works. However, whenever I try to add a third dimension, for example:
mosaic(data, ['Married', 'No. of Children', 'Sex'])
...I get the following error message:
ValueError: at least one proportion should begreater than zero
I am not sure what it wants from me. Is some parameter missing or set incorrectly?
It also doesn't matter which columns/dimensions I choose, or in what order. Whenever I have more than 2, I get an error.
Anybody have an idea?
Thanks in advance
After some tinkering, I didn't find a solution, but I did find the origin of the bug.
It lies in the code of the mosaicplot module: http://nipy.bic.berkeley.edu/nightly/statsmodels/doc/html/_modules/statsmodels/graphics/mosaicplot.html
In short: it is unable to handle classes in a dataset that are empty, i.e. have 0 instances.
Consider the dataset from the original question and then the following function call:
mosaic(data, ['Married', 'No. of Children'])
The 'mosaic' function first determines how many classes the first category has (in this case: 2) and how often each class occurs. From that it generates a list of 'proportions' for the plot's rectangles, which for the 'Married' category is
[2, 9]
...as there are 2 'yes' and 9 'no' instances.
Each of these classes is then split again according to the second category, here 'No. of Children'. It has 3 classes (0, 1, and 2), which generates the following 'proportions':
[1, 1, 0] (1 married with 0 children, 1 married with 1 child, 0 married with 2 children)
[2, 4, 3] (2 unmarried with 0 children, 4 unmarried with 1 child, 3 unmarried with 2 children)
Based on the above alone, it is able to draw a perfectly fine mosaic plot.
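The counts above can be reproduced by hand. The following is a minimal sketch using only the standard library and the eleven rows from the question; it imitates the counting described here and is not the actual statsmodels code. The variable names are made up for illustration.

```python
from collections import Counter

# Rows from the question: (Married, No. of Children, Sex)
rows = [
    ("yes", 0, "male"), ("no", 0, "female"), ("no", 1, "male"),
    ("yes", 1, "male"), ("no", 1, "female"), ("no", 2, "female"),
    ("no", 1, "male"), ("no", 2, "male"), ("no", 2, "male"),
    ("no", 1, "male"), ("no", 0, "female"),
]

married_levels = ["yes", "no"]
children_levels = [0, 1, 2]

# First split: how often each 'Married' class occurs
married_counts = Counter(m for m, _, _ in rows)
first_split = [married_counts[m] for m in married_levels]

# Second split: 'No. of Children' counts within each 'Married' class
pair_counts = Counter((m, c) for m, c, _ in rows)
second_split = {
    m: [pair_counts[(m, c)] for c in children_levels]
    for m in married_levels
}

print(first_split)   # [2, 9]
print(second_split)  # {'yes': [1, 1, 0], 'no': [2, 4, 3]}
```

None of these lists sums to zero, which fits the observation that the two-category plot works.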
However, once we take a third category into account (for instance 'Sex'), the 0 in one of the lists above becomes a problem. It spawns the list [0, 0], as there are 0 married men and 0 married women with 2 children.
And in line 45 of the source code, there is an if-clause that raises the exception for all-zero lists (as they are 'not meaningful').
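To see which combination produces the all-zero list, the third split can be counted the same way. This is again only a standard-library sketch of the idea, with hypothetical variable names, not the library's implementation.

```python
from collections import Counter
from itertools import product

# Rows from the question: (Married, No. of Children, Sex)
rows = [
    ("yes", 0, "male"), ("no", 0, "female"), ("no", 1, "male"),
    ("yes", 1, "male"), ("no", 1, "female"), ("no", 2, "female"),
    ("no", 1, "male"), ("no", 2, "male"), ("no", 2, "male"),
    ("no", 1, "male"), ("no", 0, "female"),
]

married_levels = ["yes", "no"]
children_levels = [0, 1, 2]
sex_levels = ["male", "female"]

triple_counts = Counter(rows)

# For every (Married, Children) rectangle, compute the 'Sex' split
# and remember the rectangles whose split sums to zero.
third_split = {}
empty_parents = []
for m, c in product(married_levels, children_levels):
    split = [triple_counts[(m, c, s)] for s in sex_levels]
    third_split[(m, c)] = split
    if sum(split) == 0:
        empty_parents.append((m, c))

print(third_split[("yes", 2)])  # [0, 0]
print(empty_parents)            # [('yes', 2)]
```

Only the 'married with 2 children' rectangle has nothing to split, and that is the list the check rejects.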
As said above, I was not able to find a fix or workaround for this. Simply commenting out said if-clause allows all the splits to be performed normally, but it also causes the drawing of the mosaic plot to throw an exception in matplotlib's backend_agg.py, as there are now values which are NaN (not a number).
Why this is, I have no idea, and I would be glad if someone brighter and more experienced than me would look into this.
I still won't rule out that I just have to set some parameters differently.
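One thing that might be worth trying, offered purely as an untested sketch: build the full table of counts by hand and give the empty combinations a tiny positive weight, so that no split sums to exactly zero. The EPSILON value is an arbitrary, hypothetical choice, and it is an assumption that mosaic accepts a dict keyed by tuples in the installed statsmodels version and draws the padded cells sensibly.

```python
from collections import Counter
from itertools import product

# Rows from the question: (Married, No. of Children, Sex)
rows = [
    ("yes", 0, "male"), ("no", 0, "female"), ("no", 1, "male"),
    ("yes", 1, "male"), ("no", 1, "female"), ("no", 2, "female"),
    ("no", 1, "male"), ("no", 2, "male"), ("no", 2, "male"),
    ("no", 1, "male"), ("no", 0, "female"),
]

levels = [["yes", "no"], [0, 1, 2], ["male", "female"]]
EPSILON = 1e-6  # hypothetical placeholder weight for empty combinations

counts = Counter(rows)

# One entry per possible combination; empty ones get EPSILON instead of 0
padded = {
    key: (counts[key] if counts[key] > 0 else EPSILON)
    for key in product(*levels)
}

print(len(padded))                   # 12
print(padded[("yes", 2, "male")])    # 1e-06

# Assumption (not verified here): the dict can be passed to mosaic directly.
# from statsmodels.graphics.mosaicplot import mosaic
# mosaic(padded)
```

The padded cells would show up as extremely thin rectangles rather than disappear, so this changes the picture slightly and should be treated as a stopgap at best.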