I am trying to calculate the chi-square value in Python, using a contingency table. Here is an example.
+--------+------+------+
|        | Cat1 | Cat2 |
+--------+------+------+
| Group1 |   80 |  120 |
| Group2 |  420 |  380 |
+--------+------+------+
The expected values are:
+--------+------+------+
|        | Cat1 | Cat2 |
+--------+------+------+
| Group1 |  100 |  100 |
| Group2 |  400 |  400 |
+--------+------+------+
If I calculate the chi-square value by hand, I get 10. With Python, however, I get 9.506. I use the following code:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
import scipy
# Some fake data.
n = 5  # Number of samples.
d = 3  # Dimensionality.
c = 2  # Number of categories.
data = np.random.randint(c, size=(n, d))
data = pd.DataFrame(data, columns=['CAT1', 'CAT2', 'CAT3'])
# Contingency table.
contingency = pd.crosstab(data['CAT1'], data['CAT2'])
# Overwrite the cells with the example counts (single .iloc call, no chained indexing).
contingency.iloc[0, 0] = 80
contingency.iloc[0, 1] = 120
contingency.iloc[1, 0] = 420
contingency.iloc[1, 1] = 380
# Chi-square test of independence.
chi, p, dof, expected = chi2_contingency(contingency)
Oddly, the function gives me the correct expected values, but the chi-square statistic and p-value are off. What am I doing wrong here?
Thanks
p.s.
I am aware that the way I create the initial table in pandas is pretty lame, but I am not an expert on how to create these nested tables in pandas.
From the documentation:
correction : bool, optional
If True, and the degrees of freedom is 1, apply Yates’ correction for continuity.
The effect of the correction is to adjust each observed value by 0.5 towards
the corresponding expected value.
Your table is 2x2, so the degrees of freedom is 1 and the correction is applied by default. If you set correction to False, you'll get 10.
>>> chi2_contingency(contingency, correction=False)
(10.0, 0.001565402258002549, 1, array([[ 100.,  100.],
       [ 400.,  400.]]))
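To see where the 9.506 comes from, here is a minimal sketch that reproduces both numbers by hand and compares them with SciPy. It builds the table directly with `pd.DataFrame` instead of overwriting a crosstab. The variable names (`observed`, `chi2_plain`, `chi2_yates`) are my own, and the hand-written correction assumes the behaviour described in the documentation quote above (each |observed - expected| shrunk by 0.5).

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Build the 2x2 contingency table directly from the example counts.
observed = pd.DataFrame(
    [[80, 120], [420, 380]],
    index=["Group1", "Group2"],
    columns=["Cat1", "Cat2"],
)

obs = observed.to_numpy(dtype=float)

# Expected counts under independence: row total * column total / grand total.
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# Plain Pearson statistic: sum of (O - E)^2 / E.
chi2_plain = ((obs - expected) ** 2 / expected).sum()

# Yates-corrected statistic: move each observed value 0.5 towards its
# expected value, i.e. shrink |O - E| by 0.5 (but never below zero).
abs_diff = np.abs(obs - expected)
chi2_yates = (np.maximum(abs_diff - 0.5, 0.0) ** 2 / expected).sum()

# SciPy with the default (correction=True) and with the correction disabled.
scipy_yates, _, dof, _ = chi2_contingency(observed)
scipy_plain, _, _, _ = chi2_contingency(observed, correction=False)

print(chi2_plain, scipy_plain)   # both approximately 10.0
print(chi2_yates, scipy_yates)   # both approximately 9.506
```

Every cell here has |observed - expected| = 20, so the corrected statistic uses 19.5 instead: 2 * 19.5^2 / 100 + 2 * 19.5^2 / 400 = 9.50625, which is the value from the question.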