Python Pandas find statistical difference between 2 distributions

Question

I have 2 columns with similar data. I plot them to compare their distributions, and I want to quantify their difference.

import pandas as pd

df = pd.DataFrame({'a': ['cat', 'dog', 'bird', 'cat', 'dog', 'dog', 'dog'],
                   'b': ['cat', 'cat', 'cat', 'bird', 'dog', 'dog', 'dog']})

I then plot the 2 columns of my data frame to compare their distributions:

ax = df['a'].value_counts().plot(kind='bar', color='blue', width=.75, legend=True, alpha=0.8)
df['b'].value_counts().plot(kind='bar', color='maroon', width=.5, alpha=1, legend=True)

How can I quantify the difference between the distributions statistically, to say how similar they are?

Would it be a simple t-test or something else?

Ami Tavory · Accepted Answer

It is very common to use the two-sample Kolmogorov-Smirnov test for this.

In Python, you can do so with scipy.stats.ks_2samp:

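import pandas as pd
from scipy import stats

# Per-category counts for each column, joined on the category label.
# to_frame('a') / to_frame('b') name the count columns explicitly, since
# newer pandas versions name the value_counts() result "count".
merged = pd.merge(
    df.a.value_counts().to_frame('a'),
    df.b.value_counts().to_frame('b'),
    left_index=True,
    right_index=True)

# Returns the KS statistic and the p-value.
stats.ks_2samp(merged.a, merged.b)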

Broadly speaking, if the p-value (the second value of the returned result) is small, say below 0.05, you can reject the null hypothesis that the two samples come from the same distribution.
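
As a rough end-to-end sketch of the decision rule above, the snippet below rebuilds the example data, runs the test, and compares the p-value with a threshold. The names counts, alpha, and reject_null are only illustrative, and the 0.05 cutoff is the conventional choice mentioned above, not a requirement. It uses pd.concat instead of pd.merge so that a category missing from one column is kept with a count of 0; that is an assumption about what you want, not part of the original approach.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({'a': ['cat', 'dog', 'bird', 'cat', 'dog', 'dog', 'dog'],
                   'b': ['cat', 'cat', 'cat', 'bird', 'dog', 'dog', 'dog']})

# Count each category per column and align the counts on the category label.
# Categories absent from one column would show up as NaN, so fill them with 0.
counts = pd.concat(
    [df['a'].value_counts().rename('a'),
     df['b'].value_counts().rename('b')],
    axis=1,
).fillna(0)

# Two-sample Kolmogorov-Smirnov test on the two sets of counts.
result = stats.ks_2samp(counts['a'], counts['b'])
statistic, p_value = result.statistic, result.pvalue

alpha = 0.05  # illustrative significance threshold (an assumption)
reject_null = p_value < alpha

print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3f}")
print("reject 'same distribution'" if reject_null else "cannot reject 'same distribution'")
```

With only three categories per column the samples passed to the test are very small, so on this toy data the p-value stays well above 0.05 and the hypothesis of identical distributions is not rejected.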

Tags:

python

pandas

numpy

scipy

jxn
