
Python Pandas find statistical difference between 2 distributions

I have 2 columns with similar data. I plot them to compare their distributions, and I want to quantify their difference.

import pandas as pd

df = pd.DataFrame({'a': ['cat', 'dog', 'bird', 'cat', 'dog', 'dog', 'dog'],
                   'b': ['cat', 'cat', 'cat', 'bird', 'dog', 'dog', 'dog']})

I then plot the 2 columns of my data frame to compare their distributions:

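order = df['a'].value_counts().index  # shared category order so both sets of bars line up (assumes 'b' has no categories missing from 'a')
ax = df['a'].value_counts().plot(kind='bar', color='blue', width=.75, legend=True, alpha=0.8)
df['b'].value_counts().reindex(order, fill_value=0).plot(kind='bar', color='maroon', width=.5, alpha=1, legend=True, ax=ax)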


How can I quantify the difference between the distributions statistically, to say how similar they are?

Would it be a simple t-test or something else?

jxn asked May 21 '26 13:05



1 Answer

It is very common to use the two-sample Kolmogorov-Smirnov (KS) test for this.

In Python, you can do so with scipy.stats.ks_2samp:

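import pandas as pd
from scipy import stats

# Count each category per column. The explicit rename keeps the column
# names 'a' and 'b' regardless of the pandas version.
merged = pd.merge(
    df['a'].value_counts().rename('a').to_frame(),
    df['b'].value_counts().rename('b').to_frame(),
    left_index=True,
    right_index=True)

# Returns (statistic, p-value)
stats.ks_2samp(merged['a'], merged['b'])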

Broadly speaking, if the second value of the returned tuple (the p-value) is small, say less than 0.05, you can reject the null hypothesis that the two samples come from the same distribution.
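The decision rule above could be sketched roughly as follows. This is only an illustration using the data from the question: the `alpha` threshold and the names `counts` and `reject_null` are assumptions made for the example, not part of the answer.

```python
import pandas as pd
from scipy import stats

# Data from the question.
df = pd.DataFrame({'a': ['cat', 'dog', 'bird', 'cat', 'dog', 'dog', 'dog'],
                   'b': ['cat', 'cat', 'cat', 'bird', 'dog', 'dog', 'dog']})

# Per-category counts for each column, aligned on the category label.
counts = pd.concat(
    [df['a'].value_counts().rename('a'), df['b'].value_counts().rename('b')],
    axis=1,
    join='inner',
)

# Assumed significance level, taken from the "say less than 0.05" rule of thumb.
alpha = 0.05

# Two-sample KS test on the two sets of counts.
result = stats.ks_2samp(counts['a'], counts['b'])
statistic, pvalue = result.statistic, result.pvalue

# A small p-value means the "same distribution" hypothesis is rejected.
reject_null = pvalue < alpha
print(f"KS statistic = {statistic:.3f}, p-value = {pvalue:.3f}")
print("reject 'same distribution'" if reject_null else "cannot reject 'same distribution'")
```

With only three categories per column the test has very little data to work with, so the outcome is best read as a rough indication rather than a firm conclusion.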

Ami Tavory answered May 24 '26 01:05
