In R, I use the phyper function to do a hypergeometric test for bioinformatics analysis. However I use a lot of Python code and using rpy2 here is quite slow. So, I started looking for alternatives. It seemed that scipy.stats.hypergeom had something similar.
Currently, I call phyper like this:
pvalue <- 1-phyper(45, 92, 7518, 1329)
where 45 is the number of selected items having the property of interest, 92 the number of total items having the property, 7518 the number of non selected items not having the property, and 1329 the total number of selected items.
In R, this yields 6.92113e-13.
Attempting to do the same with scipy.stats.hypergeom however yields a completely different result (notice, the numbers are swapped because the function accepts numbers in a different way):
import scipy.stats as stats   
pvalue = 1-stats.hypergeom.cdf(45, 7518, 92. 1329)
print pvalue
However this returns -7.3450134863151106e-12, which makes little sense. Notice that I've tested this on other data and I had little issues (same precision up to the 4th decimal, which is enough for me).
So it boils down to these possibilities:
In case of "1", are there other alternatives to phyper that can be used in Python?
EDIT: As the comments have noted, this is a bug in scipy, fixed in git master.
From the docs, you could try:
hypergeom.sf(x,M,n,N,loc=0): survival function (1-cdf — sometimes more accurate)
Also, I think you might have the values mixed up.
Models drawing objects from a bin. M is total number of objects, n is total number of Type I objects. RV counts number of Type I objects in N drawn without replacement from population.
Therefore, by my reading: x=q, M=n+m, n=m, N=k.
So I would try:
stats.hypergeom.sf(45,(92+7518),92,1329)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With