
OLS in Python with Dummy Variables - Best Solution?

I have a problem I am trying to solve in Python, and I have found multiple solutions (I think) but I am trying to figure out which one is the best. I am hoping to choose libraries that will be supported fully in the future so I do not have to re-write this service.

I want to run an ordinary least squares regression with multiple independent (explanatory) variables, some categorical and some continuous. The code has to be written in Python, as it is being integrated into a web service. I have been following pandas for a while but have never used it, so this seems to be one approach:

SOLUTION 1. https://github.com/pydata/pandas/blob/master/examples/regressions.py

Obviously, numpy/scipy are ideal, but I can't find an example that uses dummy variables (does anyone have one?). I did find this, though:

SOLUTION 2. http://www.scipy.org/Cookbook/OLS

which I could modify to support dummy variables, but I do not want to do that if someone else has already done it. I also want the numbers to be very close to R's output, as I have done most of my analysis offline in R and can use those results for unit tests.

In example (2) above, I see that I could technically use rpy/rpy2, although that is not optimal because my web service would then require yet another piece of technology (R). The good thing about using that interface is that the numbers would be identical to my results from R.

SOLUTION 3. http://www.scipy.org/Cookbook/OLS (but using Rpy/Rpy2)

Anyway, I am interested in which of these three solutions everyone would choose, whether there are any I am missing, and whether pandas is mature enough to start using in a production web service. The key thing here is that I do not want to have to support or patch bug fixes, or write anything from scratch, if possible. I'm too busy and probably not smart enough :)

Thanks.

josephmisiti asked Dec 07 '25 13:12


1 Answer

You can use statsmodels, which provides many different models and result statistics.

If you want to use an R-like formula interface, here are some examples; you can also look at the corresponding documentation:

http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/contrasts.html
http://statsmodels.sourceforge.net/devel/examples/notebooks/generated/example_formulas.html
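As a rough illustration of the formula interface, here is a minimal sketch. The data frame, the column names (`x`, `group`, `y`) and the coefficient values are made up for the example and are not from the question. `C(group)` asks for dummy coding of the categorical column; as far as I know the default is treatment coding with the first level as the reference, which should correspond to `lm(y ~ x + factor(group))` in R.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: one continuous regressor and one three-level factor.
df = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "group": ["a", "b", "c", "a", "b", "c"],
})

# Noise-free response, so the fit should recover these assumed coefficients:
# intercept 1, slope 2, group offsets b=+3 and c=-1 relative to level "a".
effects = {"a": 0.0, "b": 3.0, "c": -1.0}
df["y"] = 1.0 + 2.0 * df["x"] + df["group"].map(effects)

# C(group) expands the categorical column into dummy variables,
# dropping the first level ("a") as the reference category.
result = smf.ols("y ~ x + C(group)", data=df).fit()

# params is a pandas Series indexed by term name,
# e.g. "Intercept", "C(group)[T.b]", "C(group)[T.c]", "x".
print(result.params)
```

With real, noisy data you would typically also look at `result.summary()` and compare the coefficient table with R's output.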

If you want a pure numpy version, here is an older example that does everything from scratch: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html#ols-with-dummy-variables
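For a sense of what "from scratch" involves, the sketch below builds the dummy columns by hand and solves the least squares problem with `numpy.linalg.lstsq`. It is not the linked example; the data and variable names are invented for illustration, and it only produces coefficients, not standard errors or other result statistics.

```python
import numpy as np

# Hypothetical toy data: one continuous regressor and one three-level factor.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
group = np.array(["a", "b", "c", "a", "b", "c"])

# Noise-free response built from assumed coefficients, so the fit is easy to check.
effects = {"a": 0.0, "b": 3.0, "c": -1.0}
y = 1.0 + 2.0 * x + np.array([effects[g] for g in group])

# Treatment coding: one 0/1 column per level except the first,
# which becomes the reference category absorbed by the intercept.
levels = np.unique(group)  # sorted unique levels: a, b, c
dummies = (group[:, None] == levels[None, 1:]).astype(float)

# Design matrix columns: intercept, x, dummy for "b", dummy for "c".
X = np.column_stack([np.ones_like(x), x, dummies])

# Least squares solution of X @ beta = y.
beta, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept", "x", "group_b", "group_c"], beta)))
```

Dropping one level matters: keeping all three dummy columns alongside the intercept would make the design matrix rank deficient.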

The models are integrated with pandas, and can use pandas DataFrame as the data structure for the dependent and independent variables (endog and exog in statsmodels naming convention).

Josef answered Dec 09 '25 03:12


