
Statsmodels Multiple Linear Regression Error - Python

I am running (what I think is) a fairly straightforward multiple linear regression model fit using statsmodels.

My code is as follows:

y = 'EXITS|20:00:00'
all_columns = "+".join(y_2015piv.columns - ['EXITS|20:00:00'])
reg_formula = "y~" + all_columns

lm= smf.ols(formula=reg_formula, data=y_2015piv).fit()

Because I have about 30 factor variables, I'm building the formula with Python string manipulation. "y" is as shown above. all_columns is the list of columns of the dataframe y_2015piv, without "y", joined by "+".

This is all_columns:

DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep

The values in the dataframe are continuous numerical variables and 0/1 dummy variables.

When I try to fit the model, I get this error:

PatsyError: numbers besides '0' and '1' are only allowed with **
    y~DAY_Fri+DAY_Mon+DAY_Sat+DAY_Sun+DAY_Thu+DAY_Tue+DAY_Wed+ENTRIES|00:00:00+ENTRIES|04:00:00+ENTRIES|08:00:00+ENTRIES|12:00:00+ENTRIES|16:00:00+ENTRIES|20:00:00+EXITS|00:00:00+EXITS|04:00:00+EXITS|08:00:00+EXITS|12:00:00+EXITS|16:00:00+MONTH_Apr+MONTH_Aug+MONTH_Dec+MONTH_Feb+MONTH_Jan+MONTH_Jul+MONTH_Jun+MONTH_Mar+MONTH_May+MONTH_Nov+MONTH_Oct+MONTH_Sep

I can't find anything online that addresses what this could be. Any help is appreciated.

By the way, when I fit this model in scikit-learn it works fine, so I figure the data is in order.

Thanks in advance.

Windstorm1981 asked Oct 16 '25 02:10


2 Answers

The first error that I got was this:

PatsyError: numbers besides '0' and '1' are only allowed with **
Temp ~ MEI+ CO2+ CH4+ N2O+ CFC-11+ CFC-12+ TSI+ Aerosols
                               ^^

According to the patsy documentation (http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q), you can wrap a variable name in Q("var") in the formula so that it is treated as a literal name. I was getting the same error, and this solved it:

linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11")+ Q("CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()

This is the working line of code. I had first tried

linMod = smf.ols('Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols',data = trainingSet).fit()

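but this did not work, because Q treats the whole quoted string as a single variable name, so each name needs its own Q(...). More generally, it seems that in a formula, numbers and operator characters have a special meaning, so column names containing them cannot be used unquoted. In my case the error was: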

PatsyError: Error evaluating factor: NameError: no data named 'CFC-11+ CFC-12' found
Temp ~ MEI+ CO2+ CH4+ N2O+ Q("CFC-11+ CFC-12")+ TSI+ Aerosols
                           ^^^^^^^^^^^^^^^^^^^
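For a dataframe with many awkward column names, like the one in the question, the same fix can be applied programmatically. The following is a minimal sketch, not the asker's actual code: the dataframe `df`, its random values, and the helper `quote` are hypothetical stand-ins, and it assumes no column name itself contains a double quote.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: the column names contain characters ("|", ":", "-")
# that patsy would otherwise parse as formula syntax.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "EXITS|20:00:00": rng.normal(size=50),
    "ENTRIES|16:00:00": rng.normal(size=50),
    "CFC-11": rng.normal(size=50),
    "DAY_Fri": rng.integers(0, 2, size=50),
})

target = "EXITS|20:00:00"
predictors = [c for c in df.columns if c != target]


def quote(name):
    """Wrap a column name in patsy's Q() so it is read as a literal name."""
    return 'Q("{}")'.format(name)


# Quote every name individually, including the response variable.
formula = quote(target) + " ~ " + " + ".join(quote(c) for c in predictors)
model = smf.ols(formula=formula, data=df).fit()

print(formula)
print(model.params)
```

Quoting every column, even ones with harmless names such as DAY_Fri, keeps the string building uniform. The fitted parameter labels then carry the Q("...") wrapper.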
Learner answered Oct 18 '25 16:10


patsy handles the formula parsing: it parses the string and interprets it as a formula with the given syntax. Some characters are therefore not allowed in names because they are part of the formula syntax. To keep such names as literal text, patsy provides Q, which should work in this case: http://patsy.readthedocs.io/en/latest/builtins-reference.html#patsy.builtins.Q

Otherwise, if you already have the full design matrix with all the dummy variables, there is no reason to go through the formula interface. Using the direct interface with pandas DataFrames or NumPy arrays:

sm.OLS(y, x)

will ignore the names of the DataFrame columns, except for using them as labels in the summary table. Variable/column names are also one way of defining restrictions for t_test, but those also go through patsy, and I am not sure that works with special characters in the names.

Josef answered Oct 18 '25 15:10


