Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R-style formulas when implementing a power (i.e. square) in a GLM misbehaves

In the python code below, the glm model specification does not include the third power in the in model1 but it does in model2:

model1 = glm(formula="wage ~ workhours + workhours**3           + C(gender)", data=df, family=sm.families.Gaussian())
model2 = glm(formula="wage ~ workhours + np.power(workhours, 3) + C(gender)", data=df, family=sm.families.Gaussian())

Is this a bug? According to the documentation **x raises something to the power 3.

like image 990
Xtiaan Avatar asked Dec 13 '25 13:12

Xtiaan


1 Answers

** in a formula is treated as a formula operator, not as regular exponentiation. (This is similar to how ^ works in an R formula.)

(a+b+c+d)**3 means that the model should include a, b, c, d, and all interactions between these variables up to 3rd order.

workhours**3 means that the model should include workhours and all interactions between... just workhours... up to 3rd order... but there are no such interaction terms, so it's equivalent to just workhours.

In contrast, np.power(workhours, 3) is treated as Python code, and computes the power you wanted.

statsmodels uses patsy for formula handling, so for full details on the formula language, you can check the patsy docs.

like image 102
user2357112 supports Monica Avatar answered Dec 16 '25 09:12

user2357112 supports Monica



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!