Any way in statsmodels to obtain final observations used in a regression?

Very frequently, regressions will drop some observations because they are missing one or more regressor fields. For example:

In [30]: len(df)  #df is our dataframe
Out[30]: 39243

In [31]: model = sm.OLS(df[var_name_y], df[var_names_x], missing="drop")
         result = model.fit()

In [32]: len(result.fittedvalues)
Out[32]: 38013

Here, we dropped 1230 observations, each of which was missing one or more regressors.

Is there any way to get access to the DataFrame that was actually used in the regression - that is, the smaller one of size 38013 that remains after the regression dropped the missing observations? This is available, for example, in the various SAS regression routines. I have been combing the API but am unable to locate anything. I need this data to produce various diagnostics based on the actual data used in the regression.

Of course, I could drop the correct rows myself before the regression, like this:

In [58]: len(df)
Out[58]: 39243

In [59]: df2 = df.dropna(subset=var_name_y + var_names_x)
In [60]: len(df2)
Out[60]: 38013

In [64]: model = sm.OLS(df2[var_name_y], df2[var_names_x], missing="drop")
         result = model.fit()
In [65]: len(result.fittedvalues)
Out[65]: 38013

Then the DataFrame that I feed to the regression is already the one with all the missing observations removed. But I was hoping to avoid that, particularly if I am working with a much larger dataset. Is there a better way to do this, ideally by programmatically accessing the post-regression DataFrame via the OLS model class or the RegressionResultsWrapper returned by fit?

sparc_spread asked Jan 29 '26 17:01



1 Answer

The cleanest way to get what you ask for seems to be model.data, as suggested by user333700 in the comments. Contrary to what user333700 states, though, model.data seems to have been deliberately exposed as a public interface by the statsmodels developers.

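In particular, model.data.missing_row_idx has provided what you ask for since 2012, so although undocumented, it seems relatively stable. It lists the positions of the rows that were dropped, from which the rows actually used can be recovered.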

Example:

In [3]: model = OLS(pd.DataFrame([[1, 2], [3, 4], [5, float('nan')]]), [2, 5, 4], missing='drop')

In [4]: model.data.missing_row_idx
Out[4]: [2]
Pietro Battiston answered Feb 01 '26 05:02
