Any way in statsmodels to obtain final observations used in a regression?

Question

Very frequently, regressions will drop some observations because they are missing one or more regressor fields. For example:

In [30]: len(df)  #df is our dataframe
Out[30]: 39243

In [31]: model = sm.OLS(df[var_name_y], df[var_names_x], missing="drop")
         result = model.fit()

In [32]: len(result.fittedvalues)
Out[32]: 38013

Here, we dropped 1230 observations, each of which was missing one or more regressors.

Is there any way to get access to the DataFrame that was actually used in the regression - that is, the smaller one of size 38013 that remains after the regression dropped the missing observations? This is available, for example, in the various SAS regression routines. I have been combing the API but am unable to locate anything. I need this data to produce various diagnostics based on the actual data used in the regression.

Of course, I could drop the correct rows myself before the regression, like this:

In [58]: len(df)
Out[58]: 39243

In [59]: df2 = df.dropna(subset=var_name_y + var_names_x)
In [60]: len(df2)
Out[60]: 38013

In [64]: model = sm.OLS(df2[var_name_y], df2[var_names_x], missing="drop")
         result = model.fit()
In [65]: len(result.fittedvalues)
Out[65]: 38013

Then the DataFrame that I feed to the regression is already the one with all the missing observations removed. But I was hoping to avoid that, particularly if I am working with a much larger dataset. Is there a better way to do this, particularly programmatically accessing the post-regression DataFrame via the OLS model class or the RegressionResultsWrapper output of the fit?

Pietro Battiston · Accepted Answer

The cleanest way to get what you ask for seems to be model.data, as suggested by user333700 in the comments. Contrary to what user333700 states, however, model.data seems to have been deliberately exposed as a public interface by the statsmodels developers.

In particular, model.data.missing_row_idx has provided what you ask for since 2012, so although it is undocumented, it seems relatively stable. It lists the positions of the rows that were dropped.

Example:

In [3]: model = OLS(pd.DataFrame([[1, 2], [3, 4], [5, float('nan')]]), [2, 5, 4], missing='drop')

In [4]: model.data.missing_row_idx
Out[4]: [2]

Tags: python, pandas, statsmodels

sparc_spread
