Very frequently, regressions will drop some observations because they are missing one or more regressor fields. For example:
In [30]: len(df) #df is our dataframe
Out[30]: 39243
In [31]: model = sm.OLS(df[var_name_y], df[var_names_x], missing="drop")
result = model.fit()
In [32]: len(result.fittedvalues)
Out[32]: 38013
Here, we dropped 1230 observations, each of which was missing one or more regressors.
Is there any way to get access to the DataFrame that was actually used in the regression - that is, the smaller one of size 38013 that remains after the regression dropped the missing observations? This is available, for example, in the various SAS regression routines. I have been combing the API but am unable to locate anything. I need this data to produce various diagnostics based on the actual data used in the regression.
Of course, I could drop the correct rows myself before the regression, like this:
In [58]: len(df)
Out[58]: 39243
In [59]: df2 = df.dropna(subset=var_name_y + var_names_x)
In [60]: len(df2)
Out[60]: 38013
In [64]: model = sm.OLS(df2[var_name_y], df2[var_names_x], missing="drop")
result = model.fit()
In [65]: len(result.fittedvalues)
Out[65]: 38013
Then the DataFrame that I feed to the regression is already the one with all the missing observations removed. But I was hoping to avoid that, particularly if I am working with a much larger dataset. Is there a better way to do this, particularly programmatically accessing the post-regression DataFrame via the OLS model class or the RegressionResultsWrapper output of the fit?
The cleanest way to get what you ask for seems to be model.data, as suggested by user333700 in the comments. Contrary to what user333700 states, however, model.data seems to have been deliberately exposed as a public interface by the statsmodels developers.
In particular, model.data.missing_row_idx holds the positions of the rows that were dropped, and it has been there since 2012, so although undocumented, it seems relatively stable.
Example:
In [3]: model = OLS(pd.DataFrame([[1, 2], [3, 4], [5, float('nan')]]), [2, 5, 4], missing='drop')
In [4]: model.data.missing_row_idx
Out[4]: [2]