In the linear model $y = \beta_0 + \beta_1 x_i + \beta_2 x_j + \beta_3 x_k + \epsilon$, what values for $i, j, k \in [1, 100]$ result in the model with the highest R-squared?
The data set consists of 100 independent variables and one dependent variable. Each variable has 50 observations.
My only guess is to loop through all possible combinations of three variables and compare R-squared for each combination. The way I have done it with Python is:
import itertools as itr
import time as t

import pandas as pd
from sklearn import linear_model as lm

start = t.time()

# linear regression model
LR = lm.LinearRegression()

# import data: column 0 is the dependent variable, columns 1-100 the independent ones
data = pd.read_csv('csv_file')
target = data.iloc[:, 0]

hi_R2 = 0
indices = None
# loop over all possible combinations of three variables
for comb in itr.combinations(range(1, 101), 3):
    variables = data.iloc[:, list(comb)]  # select the three columns by position
    R2 = LR.fit(variables, target).score(variables, target)
    if R2 > hi_R2:
        hi_R2 = R2
        indices = comb

elapsed = (t.time() - start) / 60
print('Variables: {}\nR2 = {:.2f}\nTime: {:.1f} mins'.format(indices, hi_R2, elapsed))
It took 4.3 minutes to complete. I believe this method will not be efficient for data sets with thousands of observations for each variable. What method would you suggest instead?
Thank you.
Exhaustive search is going to be the slowest way of doing this.
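To make the cost concrete, here is a small sketch that counts how many regressions an exhaustive search has to fit. The figure for three variables matches the loop in the question; the larger subset sizes are only there to illustrate how quickly the count grows and are not part of the original problem.

```python
from math import comb

n_vars = 100  # number of candidate independent variables, as in the question

# Number of distinct models an exhaustive search must fit for each subset size.
for k in (3, 4, 5):
    print(k, comb(n_vars, k))
# 3 161700
# 4 3921225
# 5 75287520
```

Each added variable multiplies the work by roughly an order of magnitude, on top of the per-fit cost that grows with the number of observations.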
The fastest way is the one mentioned in one of the comments: pre-specify your model based on theory, intuition, or logic, choosing a set of variables that you hypothesize will be good predictors of your outcome.
The difference between the two extremes is that exhaustive search may leave you with a model that doesn't make sense, because it will use whatever variables it has access to, even if they are completely unrelated to your question of interest.
If, however, you don't want to specify a model and still want to use an automated technique to build the "best" model, a middle ground might be something like stepwise regression.
There are a few different ways of doing this (e.g. forward selection or backward elimination). In forward selection, for example, you add one variable at a time and test whether it helps. If the variable improves model fit (judged either through the significance of its regression coefficient or through the R2 of the model), you keep it and add another. If it doesn't aid prediction, you throw it away. Repeat this process until you've found your best predictors.
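A minimal sketch of forward selection, assuming R2 is used as the fit criterion and that exactly three variables are wanted, as in the question. The helper names `r_squared` and `forward_select` and the synthetic data are made up for illustration; the fit uses NumPy's least squares rather than the `LinearRegression` object from the question. Note that the greedy result is not guaranteed to match the best triple found by exhaustive search.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def forward_select(X, y, n_keep=3):
    """Greedily add the column that gives the highest R-squared, n_keep times."""
    selected = []
    best_r2 = 0.0
    for _ in range(n_keep):
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        # Score every remaining column when added to the current selection.
        scores = {j: r_squared(X[:, selected + [j]], y) for j in candidates}
        best_j = max(scores, key=scores.get)
        selected.append(best_j)
        best_r2 = scores[best_j]
    return selected, best_r2

# Synthetic stand-in for the 50 x 100 data set (hypothetical data, not the asker's file).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))
y = 2.0 * X[:, 4] - 3.0 * X[:, 17] + 1.5 * X[:, 42] + rng.normal(scale=0.1, size=50)

selected, r2 = forward_select(X, y)
print(sorted(selected), round(r2, 3))
```

With 100 candidates and three picks, this fits 100 + 99 + 98 = 297 models rather than the 161,700 of the exhaustive loop. A significance test on the new coefficient, or a stopping rule based on the gain in R2, could replace the fixed `n_keep`.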