
Is brute force the best option for multiple regression using Python?

In the linear model 𝑦 = 𝑎0 + 𝑎1 × 𝑥i + 𝑎2 × 𝑥j + 𝑎3 × 𝑥k + 𝜖, which values of i, j, k ∈ [1, 100] result in the model with the highest R-squared?

The data set consists of 100 independent variables and one dependent variable. Each variable has 50 observations.

My only guess is to loop through all possible combinations of three variables and compare R-squared for each combination. The way I have done it with Python is:

import itertools as itr
import pandas as pd
import time as t
from sklearn import linear_model as lm

start = t.time()

#linear regression model 
LR = lm.LinearRegression()

#import data
data = pd.read_csv('csv_file')

#all possible combinations of three variables
combs = list(itr.combinations(range(1, 101), 3))

target = data.iloc[:,0]
hi_R2 = 0

for comb in combs:
    # pass the column positions as a list, which iloc handles unambiguously
    variables = data.iloc[:, list(comb)]
    R2 = LR.fit(variables, target).score(variables, target)
    if R2 > hi_R2:
        hi_R2 = R2
        indices = comb
end = t.time()
time = float((end-start)/60)

print('Variables: {}\nR2 = {:.2f}\nTime: {:.1f} mins'.format(indices, hi_R2, time))

It took 4.3 minutes to complete. I believe this method is not efficient for data sets with thousands of observations for each variable. What method would you suggest instead?

Thank you.

antdro asked May 21 '26 19:05


1 Answer

Exhaustive search is going to be the slowest way of doing this.

The fastest way to do this is mentioned in one of the comments. You should pre-specify your model based on theory/intuition/logic and come up with a set of variables that you hypothesize will be good predictors of your outcome.

The difference between the two extremes is that exhaustive search may leave you with a model that doesn't make sense, because it will use whatever variables it has access to, even if they are completely unrelated to your question of interest.

If, however, you don't want to specify a model and still want to use an automated technique to build the "best" model, a middle ground might be something like stepwise regression.

There are a few different ways of doing this (e.g. forward selection or backward elimination). In forward selection, for example, you add one variable at a time and test whether it helps. If the variable improves model fit (judged either by the significance of its regression coefficient or by the R2 of the model), you keep it and add another. If it doesn't aid prediction, you throw it away. Repeat this process until you've found your best predictors.
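To make the forward selection idea concrete, here is a minimal sketch of one possible variant: at each step it greedily adds the variable that raises R2 the most, and it stops after three variables. The function names (`r_squared`, `forward_select`), the fixed stopping rule, and the synthetic data are my own assumptions for illustration; they are not taken from the question. It uses plain NumPy least squares rather than scikit-learn so that it runs on its own.

```python
import numpy as np


def r_squared(X, y):
    """R-squared of an ordinary least-squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def forward_select(X, y, n_select=3):
    """Greedily add the column that raises R-squared the most, n_select times."""
    selected = []
    remaining = list(range(X.shape[1]))
    best_r2 = 0.0
    for _ in range(n_select):
        # Score every remaining column when added to the current selection.
        scores = [(r_squared(X[:, selected + [j]], y), j) for j in remaining]
        best_r2, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected, best_r2


# Synthetic data (an assumption) shaped like the question:
# 50 observations, 100 predictors, only columns 3, 17 and 42 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 100))
y = (1.0 + 2.0 * X[:, 3] - 3.0 * X[:, 17] + 1.5 * X[:, 42]
     + rng.normal(scale=0.1, size=50))

selected, r2 = forward_select(X, y)
print(sorted(selected), round(r2, 3))
```

With 100 candidate variables and three slots, this loop fits 100 + 99 + 98 = 297 models, whereas the exhaustive search over all triples fits 161,700. The trade-off is that a greedy search is not guaranteed to find the triple with the highest R2, and picking variables by in-sample R2 alone can still select predictors that fit noise.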

Simon answered May 24 '26 09:05