Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ordinary Least Squares Regression for multiple columns in Pandas Dataframe

I'm trying to find a way to iterate code for a linear regression over many many columns, upwards of Z3. Here is a snippet of the dataframe called df1

    Time    A1      A2      A3      B1      B2      B3
1   1.00    6.64    6.82    6.79    6.70    6.95    7.02
2   2.00    6.70    6.86    6.92    NaN     NaN     NaN
3   3.00    NaN     NaN     NaN     7.07    7.27    7.40
4   4.00    7.15    7.26    7.26    7.19    NaN     NaN
5   5.00    NaN     NaN     NaN     NaN     7.40    7.51
6   5.50    7.44    7.63    7.58    7.54    NaN     NaN 
7   6.00    7.62    7.86    7.71    NaN     NaN     NaN

This code returns the slope coefficient of a linear regression for the very ONE column only and concatenates the value to a numpy series called series, here is what it looks like for extracting the slope for the first column:

from sklearn.linear_model import LinearRegression

series = np.array([]) #blank list to append result

df2 = df1[~np.isnan(df1['A1'])] #removes NaN values for each column to apply sklearn function
df3 = df2[['Time','A1']]
npMatrix = np.matrix(df3)
X, Y = npMatrix[:,0], npMatrix[:,1]
slope = LinearRegression().fit(X,Y) # either this or the next line
m = slope.coef_[0]

series= np.concatenate((SGR_trips, m), axis = 0)

As it stands now, I am using this slice of code, replacing "A1" with a new column name all the way up to "Z3" and this is extremely inefficient. I know there are many easy way to do this with some modules but I have the drawback of having all these intermediate NaN values in the timeseries so it seems like I'm limited to this method, or something like it.

I tried using a for loop such as:

for col in df1.columns:

and replacing 'A1', for example with col in the code, but this does not seem to be working.

Is there any way I can do this more efficiently?

Thank you!

like image 573
RageQuilt Avatar asked Oct 19 '25 12:10

RageQuilt


2 Answers

One liner (or three)

time = df[['Time']]
pd.DataFrame(np.linalg.pinv(time.T.dot(time)).dot(time.T).dot(df.fillna(0)),
             ['Slope'], df.columns)

enter image description here

Broken down with a bit of explanation

Using the closed form of OLS

enter image description here

In this case X is time where we define time as df[['Time']]. I used the double brackets to preserve the dataframe and its two dimensions. If I'd done single brackets, I'd have gotten a series and its one dimension. Then the dot products aren't as pretty.

enter image description here

is np.linalg.pinv(time.T.dot(time)).dot(time.T)

Y is df.fillna(0). Yes, we could have done one column at a time, but why when we could do it altogether. You have to deal with the NaNs. How would you imagine dealing with them? Only doing it over the time you had data? That is equivalent to placing zeroes in the NaN spots. So, I did.

Finally, I use pd.DataFrame(stuff, ['Slope'], df.columns) to contain all slopes in one place with the original labels.

Note that I calculated the slope of the regression for Time against itself. Why not? It was there. Its value is 1.0. Great! I probably did it right!

like image 99
piRSquared Avatar answered Oct 21 '25 02:10

piRSquared


Looping is a decent strategy for a modest number (say, fewer than thousands) of columns. Without seeing your implementation, I can't say what's wrong, but here's my version, which works:

slopes = []

for c in cols:
    if c=="Time": break
    mask = ~np.isnan(df1[c])
    x = np.atleast_2d(df1.Time[mask].values).T
    y = np.atleast_2d(df1[c][mask].values).T
    reg = LinearRegression().fit(x, y)
    slopes.append(reg.coef_[0])

I've simplified your code a bit to avoid creating so many temporary DataFrame objects, but it should work fine your way too.

like image 28
Nat Wilson Avatar answered Oct 21 '25 02:10

Nat Wilson



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!