I have a pretty good grasp of how to use pandas and numpy to vectorize operations on entire columns of data. However, I've come across a situation that I just can't seem to vectorize: when a calculation needs a value from the previous row to compute the current row, I have to fall back to a for loop.
Is it possible to vectorize this kind of thing? Here is a simple example of what I mean:
import numpy as np
import pandas as pd

# Test set of 20 random integers
df = pd.DataFrame({'base': [15, 16, 2, 16, 14,
                            1, 18, 18, 4, 7,
                            4, 18, 19, 13, 16,
                            11, 1, 8, 1, 9]})

# Empty array to hold the calculated values
calc_data = np.empty((20, 1))
period = 14

for idx, value in enumerate(df.base):
    # Seed the first element of the calculated array
    if idx == 0:
        calc_data[idx] = 5
    else:
        calc_data[idx] = (calc_data[idx - 1] * (period - 1) + df.base.iloc[idx]) / period

# Add the column to the dataframe
df['calculated'] = calc_data
print(df)
Output:
base calculated
0 15 5.000000
1 16 5.785714
2 2 5.515306
3 16 6.264213
4 14 6.816769
5 1 6.401286
6 18 7.229765
7 18 7.999068
8 4 7.713420
9 7 7.662461
10 4 7.400857
11 18 8.157939
12 19 8.932372
13 13 9.222916
14 16 9.706994
15 11 9.799351
16 1 9.170826
17 8 9.087196
18 1 8.509539
19 9 8.544572
One vectorized way (treating 'vectorized' as meaning 'avoiding Python-level loops') would be to treat it as a linear signal filter:
import numpy as np
import pandas as pd
import scipy.signal

def via_lfilter(arr):
    period = 14
    y0 = 5.0  # initial value
    # calc_data[idx] = (calc_data[idx - 1] * (period - 1) + df.base.iloc[idx]) / period
    b = [1.0 / period]                 # coefficients of the input terms
    a = [1.0, -(period - 1) / period]  # coefficients of the output terms
    zi = scipy.signal.lfiltic(b, a, [y0], x=arr[1::-1])
    y = np.zeros_like(arr)
    y[0] = y0
    y[1:], _ = scipy.signal.lfilter(b, a, arr[1:], axis=0, zi=zi)
    return y
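As a quick sanity check (a sketch of my own, not part of the original answer), the coefficients and initial conditions can be verified against the naive loop on a short input. Since `b` has no past-input coefficients, `lfiltic` only needs the past output `y0` to build the filter state:

```python
import numpy as np
import scipy.signal

period = 14
y0 = 5.0
b = [1.0 / period]
a = [1.0, -(period - 1) / period]

arr = np.array([15.0, 16.0, 2.0, 16.0, 14.0])

# Reference: the naive loop from the question
ref = np.empty_like(arr)
ref[0] = y0
for i in range(1, len(arr)):
    ref[i] = (ref[i - 1] * (period - 1) + arr[i]) / period

# lfiltic builds the internal filter state so the output continues from y[0] = y0
zi = scipy.signal.lfiltic(b, a, [y0])
y = np.empty_like(arr)
y[0] = y0
y[1:], _ = scipy.signal.lfilter(b, a, arr[1:], zi=zi)

assert np.allclose(y, ref)
```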
but in the real world, I'd just use numba, which is designed precisely to give us the performance benefits of vectorization without the headaches:
import numba

@numba.jit(nopython=True)
def via_numba(arr):
    calc_data = np.zeros_like(arr)
    period = 14
    calc_data[0] = 5.0  # initial value
    for idx in range(1, len(arr)):
        calc_data[idx] = (calc_data[idx - 1] * (period - 1) + arr[idx]) / period
    return calc_data
These give me:
In [238]: df["vect"] = via_lfilter(df.base.values.astype(float))
...: df["via_numba"] = via_numba(df.base.values.astype(float))
...:
...:
In [239]: df
Out[239]:
base calculated vect via_numba
0 15 5.000000 5.000000 5.000000
1 16 5.785714 5.785714 5.785714
2 2 5.515306 5.515306 5.515306
3 16 6.264213 6.264213 6.264213
4 14 6.816769 6.816769 6.816769
5 1 6.401286 6.401286 6.401286
6 18 7.229765 7.229765 7.229765
7 18 7.999068 7.999068 7.999068
8 4 7.713420 7.713420 7.713420
9 7 7.662461 7.662461 7.662461
10 4 7.400857 7.400857 7.400857
11 18 8.157939 8.157939 8.157939
12 19 8.932372 8.932372 8.932372
13 13 9.222916 9.222916 9.222916
14 16 9.706994 9.706994 9.706994
15 11 9.799351 9.799351 9.799351
16 1 9.170826 9.170826 9.170826
17 8 9.087196 9.087196 9.087196
18 1 8.509539 8.509539 8.509539
19 9 8.544572 8.544572 8.544572
and both perform well on larger frames:
In [240]: df = pd.DataFrame({"base": np.random.uniform(1, 100, 10**6)})
In [241]: %timeit via_lfilter(df.base.values.astype(float))
11.4 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [242]: %timeit via_numba(df.base.values.astype(float))
11 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
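As a cross-check (my own addition, not part of either answer): this recursion is exactly pandas' `adjust=False` exponentially weighted mean with `alpha = 1/period`, provided the first input is overwritten with the seed value:

```python
import pandas as pd

period = 14
base = pd.Series([15, 16, 2, 16, 14, 1, 18, 18, 4, 7,
                  4, 18, 19, 13, 16, 11, 1, 8, 1, 9], dtype=float)

# ewm(adjust=False) computes y[0] = x[0], y[n] = (1-alpha)*y[n-1] + alpha*x[n],
# which is the question's recursion with alpha = 1/period -- so overwriting
# the first input with the seed reproduces the 'calculated' column
seeded = base.copy()
seeded.iloc[0] = 5.0
result = seeded.ewm(alpha=1 / period, adjust=False).mean()
```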
The following is vectorized in the sense that all operations used are array operations at the pandas & numpy layer.
X = ((period-1)/period) ** np.arange(len(df)) / period
a = df.base.copy()
a.loc[0] = 5*period
df['calculated'] = a.expanding().apply(lambda x: np.sum(x * X[:len(x)][::-1]), raw=True)
A fast solution can be constructed by unrolling the recursion. Note that each element of the result follows a pattern:
0: 5
1: 5*(13/14) + 16*(1/14)
2: 5*(13/14)^2 + 16*(13/14)*(1/14) + 2*(1/14)
...
If the first element is multiplied by 14, we can rewrite these as
0: sum{(1/14) * [70]}
1: sum{(1/14) * [70*(13/14), 16]}
2: sum{(1/14) * [70*(13/14)^2, 16*(13/14), 2]}
...
Factoring the data values of df.base out of each term leaves a series of weights that can be summed:
0: (1/14) * [1]
1: (1/14) * [(13/14), 1]
2: (1/14) * [(13/14)^2, (13/14), 1]
...
Each of these weight series is a reversed slice of the following:
X = ((period-1)/period) ** np.arange(len(df)) / period
Also note that the first value of df.base is not used in constructing calculated; it is replaced by 5*period = 70. So the nth result is the sum of the expanding, modified df.base times the appropriate reversed slice of X:
a = df.base.copy()
a.loc[0] = 5*period
df['calculated'] = a.expanding().apply(lambda x: np.sum(x * X[:len(x)][::-1]), raw=True)
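For completeness, the same weights admit an O(n) closed form using cumulative sums instead of expanding().apply() (a sketch of my own, not from the answer above; note that 1/r**k grows without bound, so this variant overflows for series longer than a few thousand rows):

```python
import numpy as np

period = 14
r = (period - 1) / period  # decay factor, 13/14

base = np.array([15, 16, 2, 16, 14, 1, 18, 18, 4, 7,
                 4, 18, 19, 13, 16, 11, 1, 8, 1, 9], dtype=float)
x = base.copy()
x[0] = 5 * period          # same seeding trick as above

# y[k] = (1/period) * sum_{j<=k} x[j] * r**(k-j)
#      = (r**k / period) * cumsum(x[j] / r**j)
powers = r ** np.arange(len(x))
y = powers * np.cumsum(x / powers) / period
```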
# df outputs:
base calculated
0 15 5.000000
1 16 5.785714
2 2 5.515306
3 16 6.264213
4 14 6.816769
5 1 6.401286
6 18 7.229765
7 18 7.999068
8 4 7.713420
9 7 7.662461
10 4 7.400857
11 18 8.157939
12 19 8.932372
13 13 9.222916
14 16 9.706994
15 11 9.799351
16 1 9.170826
17 8 9.087196
18 1 8.509539
19 9 8.544572