I have a pretty good grasp of how to use pandas and numpy to vectorize operations on entire columns of data. However, I've come across a situation that I just can't seem to vectorize: when a calculation needs a value from the previous row to compute the current row, I have to fall back to a for loop.
Is it possible to vectorize this kind of thing? Here is a simple example of what I mean:
import numpy as np
import pandas as pd

# Test set of 20 random integers
df = pd.DataFrame({'base': [15, 16, 2, 16, 14,
                            1, 18, 18, 4, 7,
                            4, 18, 19, 13, 16,
                            11, 1, 8, 1, 9]})

# Empty array to hold the calculated values
calc_data = np.empty((20, 1))
period = 14

for idx, value in enumerate(df.base):
    # Seed the first element of the calculated array
    if idx == 0:
        calc_data[idx] = 5
    else:
        calc_data[idx] = (calc_data[idx - 1] * (period - 1) + df.base.iloc[idx]) / period

# Add the column to the dataframe
df['calculated'] = calc_data
print(df)
Output:
base calculated
0 15 5.000000
1 16 5.785714
2 2 5.515306
3 16 6.264213
4 14 6.816769
5 1 6.401286
6 18 7.229765
7 18 7.999068
8 4 7.713420
9 7 7.662461
10 4 7.400857
11 18 8.157939
12 19 8.932372
13 13 9.222916
14 16 9.706994
15 11 9.799351
16 1 9.170826
17 8 9.087196
18 1 8.509539
19 9 8.544572
One vectorized way (treating 'vectorized' as meaning 'avoiding Python-level loops') would be to treat it as a linear signal filter:
import numpy as np
import pandas as pd
import scipy.signal

def via_lfilter(arr):
    period = 14
    y0 = 5.0  # initial value
    # calc_data[idx] = (calc_data[idx - 1] * (period - 1) + df.base.iloc[idx]) / period
    b = [1.0 / period]                 # coefficients of the input terms
    a = [1.0, -(period - 1) / period]  # coefficients of the output terms
    zi = scipy.signal.lfiltic(b, a, [y0], x=arr[1::-1])
    y = np.zeros_like(arr)
    y[0] = y0
    y[1:], _ = scipy.signal.lfilter(b, a, arr[1:], axis=0, zi=zi)
    return y
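As a quick sanity check (a sketch of my own, not part of the original answer), the coefficients and initial conditions can be verified against the naive loop on a short input. Since `b` has no past-input coefficients, `lfiltic` only needs the past output `y0` to build the filter state:

```python
import numpy as np
import scipy.signal

period = 14
y0 = 5.0
b = [1.0 / period]
a = [1.0, -(period - 1) / period]

arr = np.array([15.0, 16.0, 2.0, 16.0, 14.0])

# Reference: the naive loop from the question
ref = np.empty_like(arr)
ref[0] = y0
for i in range(1, len(arr)):
    ref[i] = (ref[i - 1] * (period - 1) + arr[i]) / period

# lfiltic builds the internal filter state so the output continues from y[0] = y0
zi = scipy.signal.lfiltic(b, a, [y0])
y = np.empty_like(arr)
y[0] = y0
y[1:], _ = scipy.signal.lfilter(b, a, arr[1:], zi=zi)

assert np.allclose(y, ref)
```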
but in the real world, I'd just use numba, which is designed precisely to give us the performance benefits of vectorization without the headaches:
import numba

@numba.jit(nopython=True)
def via_numba(arr):
    calc_data = np.zeros_like(arr)
    period = 14
    calc_data[0] = 5.0  # initial value
    for idx in range(1, len(arr)):
        calc_data[idx] = (calc_data[idx - 1] * (period - 1) + arr[idx]) / period
    return calc_data
These give me:
In [238]: df["vect"] = via_lfilter(df.base.values.astype(float))
...: df["via_numba"] = via_numba(df.base.values.astype(float))
...:
...:
In [239]: df
Out[239]:
base calculated vect via_numba
0 15 5.000000 5.000000 5.000000
1 16 5.785714 5.785714 5.785714
2 2 5.515306 5.515306 5.515306
3 16 6.264213 6.264213 6.264213
4 14 6.816769 6.816769 6.816769
5 1 6.401286 6.401286 6.401286
6 18 7.229765 7.229765 7.229765
7 18 7.999068 7.999068 7.999068
8 4 7.713420 7.713420 7.713420
9 7 7.662461 7.662461 7.662461
10 4 7.400857 7.400857 7.400857
11 18 8.157939 8.157939 8.157939
12 19 8.932372 8.932372 8.932372
13 13 9.222916 9.222916 9.222916
14 16 9.706994 9.706994 9.706994
15 11 9.799351 9.799351 9.799351
16 1 9.170826 9.170826 9.170826
17 8 9.087196 9.087196 9.087196
18 1 8.509539 8.509539 8.509539
19 9 8.544572 8.544572 8.544572
and both perform well on larger frames:
In [240]: df = pd.DataFrame({"base": np.random.uniform(1, 100, 10**6)})
In [241]: %timeit via_lfilter(df.base.values.astype(float))
11.4 ms ± 49.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [242]: %timeit via_numba(df.base.values.astype(float))
11 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
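As a cross-check (my own addition, not part of either answer): this recursion is exactly pandas' `adjust=False` exponentially weighted mean with `alpha = 1/period`, provided the first input is overwritten with the seed value:

```python
import pandas as pd

period = 14
base = pd.Series([15, 16, 2, 16, 14, 1, 18, 18, 4, 7,
                  4, 18, 19, 13, 16, 11, 1, 8, 1, 9], dtype=float)

# ewm(adjust=False) computes y[0] = x[0], y[n] = (1-alpha)*y[n-1] + alpha*x[n],
# which is the question's recursion with alpha = 1/period -- so overwriting
# the first input with the seed reproduces the 'calculated' column
seeded = base.copy()
seeded.iloc[0] = 5.0
result = seeded.ewm(alpha=1 / period, adjust=False).mean()
```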
The following is vectorized in the sense that all operations used are array operations at the pandas & numpy layer.
X = ((period-1)/period) ** np.arange(len(df)) / period
a = df.base.copy()
a.loc[0] = 5*period
df['calculated'] = a.expanding().apply(lambda x: np.sum(x * X[:len(x)][::-1]), raw=True)
A fast solution can be constructed by unrolling the recursion. Note that each element of the result follows a pattern:
0: 5
1: 5*(13/14) + 16*(1/14)
2: 5*(13/14)^2 + 16*(13/14)*(1/14) + 2*(1/14)
...
If the first element is multiplied by 14, we can rewrite these as
0: sum{(1/14) * [70]}
1: sum{(1/14) * [70*(13/14), 16]}
2: sum{(1/14) * [70*(13/14)^2, 16*(13/14), 2]}
...
Factoring the data values of df.base out of each term leaves a series of weights that can be summed:
0: (1/14) * [1]
1: (1/14) * [(13/14), 1]
2: (1/14) * [(13/14)^2, (13/14), 1]
...
Each of these weight series is a reversed slice of the following:
X = ((period-1)/period) ** np.arange(len(df)) / period
Also note that the first value of df.base is not used in constructing calculated; it is replaced by 5*period = 70. So the nth result is the sum of the expanding, modified df.base times the appropriate reversed slice of X:
a = df.base.copy()
a.loc[0] = 5*period
df['calculated'] = a.expanding().apply(lambda x: np.sum(x * X[:len(x)][::-1]), raw=True)
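For completeness, the same weights admit an O(n) closed form using cumulative sums instead of expanding().apply() (a sketch of my own, not from the answer above; note that 1/r**k grows without bound, so this variant overflows for series longer than a few thousand rows):

```python
import numpy as np

period = 14
r = (period - 1) / period  # decay factor, 13/14

base = np.array([15, 16, 2, 16, 14, 1, 18, 18, 4, 7,
                 4, 18, 19, 13, 16, 11, 1, 8, 1, 9], dtype=float)
x = base.copy()
x[0] = 5 * period          # same seeding trick as above

# y[k] = (1/period) * sum_{j<=k} x[j] * r**(k-j)
#      = (r**k / period) * cumsum(x[j] / r**j)
powers = r ** np.arange(len(x))
y = powers * np.cumsum(x / powers) / period
```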
# df outputs:
base calculated
0 15 5.000000
1 16 5.785714
2 2 5.515306
3 16 6.264213
4 14 6.816769
5 1 6.401286
6 18 7.229765
7 18 7.999068
8 4 7.713420
9 7 7.662461
10 4 7.400857
11 18 8.157939
12 19 8.932372
13 13 9.222916
14 16 9.706994
15 11 9.799351
16 1 9.170826
17 8 9.087196
18 1 8.509539
19 9 8.544572