
pandas or Polars: find index of previous element larger than current one

Suppose my data looks like this:

data = {
    'value': [1,9,6,7,3, 2,4,5,1,9]
}

For each row, I would like to find how many rows back the most recent previous element larger than the current one is.

So, my expected output is:

[None, 0, 1, 2, 1, 1, 3, 4, 1, 0]
  • the first element, 1, has no previous element, so I want None in the result
  • the next element, 9, is at least as large as all its previous elements, so I want 0 in the result
  • the next element, 6, is preceded by 9, which is larger. The distance between them is 1, so I want 1 in the result here.
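
To pin down the expected output, here is a minimal plain-Python reference. It is only a sketch for checking results, since it is exactly the kind of row-by-row loop I want to avoid, and the function name is made up for illustration:

```python
def distance_to_previous_larger(values):
    """Reference loop: distance back to the nearest previous strictly larger element."""
    result = []
    for i, v in enumerate(values):
        if i == 0:
            # the first element has no previous element
            result.append(None)
            continue
        distance = 0  # 0 means no previous element is larger
        for j in range(i - 1, -1, -1):
            if values[j] > v:
                distance = i - j
                break
        result.append(distance)
    return result


print(distance_to_previous_larger([1, 9, 6, 7, 3, 2, 4, 5, 1, 9]))
# [None, 0, 1, 2, 1, 1, 3, 4, 1, 0]
```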

I'm aware that I can do this in a loop in Python (or in C / Rust if I write an extension).

My question: is it possible to solve this using entirely dataframe operations? pandas or Polars, either is fine. But only dataframe operations.

So, none of the following please:

  • apply
  • map_elements
  • map_rows
  • iter_rows
  • Python for loops which loop over the rows and extract elements one-by-one from the dataframes
ignoring_gravity asked Dec 06 '25 08:12


1 Answer

It's hard to vectorize these kinds of problems, but you can use the numba module to speed up the task. Each row's result depends only on the input values, not on other results, so the problem can also be parallelized very easily:

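import numpy as np
import pandas as pd
from numba import njit, prange

@njit(parallel=True)
def get_values(values):
    # float64 so that the first row can hold NaN
    out = np.zeros_like(values, dtype=np.float64)

    # every row is independent, so prange can spread rows across threads
    for i in prange(len(values)):
        idx = np.int64(i)
        v = values[idx]

        # walk backwards until a strictly larger element is found
        while idx > -1 and values[idx] <= v:
            idx -= 1

        # idx == -1 means no larger element exists; out[i] stays 0
        if idx > -1:
            out[i] = i - idx

    out[0] = np.nan
    return out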

data = {
    "value": [1, 9, 6, 7, 3, 2, 4, 5, 1, 9],
    "out": [None, 0, 1, 2, 1, 1, 3, 4, 1, 0],
}
df = pd.DataFrame(data)

df["out2"] = get_values(df["value"].values)
print(df)

Prints:

   value  out  out2
0      1  NaN   NaN
1      9  0.0   0.0
2      6  1.0   1.0
3      7  2.0   2.0
4      3  1.0   1.0
5      2  1.0   1.0
6      4  3.0   3.0
7      5  4.0   4.0
8      1  1.0   1.0
9      9  0.0   0.0
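
If array operations only are a hard requirement, one possible alternative is "pointer jumping" in NumPy. This is a sketch under assumptions, not the numba approach above: the function name `prev_greater_distance` is hypothetical, and the `while` loop iterates over rounds rather than rows. Every row keeps a candidate index `p[i]`; when the candidate's value is not larger, the row jumps to the candidate's own candidate, which is safe because everything skipped is no larger than the current value. How many rounds are needed depends on the data, so no performance claim is made here.

```python
import numpy as np


def prev_greater_distance(values):
    """Distance to the nearest previous strictly larger element (0 if none, NaN for row 0)."""
    values = np.asarray(values)
    n = len(values)
    idx = np.arange(n)
    p = idx - 1  # candidate index of the previous larger element; -1 means "none"

    while True:
        safe = np.maximum(p, 0)  # avoid indexing with -1
        # rows whose current candidate exists but is not strictly larger
        need = (p >= 0) & (values[safe] <= values)
        if not need.any():
            break
        # jump to the candidate's candidate (right-hand side uses the old p)
        p[need] = p[p[need]]

    out = np.where(p >= 0, idx - p, 0).astype(np.float64)
    if n:
        out[0] = np.nan
    return out


print(prev_greater_distance([1, 9, 6, 7, 3, 2, 4, 5, 1, 9]))
```

On the sample data this should agree with the `out` column above.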

Benchmark (with 1_000_000 random integers from 1 to 99):

from timeit import timeit

data = {
    "value": np.random.randint(1, 100, size=1_000_000),
}
df = pd.DataFrame(data)

t = timeit('df["out"] = get_values(df["value"].values)', globals=globals(), number=1)
print(t)

Prints on my machine (AMD 5700x):

0.3559090679627843
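
The backward scan in `get_values` can revisit the same elements for many rows. A well-known alternative is a monotonic stack, which does amortized constant work per row but is sequential. The version below is a plain-Python sketch with a made-up function name, shown only to illustrate the idea; whether it compiles unchanged under numba, or beats the parallel scan, is not tested here.

```python
def distance_to_previous_larger_stack(values):
    """Monotonic-stack sketch: each index is pushed and popped at most once."""
    result = []
    stack = []  # indices whose values are strictly decreasing from bottom to top
    for i, v in enumerate(values):
        # discard candidates that are not strictly larger than the current value
        while stack and values[stack[-1]] <= v:
            stack.pop()
        if i == 0:
            result.append(None)           # no previous element at all
        elif stack:
            result.append(i - stack[-1])  # nearest previous larger element
        else:
            result.append(0)              # nothing before is larger
        stack.append(i)
    return result


print(distance_to_previous_larger_stack([1, 9, 6, 7, 3, 2, 4, 5, 1, 9]))
# [None, 0, 1, 2, 1, 1, 3, 4, 1, 0]
```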
Andrej Kesely answered Dec 08 '25 22:12


