Suppose my data looks like this:
data = {
    'value': [1, 9, 6, 7, 3, 2, 4, 5, 1, 9]
}
For each row, I would like to find the distance to the latest previous element larger than the current one.
So, my expected output is:
[None, 0, 1, 2, 1, 1, 3, 4, 1, 0]
- 1 has no previous element, so I want None in the result.
- 9 is at least as large as all its previous elements, so I want 0 in the result.
- 6 has its previous element 9, which is larger than it. The distance between them is 1, so I want 1 in the result here.

I'm aware that I can do this in a loop in Python (or in C / Rust if I write an extension).
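For concreteness, the loop I'd like to avoid looks roughly like this (a minimal sketch; `prev_larger_distance` is just an illustrative name):

```python
def prev_larger_distance(values):
    """For each element, the distance back to the latest previous
    strictly larger element; 0 if none exists, None for the first row."""
    out = []
    for i, v in enumerate(values):
        if i == 0:
            out.append(None)
            continue
        dist = 0  # stays 0 when no previous element is larger
        for j in range(i - 1, -1, -1):  # scan backwards
            if values[j] > v:
                dist = i - j
                break
        out.append(dist)
    return out

print(prev_larger_distance([1, 9, 6, 7, 3, 2, 4, 5, 1, 9]))
# → [None, 0, 1, 2, 1, 1, 3, 4, 1, 0]
```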
My question: is it possible to solve this using entirely dataframe operations? pandas or Polars, either is fine. But only dataframe operations.
So, none of the following please:

- apply
- map_elements
- map_rows
- iter_rows

It's hard to vectorize this kind of problem, but you can use the numba module to speed up the task. The problem also parallelizes very easily:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def get_values(values):
    out = np.zeros_like(values, dtype=np.float64)
    for i in prange(len(values)):
        idx = np.int64(i)
        v = values[idx]
        # walk backwards until a strictly larger element is found
        while idx > -1 and values[idx] <= v:
            idx -= 1
        if idx > -1:
            out[i] = i - idx  # distance; otherwise stays 0
    out[0] = np.nan  # the first row has no previous element
    return out
import pandas as pd

data = {
    "value": [1, 9, 6, 7, 3, 2, 4, 5, 1, 9],
    "out": [None, 0, 1, 2, 1, 1, 3, 4, 1, 0],
}
df = pd.DataFrame(data)
df["out2"] = get_values(df["value"].values)
print(df)
Prints:
value out out2
0 1 NaN NaN
1 9 0.0 0.0
2 6 1.0 1.0
3 7 2.0 2.0
4 3 1.0 1.0
5 2 1.0 1.0
6 4 3.0 3.0
7 5 4.0 4.0
8 1 1.0 1.0
9 9 0.0 0.0
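As for the "dataframe operations only" part of the question: it is possible with a cross join, but that materializes O(n²) row pairs, so treat this as a sketch for small frames rather than a practical answer:

```python
import numpy as np
import pandas as pd

values = [1, 9, 6, 7, 3, 2, 4, 5, 1, 9]
frame = pd.DataFrame({"value": values}).reset_index()  # 'index' = row number

# Pair every row with every other row (O(n^2) rows!)
pairs = frame.merge(frame, how="cross", suffixes=("", "_prev"))

# Keep only earlier rows whose value is strictly larger
mask = (pairs["index_prev"] < pairs["index"]) & (pairs["value_prev"] > pairs["value"])
nearest = pairs[mask].groupby("index")["index_prev"].max()

# Distance to that row; 0 where no larger previous exists, NaN for row 0
out = (nearest.index.to_series() - nearest).reindex(range(len(values))).fillna(0.0)
out.iloc[0] = np.nan
print(out.tolist())
```
The memory cost of the cross join is the reason a compiled loop (numba above, or C / Rust) is usually the better trade-off here.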
Benchmark (with 1_000_000 items from 1-100):
from timeit import timeit

data = {
    "value": np.random.randint(1, 100, size=1_000_000),
}
df = pd.DataFrame(data)

t = timeit('df["out"] = get_values(df["value"].values)', globals=globals(), number=1)
print(t)
Prints on my machine (AMD 5700x):
0.3559090679627843