Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas sliding window over a dataframe column

Tags:

python

pandas

I am new to pandas. I am using pandas to read a CSV file of timestamped records, into a dataframe. The data has the following columns:

timestamp COLUMN_A COLUMN_B COLUMN_C

After reading the data into the dataframe, I want to be able to run a windowing function on COLUMN_C; The function should return the timestamped values of the column.

I have written something that works for iterables:

import collections
import itertools


def sliding_window_iter(iterable, size):
    """Iterate through iterable using a sliding window of several elements.

    Creates an iterable where each element is a tuple of `size`
    consecutive elements from `iterable`, advancing by 1 element each
    time. For example:

    >>> list(sliding_window_iter([1, 2, 3, 4], 2))
    [(1, 2), (2, 3), (3, 4)]
    """
    iterable = iter(iterable)
    window = collections.deque(
        itertools.islice(iterable, size-1),
        maxlen=size
    )
    for item in iterable:
        window.append(item)
        yield tuple(window)

How do I modify this to work on the column of a dataframe?

like image 250
Homunculus Reticulli Avatar asked Sep 17 '25 09:09

Homunculus Reticulli


1 Answers

It's simpler to successively slice through the dataframe. Since you want overlapping windows [(1, 2), (2, 3), (3, 4), ...], you could write it like this:

def sliding_window_iter(series, size):
    """series is a column of a dataframe"""
    for start_row in range(len(series) - size + 1):
        yield series[start_row:start_row + size]

Usage:

df = pd.DataFrame({'A': list(range(100, 501, 100)),
                   'B': list(range(-20, -15)),
                   'C': [0, 1, 2, None, 4]},
                  index=pd.date_range('2021-01-01', periods=5))

list(sliding_window_iter(df['C'], 2))

Output:

[2021-01-01    0.0
 2021-01-02    1.0
 Freq: D, Name: C, dtype: float64,
 2021-01-02    1.0
 2021-01-03    2.0
 Freq: D, Name: C, dtype: float64,
 2021-01-03    2.0
 2021-01-04    NaN
 Freq: D, Name: C, dtype: float64,
 2021-01-04    NaN
 2021-01-05    4.0
 Freq: D, Name: C, dtype: float64]

Also works if you pass in multiple columns:

list(sliding_window_iter(df.loc[:, ['A', 'C']], 2))

#output:
[              A    C
 2021-01-01  100  0.0
 2021-01-02  200  1.0,
               A    C
 2021-01-02  200  1.0
 2021-01-03  300  2.0,
               A    C
 2021-01-03  300  2.0
 2021-01-04  400  NaN,
               A    C
 2021-01-04  400  NaN
 2021-01-05  500  4.0]
like image 156
aneroid Avatar answered Sep 19 '25 23:09

aneroid