Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Group based time difference analysis for large dataset with Pandas

I have a large dataset (roughly 250k rows) with property ID numbers and survey response datetimes. I'm trying to build some kind of basic fraud detection and one of the factors I'm considering is response time between different surveys for each property/store. If there is less than 10 minutes between any surveys for a particular location, I want it flagged as TRUE under a column called "close_response".

I have nested for loops within a function that is applied to each group that achieves what I want, but it's computationally expensive and time consuming when applied to a very large dataset.

import pandas as pd
from datetime import timedelta

# Sample DataFrame
data = {'property_id': [1, 1, 1, 2, 2, 3, 3],
        'response_datetime': ['2023-10-19 08:00:00', '2023-10-19 08:05:00', '2023-10-19 08:25:00', '2023-10-19 09:00:00', '2023-10-19 09:15:00', '2023-10-19 10:00:00', '2023-10-19 11:00:00']}
df = pd.DataFrame(data)
df['response_datetime'] = pd.to_datetime(df['response_datetime'])

def mark_close_responses(group):
    group['close_response'] = False
    for i in range(len(group)):
        for j in range(i+1, len(group)):
            time_diff = group.iloc[j]['response_datetime'] - group.iloc[i]['response_datetime']
            if time_diff <= timedelta(minutes=10):
                group.at[i, 'close_response'] = True
                group.at[j, 'close_response'] = True
    return group

df = df.groupby('property_id').apply(mark_close_responses)
like image 769
jp207 Avatar asked Oct 17 '25 21:10

jp207


1 Answers

Compute the difference between successive rows using diff, get rows from the difference that are less than or equal to 10 minutes, use a conditional to get accurate positions per group, back fill and assign back to the original dataframe:

difference=df.groupby('property_id').response_datetime.diff()
out = difference.le(pd.Timedelta(minutes=10))
out = difference.where(difference.isna(), out).bfill()
df.assign(close_response=out)
   property_id   response_datetime  close_response
0            1 2023-10-19 08:00:00            True
1            1 2023-10-19 08:05:00            True
2            1 2023-10-19 08:25:00           False
3            2 2023-10-19 09:00:00           False
4            2 2023-10-19 09:15:00           False
5            3 2023-10-19 10:00:00           False
6            3 2023-10-19 11:00:00           False
like image 57
sammywemmy Avatar answered Oct 19 '25 13:10

sammywemmy



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!