
Check if column values in variable time range are unique

I have a DataFrame similar to this one, but with more than 10,000,000 rows:

import pandas as pd

data = {'timestamp': ['1970-01-01 00:27:00', '1970-01-01 00:27:10', '1970-01-01 00:27:20',
                      '1970-01-01 00:27:30', '1970-01-01 00:27:40', '1970-01-01 00:27:50',
                      '1970-01-01 00:28:00', '1970-01-01 00:28:10', '1970-01-01 00:28:20',
                      '1970-01-01 00:28:30', '1970-01-01 00:28:40', '1970-01-01 00:28:50'],
        'label': [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0]}
df = pd.DataFrame(data, columns=['label'], index=data['timestamp'])
df.index = pd.to_datetime(df.index)


Index                 label
1970-01-01 00:27:00   0
1970-01-01 00:27:10   0
1970-01-01 00:27:20   1
1970-01-01 00:27:30   1
1970-01-01 00:27:40   1
1970-01-01 00:27:50   1
1970-01-01 00:28:00   0
1970-01-01 00:28:10   0
1970-01-01 00:28:20   1
1970-01-01 00:28:30   1
1970-01-01 00:28:40   1
1970-01-01 00:28:50   0

The goal is to keep all rows where the column 'label' equals 0, and to keep a row where 'label' equals 1 only if that 1 persists without interruption for a given minimum time range. For example, besides the 0 values, I only want to keep rows where a 1 is present constantly for at least 30 seconds. The result should look like this:

Index                 label
1970-01-01 00:27:00   0
1970-01-01 00:27:10   0
1970-01-01 00:27:20   1
1970-01-01 00:27:30   1
1970-01-01 00:27:40   1
1970-01-01 00:27:50   1
1970-01-01 00:28:00   0
1970-01-01 00:28:10   0
1970-01-01 00:28:50   0

The following code does the job, but for huge datasets like mine it is impracticably slow.

from datetime import timedelta

valid_range = 30  # minimum duration (in seconds) a run of 1s must last

# number of 1-rows that still have to be checked
valid_df = df[df['label'] == 1].index.values.size
df_temp = df.copy()
drop_list = []

while valid_df != 0:
    # take the first remaining 1-row and look valid_range seconds ahead
    begin = df_temp[df_temp['label'] == 1].index[0]
    end = begin + timedelta(seconds=valid_range)

    if df_temp['label'].loc[begin:end].nunique() == 1:
        # the whole window is 1s: accept it and continue after the window
        df_temp = df_temp.loc[df_temp.index > end]
    else:
        # the window is interrupted: mark this 1-row for removal
        df_temp.drop(begin, axis=0, inplace=True)
        drop_list.append(begin)

    valid_df = df_temp[df_temp['label'] == 1].index.values.size

df.drop(drop_list, axis=0, inplace=True)

Any suggestions on how to do this better/faster/with less memory consumption?


EDIT: My DataFrame may have time gaps and is not continuous, so I can't use the proposed answer to this question.
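For illustration, here is a minimal sketch (with hypothetical data, not my real set) of why a fixed row count can't stand in for a 30-second duration once the index has gaps:

import pandas as pd

# three consecutive 1-rows, but with a 10-minute gap before the last one
gap_df = pd.DataFrame({'label': [1, 1, 1]},
                      index=pd.to_datetime(['1970-01-01 00:27:00',
                                            '1970-01-01 00:27:10',
                                            '1970-01-01 00:37:10']))

# a rule like "N consecutive 1-rows = N * 10 seconds" would misjudge this run,
# because the spacing between rows is not constant
print(gap_df.index.to_series().diff().dt.total_seconds())  # NaN, 10.0, 600.0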

asked by nkldtd

1 Answer

You can try a combination of groupby and filtering on the group results:

import pandas as pd

data = {'timestamp': ['1970-01-01 00:27:00', '1970-01-01 00:27:10', '1970-01-01 00:27:20',
                      '1970-01-01 00:27:30', '1970-01-01 00:27:40', '1970-01-01 00:27:50',
                      '1970-01-01 00:28:00', '1970-01-01 00:28:10', '1970-01-01 00:28:20',
                      '1970-01-01 00:28:30', '1970-01-01 00:28:40', '1970-01-01 00:28:50'],
        'label': [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0]}
df = pd.DataFrame(data, columns=['label'], index=data['timestamp'])

# seconds elapsed since the previous row (NaN for the first row)
df["time"] = pd.to_datetime(df.index, errors='coerce')
df["delta"] = (df["time"] - df["time"].shift()).dt.total_seconds()

# label consecutive runs of identical values with a cumulative counter
gp = df.groupby([(df.label != df.label.shift()).cumsum()])

# keep only the runs whose summed deltas exceed 30 seconds; each run's first
# delta is the step from the last row of the previous run
rem = gp.filter(lambda g: g.delta.sum() > 30)

# 0-rows are kept unconditionally, long-enough 1-runs come from the filter
new_df = pd.concat([rem[rem.label == 1], df[df.label == 0]], axis=0).sort_index()
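For the sample data this reproduces the expected output from the question (the index is left as plain strings here, which still sort correctly for ISO-formatted timestamps):

print(new_df[['label']])

                     label
1970-01-01 00:27:00      0
1970-01-01 00:27:10      0
1970-01-01 00:27:20      1
1970-01-01 00:27:30      1
1970-01-01 00:27:40      1
1970-01-01 00:27:50      1
1970-01-01 00:28:00      0
1970-01-01 00:28:10      0
1970-01-01 00:28:50      0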
answered by Varsha Venkatesh