
Filter rows where dates are available across all groups using Pandas

I have the following dataframe (sample):

import pandas as pd

data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
        ['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
        ['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
        ['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data = data, columns = ['group', 'date'])

   group        date
0      A  2022-09-01
1      A  2022-09-03
2      A  2022-09-07
3      A  2022-09-08
4      B  2022-09-03
5      B  2022-09-07
6      B  2022-09-08
7      B  2022-09-09
8      C  2022-09-01
9      C  2022-09-03
10     C  2022-09-07
11     C  2022-09-10
12     D  2022-09-01
13     D  2022-09-03
14     D  2022-09-05
15     D  2022-09-07

I would like to keep only the rows whose date is present in every group. For example, the date "2022-09-03" appears in groups A, B, C and D, i.e. in all groups, so it should be kept. The date "2022-09-01" appears only in groups A, C and D and is missing from group B, so it should be dropped. Here is the desired output:

data = [['A', '2022-09-03'], ['A', '2022-09-07'], ['B', '2022-09-03'], ['B', '2022-09-07'], 
        ['C', '2022-09-03'], ['C', '2022-09-07'], ['D', '2022-09-03'], ['D', '2022-09-07']]
df_desired = pd.DataFrame(data = data, columns = ['group', 'date'])

  group        date
0     A  2022-09-03
1     A  2022-09-07
2     B  2022-09-03
3     B  2022-09-07
4     C  2022-09-03
5     C  2022-09-07
6     D  2022-09-03
7     D  2022-09-07

I know how to filter groups whose values are all the same within a group, but here I want to keep the dates that are present in every group. Does anyone know how to do this with pandas?

Quinten asked Oct 27 '25 05:10



2 Answers

You can find the dates that occur in every group with pd.crosstab, then keep only the rows whose date is one of the crosstab columns with no zero counts:

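# Rows are groups, columns are dates, values are occurrence counts
df1 = pd.crosstab(df['group'], df['date'])

# Keep the rows whose date has a non-zero count in every group
df = df[df['date'].isin(df1.columns[df1.ne(0).all()])]
print(df)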
   group        date
1      A  2022-09-03
2      A  2022-09-07
4      B  2022-09-03
5      B  2022-09-07
9      C  2022-09-03
10     C  2022-09-07
13     D  2022-09-03
15     D  2022-09-07
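
To see why this works, it may help to look at the intermediate steps separately. The following is a minimal, self-contained sketch of the same crosstab idea using the sample data from the question; the names counts, in_all_groups, common_dates and result are illustrative and not part of the original answer.

```python
import pandas as pd

data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
        ['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
        ['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
        ['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data, columns=['group', 'date'])

# Frequency table: one row per group, one column per date.
# A cell is 0 when that group has no row for that date.
counts = pd.crosstab(df['group'], df['date'])

# Boolean Series indexed by date: True when no group has a zero count.
in_all_groups = counts.ne(0).all()

# Column labels (dates) that are present in every group.
common_dates = counts.columns[in_all_groups]

# Keep only the rows whose date is one of the common dates.
result = df[df['date'].isin(common_dates)]

print(list(common_dates))  # ['2022-09-03', '2022-09-07']
print(result)
```

Because the check is "count is not zero", a date that appears more than once within a group is still treated as present once. The crosstab has one cell per (group, date) pair, so it can get large when there are many distinct dates.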
jezrael answered Oct 28 '25 18:10



One option is to group by date, count the number of unique groups each date appears in, and keep only the rows where that count equals the total number of groups in the dataframe:

df.loc[df.groupby('date').group.transform('nunique').eq(df.group.nunique())]

   group        date
1      A  2022-09-03
2      A  2022-09-07
4      B  2022-09-03
5      B  2022-09-07
9      C  2022-09-03
10     C  2022-09-07
13     D  2022-09-03
15     D  2022-09-07
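
The one-liner can be unpacked into named steps, which makes the comparison easier to inspect. This is a minimal sketch using the sample data and df_desired from the question; the names n_groups, groups_per_date, mask and result are illustrative, and the reset_index call is an addition made only so the index matches the desired output.

```python
import pandas as pd

data = [['A', '2022-09-01'], ['A', '2022-09-03'], ['A', '2022-09-07'], ['A', '2022-09-08'],
        ['B', '2022-09-03'], ['B', '2022-09-07'], ['B', '2022-09-08'], ['B', '2022-09-09'],
        ['C', '2022-09-01'], ['C', '2022-09-03'], ['C', '2022-09-07'], ['C', '2022-09-10'],
        ['D', '2022-09-01'], ['D', '2022-09-03'], ['D', '2022-09-05'], ['D', '2022-09-07']]
df = pd.DataFrame(data, columns=['group', 'date'])

# Total number of distinct groups in the dataframe (4 for the sample data).
n_groups = df['group'].nunique()

# For every row, the number of distinct groups that share that row's date.
# transform returns a Series aligned with df, so it can be used as a row mask.
groups_per_date = df.groupby('date')['group'].transform('nunique')

# True for rows whose date appears in every group.
mask = groups_per_date.eq(n_groups)

# Filter, then renumber the index from 0 to match the desired output.
result = df.loc[mask].reset_index(drop=True)

desired = [['A', '2022-09-03'], ['A', '2022-09-07'], ['B', '2022-09-03'], ['B', '2022-09-07'],
           ['C', '2022-09-03'], ['C', '2022-09-07'], ['D', '2022-09-03'], ['D', '2022-09-07']]
df_desired = pd.DataFrame(desired, columns=['group', 'date'])

print(result.equals(df_desired))
```

Since nunique counts distinct groups per date, repeated (group, date) rows do not inflate the count. Note that the comparison is against the groups present in df itself, so a group with no rows at all is not taken into account.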
sammywemmy answered Oct 28 '25 20:10
