Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I merge two dataframes based on last date of each group?

These are my DataFrames:

import pandas as pd
df1 = pd.DataFrame(
    {
        'close': [100, 150, 200, 55, 69, 221, 2210, 111, 120, 140, 150, 170],
        'date': [
            '2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
            '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08',
            '2024-01-09', '2024-01-10', '2024-01-11', '2024-01-12',
        ]

    }
)

df2 = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
        'close': [100, 105, 112, 117, 55, 65, 221, 211],
        'date': [
            '2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
            '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08'
        ],

        'extend': [
            '2024-01-09', '2024-01-09', '2024-01-09', '2024-01-09',
            '2024-01-11', '2024-01-11', '2024-01-11', '2024-01-11'
        ],

    }
)

And this is the expected output. I want to extend df2 for each group in group column:

  group  close        date          extend
0     a    100      2024-01-01      2024-01-09
1     a    105      2024-01-02      2024-01-09
2     a    112      2024-01-03      2024-01-09
3     a    117      2024-01-04      2024-01-09
4     a    69       2024-01-05      2024-01-09
5     a    221      2024-01-06      2024-01-09
6     a    2210     2024-01-07      2024-01-09
7     a    111      2024-01-08      2024-01-09
8     a    120      2024-01-09      2024-01-09

7     b     55      2024-01-05      2024-01-11
8     b     65      2024-01-06      2024-01-11
9     b    221      2024-01-07      2024-01-11
10    b    211      2024-01-08      2024-01-11
11    b    120      2024-01-09      2024-01-11
12    b    140      2024-01-10      2024-01-11
13    b    150      2024-01-11      2024-01-11

The logic is:

Each group in df2 has a fixed extend date. This is the basically the date that each group should be extended using df1.

For example for group a, The data should be extended from 2024-01-04 to 2024-01-09. The start point of extending is basically df2.date.iloc[-1] for each group and the end is the extend.

This is my attempt that didn't work:

import janitor
def func(df2, df1):
    df2['extend_start'] = df2.date.iloc[-1]
    df2['extend_start'] = pd.to_datetime(df2.extend_start)

    df3 = df2.conditional_join(
        df1,
        ('extend_start', 'date', '<'),
        ('extend', 'date', '>')
    )

    return df3

df1['date'] = pd.to_datetime(df1.date)
df2['extend'] = pd.to_datetime(df2.extend)

out = df2.groupby('group').apply(func, df1=df1)
like image 985
Amir Avatar asked Sep 06 '25 19:09

Amir


1 Answers

You could add the missing dates with groupby.apply, then map the unknown dates:

# ensure datetime
df1['date'] = pd.to_datetime(df1['date'])
df2[['date', 'extend']] = df2[['date', 'extend']].apply(pd.to_datetime)

# fill missing dates
# map missing one with values of df1
out = (df2.set_index('date').groupby('group')
          .apply(lambda x: x.reindex(pd.date_range(min(x.index.min(), x['extend'].min()),
                                                   max(x.index.max(), x['extend'].max()),
                                                  )).rename_axis('date')
                            .drop(columns=['group'])
                            .assign(extend=lambda x: x['extend'].ffill(),)
                )
          .reset_index()
          .assign(close=lambda x: x['close'].fillna(x['date'].map(df1.set_index('date')['close'])))
      )

Output:

   group       date   close     extend
0      a 2024-01-01   100.0 2024-01-09
1      a 2024-01-02   105.0 2024-01-09
2      a 2024-01-03   112.0 2024-01-09
3      a 2024-01-04   117.0 2024-01-09
4      a 2024-01-05    69.0 2024-01-09
5      a 2024-01-06   221.0 2024-01-09
6      a 2024-01-07  2210.0 2024-01-09
7      a 2024-01-08   111.0 2024-01-09
8      a 2024-01-09   120.0 2024-01-09
9      b 2024-01-05    55.0 2024-01-11
10     b 2024-01-06    65.0 2024-01-11
11     b 2024-01-07   221.0 2024-01-11
12     b 2024-01-08   211.0 2024-01-11
13     b 2024-01-09   120.0 2024-01-11
14     b 2024-01-10   140.0 2024-01-11
15     b 2024-01-11   150.0 2024-01-11

Intermediate before the map step:

   group       date  close     extend
0      a 2024-01-01  100.0 2024-01-09
1      a 2024-01-02  105.0 2024-01-09
2      a 2024-01-03  112.0 2024-01-09
3      a 2024-01-04  117.0 2024-01-09
4      a 2024-01-05    NaN 2024-01-09
5      a 2024-01-06    NaN 2024-01-09
6      a 2024-01-07    NaN 2024-01-09
7      a 2024-01-08    NaN 2024-01-09
8      a 2024-01-09    NaN 2024-01-09
9      b 2024-01-05   55.0 2024-01-11
10     b 2024-01-06   65.0 2024-01-11
11     b 2024-01-07  221.0 2024-01-11
12     b 2024-01-08  211.0 2024-01-11
13     b 2024-01-09    NaN 2024-01-11
14     b 2024-01-10    NaN 2024-01-11
15     b 2024-01-11    NaN 2024-01-11
like image 102
mozway Avatar answered Sep 09 '25 13:09

mozway