
How to find first occurrence for each id based on datetime column with pandas?

Tags:

python

pandas

I have seen a lot of similar questions but didn't quite find an answer to my specific problem. Let's say I have a df:

    sample_id     tested_at   test_value
            1    2020-07-21            5
            1    2020-07-22            4
            1    2020-07-23            6
            2    2020-07-26            6
            2    2020-07-28            5
            3    2020-07-22            4
            3    2020-07-27            4
            3    2020-07-30            6 

The df is already sorted in ascending order by the tested_at column. I now need to add another column, first_test, that holds the first test value for each sample_id on every row, regardless of whether it is the highest value or not. The output should be:

    sample_id     tested_at   test_value   first_test
            1    2020-07-21            5            5
            1    2020-07-22            4            5
            1    2020-07-23            6            5
            2    2020-07-26            6            6
            2    2020-07-28            5            6
            3    2020-07-22            4            4
            3    2020-07-27            4            4
            3    2020-07-30            6            4

The df is also quite big, so a faster way would be very much appreciated.

Geormy White asked Oct 27 '25 10:10



1 Answer

You can use pandas' groupby to group by sample ID, and then use the transform method to get the first value per sample ID. Note that this takes the first value by row number, not the first value by date, so make sure the rows are ordered by date.

import pandas as pd

df = pd.DataFrame(
    [
        [1, "2020-07-21", 5],
        [1, "2020-07-22", 4],
        [1, "2020-07-23", 6],
        [2, "2020-07-26", 6],
        [2, "2020-07-28", 5],
        [3, "2020-07-22", 4],
        [3, "2020-07-27", 4],
        [3, "2020-07-30", 6],
    ],
    columns=["sample_id", "tested_at", "test_value"],
)

df["first_test"] = df.groupby("sample_id")["test_value"].transform("first")

Which results in:

   sample_id   tested_at  test_value  first_test
0          1  2020-07-21           5           5
1          1  2020-07-22           4           5
2          1  2020-07-23           6           5
3          2  2020-07-26           6           6
4          2  2020-07-28           5           6
5          3  2020-07-22           4           4
6          3  2020-07-27           4           4
7          3  2020-07-30           6           4
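
If the rows are not guaranteed to be in date order, one option is to sort before grouping. The following is a minimal sketch under that assumption, not part of the original answer: the shuffled input rows are made up for illustration (same values as the question, different order), and converting tested_at with pd.to_datetime is an extra step assumed here so that the sort compares real dates rather than strings.

```python
import pandas as pd

# Illustrative input: the question's rows, deliberately out of order.
df = pd.DataFrame(
    {
        "sample_id": [3, 1, 2, 1, 3, 2, 1, 3],
        "tested_at": [
            "2020-07-30", "2020-07-22", "2020-07-28", "2020-07-21",
            "2020-07-22", "2020-07-26", "2020-07-23", "2020-07-27",
        ],
        "test_value": [6, 4, 5, 5, 4, 6, 6, 4],
    }
)

# Parse the dates so sorting is chronological, not lexicographic.
df["tested_at"] = pd.to_datetime(df["tested_at"])

# Sort by id and date so the first row of each group is the earliest test.
df = df.sort_values(["sample_id", "tested_at"]).reset_index(drop=True)

# Broadcast each group's first test_value to every row of that group.
df["first_test"] = df.groupby("sample_id")["test_value"].transform("first")

print(df)
```

When the frame is already sorted, as in the question, the sort step can be skipped and only the last assignment is needed.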
Swier answered Oct 28 '25 23:10



