I have seen a lot of similar questions but didn't quite find an answer to my specific problem. Let's say I have a df:
sample_id tested_at test_value
1 2020-07-21 5
1 2020-07-22 4
1 2020-07-23 6
2 2020-07-26 6
2 2020-07-28 5
3 2020-07-22 4
3 2020-07-27 4
3 2020-07-30 6
Within each sample_id, the df is already sorted in ascending order by the tested_at column. I now need to add another column, first_test, which shows on every row the first test value for that row's sample_id, regardless of whether it is the highest value or not. The output should be:
sample_id tested_at test_value first_test
1 2020-07-21 5 5
1 2020-07-22 4 5
1 2020-07-23 6 5
2 2020-07-26 6 6
2 2020-07-28 5 6
3 2020-07-22 4 4
3 2020-07-27 4 4
3 2020-07-30 6 4
The df is also quite big, so a faster way would be very much appreciated.
You can use pandas' groupby to group by sample_id, and then call transform("first") to broadcast each group's first test_value to every row of that group. Note that "first" here means first by row position within the group, not earliest by date, so make sure the rows are ordered by tested_at beforehand.
import pandas as pd

df = pd.DataFrame(
    [
        [1, "2020-07-21", 5],
        [1, "2020-07-22", 4],
        [1, "2020-07-23", 6],
        [2, "2020-07-26", 6],
        [2, "2020-07-28", 5],
        [3, "2020-07-22", 4],
        [3, "2020-07-27", 4],
        [3, "2020-07-30", 6],
    ],
    columns=["sample_id", "tested_at", "test_value"],
)

# Broadcast the first test_value of each sample_id to all rows of that group
df["first_test"] = df.groupby("sample_id")["test_value"].transform("first")
This results in:
sample_id tested_at test_value first_test
0 1 2020-07-21 5 5
1 1 2020-07-22 4 5
2 1 2020-07-23 6 5
3 2 2020-07-26 6 6
4 2 2020-07-28 5 6
5 3 2020-07-22 4 4
6 3 2020-07-27 4 4
7 3 2020-07-30 6 4
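If the rows are not guaranteed to be in date order, one option is to sort a copy first and let pandas align the result back on the index. The following is only a minimal sketch under that assumption: the shuffled row order is made up for illustration (same values as in the question), and the name `ordered` is hypothetical.

```python
import pandas as pd

# Same data as in the question, but with the rows deliberately shuffled
df = pd.DataFrame(
    [
        [3, "2020-07-30", 6],
        [1, "2020-07-22", 4],
        [2, "2020-07-28", 5],
        [1, "2020-07-21", 5],
        [3, "2020-07-22", 4],
        [2, "2020-07-26", 6],
        [1, "2020-07-23", 6],
        [3, "2020-07-27", 4],
    ],
    columns=["sample_id", "tested_at", "test_value"],
)

# Parse the dates so they sort chronologically rather than as strings
df["tested_at"] = pd.to_datetime(df["tested_at"])

# Sort a copy by sample and date; the original index labels are kept
ordered = df.sort_values(["sample_id", "tested_at"])

# transform returns a Series indexed like `ordered`; assigning it to df
# aligns on the index, so df keeps its original row order
df["first_test"] = ordered.groupby("sample_id")["test_value"].transform("first")

print(df)
```

Because the assignment aligns on index labels, there is no need to re-sort df afterwards. On a large frame the sort adds cost, so it is probably worth skipping when the data is already known to be ordered.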