Let's say I have the following dataframe
id time lat long
0 1 2020-11-01 21:48:00 66.027694 12.627349
1 1 2020-11-01 21:49:00 66.027833 12.630198
2 1 2020-11-01 21:50:00 66.027900 12.635473
3 1 2020-11-01 21:51:00 66.027967 12.640748
4 1 2020-11-01 21:52:00 66.028350 12.643367
5 1 2020-11-01 21:53:00 66.028450 12.643948
6 1 2020-11-01 21:54:00 66.028183 12.643750
7 1 2020-11-01 21:55:00 66.027767 12.643016
8 2 2020-11-01 23:30:00 66.031667 12.639148
9 2 2020-11-01 23:31:00 66.034033 12.637517
10 2 2020-11-01 23:32:00 66.036950 12.636683
11 2 2020-11-01 23:33:00 66.039742 12.636417
12 2 2020-11-01 23:34:00 66.042533 12.636150
13 2 2020-11-01 23:35:00 66.044725 12.636541
14 2 2020-11-01 23:36:00 66.046867 12.637715
15 2 2020-11-01 23:37:00 66.050550 12.641467
16 2 2020-11-01 23:38:00 66.053014 12.644047
17 2 2020-11-01 23:39:00 66.055478 12.646627
18 2 2020-11-01 23:40:00 66.057942 12.649207
19 2 2020-11-01 23:41:00 66.060406 12.651788
20 2 2020-11-01 23:42:00 66.062869 12.654368
21 2 2020-11-01 23:43:00 66.065333 12.656948
22 2 2020-11-01 23:44:00 66.067255 12.658876
23 2 2020-11-01 23:45:00 66.069177 12.660804
24 2 2020-11-01 23:46:00 66.071098 12.662732
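For reference, a frame of this shape can be built like so (abridged to the first three rows of each id; a sketch of the sample data, not the full table):

```python
import pandas as pd

# Abridged reconstruction of the input frame (first three rows of each id);
# the real `route` has the same four columns, with 'time' as datetime64
route = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 2],
    'time': pd.to_datetime([
        '2020-11-01 21:48:00', '2020-11-01 21:49:00', '2020-11-01 21:50:00',
        '2020-11-01 23:30:00', '2020-11-01 23:31:00', '2020-11-01 23:32:00',
    ]),
    'lat':  [66.027694, 66.027833, 66.027900, 66.031667, 66.034033, 66.036950],
    'long': [12.627349, 12.630198, 12.635473, 12.639148, 12.637517, 12.636683],
})
```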
And I want to resample every group by its id number so that I get 5 points equally spaced out (in time) for each group.
The result should look like this for the example above.
id time lat long
0 1 2020-11-01 21:47:15 66.027694 12.627349
1 1 2020-11-01 21:49:00 66.027867 12.632836
2 1 2020-11-01 21:50:45 66.028158 12.642057
3 1 2020-11-01 21:52:30 66.028317 12.643849
4 1 2020-11-01 21:54:15 66.027767 12.643016
5 2 2020-11-01 23:28:00 66.032850 12.638333
6 2 2020-11-01 23:32:00 66.040987 12.636448
7 2 2020-11-01 23:36:00 66.051477 12.642464
8 2 2020-11-01 23:40:00 66.061638 12.653078
9 2 2020-11-01 23:44:00 66.069177 12.660804
I have already solved it and got the desired result, but it is way too slow, since I don't have 25 rows but more than 10 million.
There is probably a better solution than mine.
My code is:
import pandas as pd

# Define the number of points per group
points = 5
# route is the input dataframe (see first table from above)
groups = route.groupby('id')
# 'times' is for getting the first and last time in each group
times = groups['time'].agg(['first','last']).reset_index()
# Calculate the time step that yields 5 data points per group
times['diff'] = (times['last'] - times['first'])/(points-1)
# For saving each series of points
waypoints = []
for (name, group), (time_name, time_group) in zip(groups, times.groupby('id')):
# Time step to string in seconds (Not the best solution)
str_time = "{0}s".format(int(time_group['diff'].iloc[0].total_seconds()))
# Saving points
waypoints.append(
group.set_index('time').groupby(
'id'
).resample(
str_time
).mean().interpolate('linear').drop('id', axis = 1).reset_index()
)
# Concatenate back to dataframe (see last table from above)
pd_waypoints = pd.concat(waypoints).reset_index()
Here's one way to speed it up. The idea is to replicate what resample
does, which is essentially a groupby on truncated timestamps, but with a different frequency for each id and without iterating over the groups one by one:
import pandas as pd

# make a copy of the route dataframe to work on
z = route.copy()
# calculate frequency f in seconds for each id
# and t0 as the midnight of the first day of the group
g = z.groupby('id')['time']
z['f'] = (g.transform('max') - g.transform('min')).astype('int64') / (points - 1) // 10**9
z['t0'] = g.transform('min').dt.floor('d').astype('int64') // 10**9
# calculate seconds since t0
# this is what .resample(...) operates on
z['s_since_t0'] = z['time'].astype('int64') // 10**9 - z['t0']
# get grouped seconds since t0
# in the same way that .resample(...) does
z['s_group'] = z['t0'] + z['s_since_t0'] // z['f'] * z['f']
# convert grouped seconds to datetime
z['time_group'] = pd.to_datetime(z['s_group'], unit='s')
# calculate mean
z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()
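The floor-binning step can be sanity-checked against resample on a single id: both should produce the same bin labels and means. A minimal sketch with a hypothetical five-sample series (105 s is the id-1 step from the question, 7 minutes / 4):

```python
import pandas as pd

# Five 1-minute samples starting at 21:48, binned into 105 s buckets
t = pd.date_range('2020-11-01 21:48:00', periods=5, freq='min')
s = pd.Series(range(5), index=t, dtype=float)

# resample() with a fixed frequency bins from midnight of the first day
r = s.resample('105s').mean()

# manual equivalent: floor seconds-since-midnight to a multiple of 105
t0 = t[0].floor('d')
secs = (t - t0).total_seconds().astype(int)
labels = t0 + pd.to_timedelta(secs // 105 * 105, unit='s')
m = s.groupby(labels).mean()

# both approaches assign identical bins and means
assert (r.index == m.index).all() and (r.values == m.values).all()
```

The first bin label here is 21:47:15, matching the first row of the expected output above.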
Output:
id time_group lat long
0 1 2020-11-01 21:47:15 66.027694 12.627349
1 1 2020-11-01 21:49:00 66.027867 12.632835
2 1 2020-11-01 21:50:45 66.028159 12.642057
3 1 2020-11-01 21:52:30 66.028317 12.643849
4 1 2020-11-01 21:54:15 66.027767 12.643016
5 2 2020-11-01 23:28:00 66.032850 12.638332
6 2 2020-11-01 23:32:00 66.040987 12.636448
7 2 2020-11-01 23:36:00 66.051477 12.642464
8 2 2020-11-01 23:40:00 66.061638 12.653078
9 2 2020-11-01 23:44:00 66.069177 12.660804
On a 10k dataset this version is ~400x faster than the original:
%%timeit
original()
3.72 s ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
proposed()
8.83 ms ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
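Outside a notebook, the same comparison can be reproduced with timeit. A sketch timing the vectorized version on synthetic data (the dataset here is hypothetical, 100 ids with 100 one-minute rows each; the original loop version can be wrapped and timed the same way):

```python
import timeit

import numpy as np
import pandas as pd

points = 5

# synthetic route: 100 ids x 100 one-minute rows (hypothetical test data)
n_ids, n_rows = 100, 100
rng = np.random.default_rng(0)
route = pd.DataFrame({
    'id': np.repeat(np.arange(n_ids), n_rows),
    'time': pd.Timestamp('2020-11-01') + pd.to_timedelta(
        np.tile(np.arange(n_rows), n_ids), unit='min'),
    'lat': rng.normal(66, 0.01, n_ids * n_rows),
    'long': rng.normal(12.6, 0.01, n_ids * n_rows),
})

def proposed():
    # vectorized floor-binning, as in the answer above
    z = route.copy()
    g = z.groupby('id')['time']
    z['f'] = (g.transform('max') - g.transform('min')).astype('int64') / (points - 1) // 10**9
    z['t0'] = g.transform('min').dt.floor('d').astype('int64') // 10**9
    z['s_since_t0'] = z['time'].astype('int64') // 10**9 - z['t0']
    z['s_group'] = z['t0'] + z['s_since_t0'] // z['f'] * z['f']
    z['time_group'] = pd.to_datetime(z['s_group'], unit='s')
    return z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()

res = proposed()
print(timeit.timeit(proposed, number=10), 'seconds for 10 runs')
```

Each id collapses to exactly `points` rows, so the result has `n_ids * points` rows.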