Pandas resample n points on large dataframes

Tags: python, pandas

Let's say I have the following dataframe:

    id                time        lat       long
0    1 2020-11-01 21:48:00  66.027694  12.627349
1    1 2020-11-01 21:49:00  66.027833  12.630198
2    1 2020-11-01 21:50:00  66.027900  12.635473
3    1 2020-11-01 21:51:00  66.027967  12.640748
4    1 2020-11-01 21:52:00  66.028350  12.643367
5    1 2020-11-01 21:53:00  66.028450  12.643948
6    1 2020-11-01 21:54:00  66.028183  12.643750
7    1 2020-11-01 21:55:00  66.027767  12.643016
8    2 2020-11-01 23:30:00  66.031667  12.639148
9    2 2020-11-01 23:31:00  66.034033  12.637517
10   2 2020-11-01 23:32:00  66.036950  12.636683
11   2 2020-11-01 23:33:00  66.039742  12.636417
12   2 2020-11-01 23:34:00  66.042533  12.636150
13   2 2020-11-01 23:35:00  66.044725  12.636541
14   2 2020-11-01 23:36:00  66.046867  12.637715
15   2 2020-11-01 23:37:00  66.050550  12.641467
16   2 2020-11-01 23:38:00  66.053014  12.644047
17   2 2020-11-01 23:39:00  66.055478  12.646627
18   2 2020-11-01 23:40:00  66.057942  12.649207
19   2 2020-11-01 23:41:00  66.060406  12.651788
20   2 2020-11-01 23:42:00  66.062869  12.654368
21   2 2020-11-01 23:43:00  66.065333  12.656948
22   2 2020-11-01 23:44:00  66.067255  12.658876
23   2 2020-11-01 23:45:00  66.069177  12.660804
24   2 2020-11-01 23:46:00  66.071098  12.662732
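
For reproducibility, here's a minimal snippet that rebuilds the first three rows of each group from the table above (the full frame follows the same pattern):

import pandas as pd

route = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'time': pd.to_datetime([
        '2020-11-01 21:48:00', '2020-11-01 21:49:00', '2020-11-01 21:50:00',
        '2020-11-01 23:30:00', '2020-11-01 23:31:00', '2020-11-01 23:32:00',
    ]),
    'lat': [66.027694, 66.027833, 66.027900, 66.031667, 66.034033, 66.036950],
    'long': [12.627349, 12.630198, 12.635473, 12.639148, 12.637517, 12.636683],
})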

And I want to resample every group by its id number so I get 5 points equally spaced out (in time) for each group.

For the example above, the result should look like this:

   id                time        lat       long
0   1 2020-11-01 21:47:15  66.027694  12.627349
1   1 2020-11-01 21:49:00  66.027867  12.632836
2   1 2020-11-01 21:50:45  66.028158  12.642057
3   1 2020-11-01 21:52:30  66.028317  12.643849
4   1 2020-11-01 21:54:15  66.027767  12.643016
5   2 2020-11-01 23:28:00  66.032850  12.638333
6   2 2020-11-01 23:32:00  66.040987  12.636448
7   2 2020-11-01 23:36:00  66.051477  12.642464
8   2 2020-11-01 23:40:00  66.061638  12.653078
9   2 2020-11-01 23:44:00  66.069177  12.660804
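
Note where the bucket labels come from: for id 1 the span is 7 minutes, so with 5 points the step is 420/4 = 105 s, and resample (with its default 'start_day' origin) anchors the buckets at midnight of the first day, which is why the first label is 21:47:15 rather than 21:48:00. A quick check of that arithmetic:

import pandas as pd

first = pd.Timestamp('2020-11-01 21:48:00')
last = pd.Timestamp('2020-11-01 21:55:00')
step_s = int(((last - first) // 4).total_seconds())        # 105 s between 5 points

# resample anchors buckets at midnight of the first day, so the first
# bucket label is `first` floored to a multiple of the step since midnight
midnight = first.floor('d')
secs = (first - midnight) // pd.Timedelta('1s')            # 78480 s
print(midnight + pd.Timedelta(seconds=secs // step_s * step_s))
# 2020-11-01 21:47:15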

I have already solved it and got the desired result, but my solution is way too slow: I don't have 25 rows, I have over 10 million.

There is probably a better solution than mine.

My code is:

import pandas as pd

# Define the number of points per group
points = 5

# route is the input dataframe (see the first table above)
groups = route.groupby('id')

# 'times' holds the first and last time in each group
times = groups['time'].agg(['first', 'last']).reset_index()

# Calculate the time step that yields 5 data points per group
times['diff'] = (times['last'] - times['first']) / (points - 1)
times_groups = times.groupby('id')

# Resample each group separately and collect the results
waypoints = []
for (name, group), (time_name, time_group) in zip(groups, times_groups):
    # Time step as a frequency string in seconds (not the best solution)
    str_time = "{0}s".format(int(time_group['diff'].iloc[0].total_seconds()))
    # Resample on time, average within each bucket, and interpolate gaps
    waypoints.append(
        group.set_index('time').groupby(
            'id'
        ).resample(
            str_time
        ).mean().interpolate('linear').drop('id', axis=1).reset_index()
    )

# Concatenate back into one dataframe (see the last table above)
pd_waypoints = pd.concat(waypoints).reset_index()
asked Sep 19 '25 by Lampen

1 Answer

Here's one way to speed it up. The idea is to replicate what resample does, which is essentially a groupby on truncated times, but to use a different frequency for each id without iterating over the groups one by one (other than to calculate the frequencies):

# make a copy of the route dataframe to work on
z = route.copy()

# calculate the frequency f in whole seconds for each id,
# and t0 as midnight of the first day of each group
# (.astype('int64') yields nanoseconds, hence the // 10**9)
g = z.groupby('id')['time']
z['f'] = (g.transform('max') - g.transform('min')).astype('int64') / (points - 1) // 10**9
z['t0'] = g.transform('min').dt.floor('d').astype('int64') // 10**9

# calculate seconds since t0
# this is what .resample(...) operates on
z['s_since_t0'] = z['time'].astype('int64') // 10**9 - z['t0']

# snap each timestamp to the start of its bucket, in epoch seconds,
# the same way .resample(...) buckets times
z['s_group'] = z['t0'] + z['s_since_t0'] // z['f'] * z['f']

# convert grouped seconds to datetime
z['time_group'] = pd.to_datetime(z['s_group'], unit='s')

# calculate mean
z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()

Output:

   id          time_group        lat       long
0   1 2020-11-01 21:47:15  66.027694  12.627349
1   1 2020-11-01 21:49:00  66.027867  12.632835
2   1 2020-11-01 21:50:45  66.028159  12.642057
3   1 2020-11-01 21:52:30  66.028317  12.643849
4   1 2020-11-01 21:54:15  66.027767  12.643016
5   2 2020-11-01 23:28:00  66.032850  12.638332
6   2 2020-11-01 23:32:00  66.040987  12.636448
7   2 2020-11-01 23:36:00  66.051477  12.642464
8   2 2020-11-01 23:40:00  66.061638  12.653078
9   2 2020-11-01 23:44:00  66.069177  12.660804
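
If you want the result to match the question's expected output exactly, rename time_group back to time; result below is just an illustrative name:

result = (
    z.groupby(['id', 'time_group'])[['lat', 'long']]
    .mean()
    .reset_index()
    .rename(columns={'time_group': 'time'})
)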

On a 10k-row dataset this version is ~400x faster than the original:

%%timeit
original()

3.72 s ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
proposed()

8.83 ms ± 43.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
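
For reference, the timings assume the two approaches are wrapped in functions operating on the same route and points defined above; a minimal sketch of what original() and proposed() might look like (the bodies are assumptions based on the code in the question and in this answer):

def original():
    # the question's loop-based approach
    groups = route.groupby('id')
    times = groups['time'].agg(['first', 'last']).reset_index()
    times['diff'] = (times['last'] - times['first']) / (points - 1)
    waypoints = []
    for (name, group), (_, time_group) in zip(groups, times.groupby('id')):
        str_time = "{0}s".format(int(time_group['diff'].iloc[0].total_seconds()))
        waypoints.append(
            group.set_index('time').groupby('id').resample(str_time)
            .mean().interpolate('linear').drop('id', axis=1).reset_index()
        )
    return pd.concat(waypoints).reset_index()

def proposed():
    # this answer's vectorized approach
    z = route.copy()
    g = z.groupby('id')['time']
    z['f'] = (g.transform('max') - g.transform('min')).astype('int64') / (points - 1) // 10**9
    z['t0'] = g.transform('min').dt.floor('d').astype('int64') // 10**9
    z['s_since_t0'] = z['time'].astype('int64') // 10**9 - z['t0']
    z['s_group'] = z['t0'] + z['s_since_t0'] // z['f'] * z['f']
    z['time_group'] = pd.to_datetime(z['s_group'], unit='s')
    return z.groupby(['id', 'time_group'])[['lat', 'long']].mean().reset_index()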
answered Sep 21 '25 by perl