Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to Partition Pandas DataFrame using DateTime

Tags:

python

pandas

I am writing a Python script to import pictures from my digital cameras, and I'm using Pandas to help with the bookkeeping of the incoming images. I am using the EXIF data to tag individual images with information, such as the Camera Model, image mode, image format and the timestamp of when the image was acquired by the camera. These data are used to segregate the images into a directory structure. What I'm struggling with is how to use Pandas to group images based on a clump of timestamps that are, for example, all within half-an-hour of each other. As an example, let's say I have six images, three of which were taken within nine minutes of each other and then the other three, also within nine minutes of each other, but an hour later.

import pandas
import datetime    
rawdata = [{'filename': 'image_1.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 19, 14, 16, 152847)},
 {'filename': 'image_2.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 19, 17, 16, 152847)},
 {'filename': 'image_3.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 19, 20, 16, 152847)},
 {'filename': 'image_4.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 20, 14, 16, 152847)},
 {'filename': 'image_5.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 20, 17, 16, 152847)},
 {'filename': 'image_6.jpg',
  'timestamp': datetime.datetime(2014, 11, 13, 20, 20, 16, 152847)}]
df = pandas.DataFrame(rawdata)

Is there an automatic way to partition this DataFrame with a half-an-hour threshold such that I would have image_1, image_2 and image_3 in one DataFrame and image_4, image_5 and image_6 in a second DataFrame?

like image 536
vishnubob Avatar asked Dec 10 '25 07:12

vishnubob


1 Answers

IIUC, one way would be to use the diff-compare-cumsum idiom to get cluster numbers which you can then use to groupby:

>>> df = df.sort("timestamp")
>>> cluster = (df["timestamp"].diff() > pd.Timedelta(minutes=30)).cumsum()
>>> dfs = [v for k,v in df.groupby(cluster)]
>>> for clust in dfs:
...     print(clust)
...     
      filename                  timestamp
0  image_1.jpg 2014-11-13 19:14:16.152847
1  image_2.jpg 2014-11-13 19:17:16.152847
2  image_3.jpg 2014-11-13 19:20:16.152847
      filename                  timestamp
3  image_4.jpg 2014-11-13 20:14:16.152847
4  image_5.jpg 2014-11-13 20:17:16.152847
5  image_6.jpg 2014-11-13 20:20:16.152847

This works because False ~ 0 and True ~ 1, so if we take the cumulative sum of something which becomes 1 every time a new cluster is found, we'll get the ids we want:

>>> df["timestamp"].diff()
0        NaT
1   00:03:00
2   00:03:00
3   00:54:00
4   00:03:00
5   00:03:00
Name: timestamp, dtype: timedelta64[ns]
>>> df["timestamp"].diff() > pd.Timedelta(minutes=30)
0    False
1    False
2    False
3     True
4    False
5    False
Name: timestamp, dtype: bool
>>> (df["timestamp"].diff() > pd.Timedelta(minutes=30)).cumsum()
0    0
1    0
2    0
3    1
4    1
5    1
Name: timestamp, dtype: int64
like image 179
DSM Avatar answered Dec 11 '25 20:12

DSM



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!