
Python/Pandas: Merging Consecutive Rows Only if Matching Columns

Tags:

python

pandas

I have a large dataset consisting of records of objects travelling within a network. The data is stored in a pandas DataFrame, which roughly looks like this:

Obj ID | Timestamp | x | y | link ID
-------|-----------|---|---|---------
 123   |  506      |123|456|  ab12
 123   |  517      |129|436|  ab12
 123   |  519      |125|454|  cd34
 126   |  501      |123|426|  gk77
 126   |  505      |153|453|  gk77
 126   |  507      |129|856|  bj88
 126   |  508      |143|496|  bj88
 126   |  512      |125|427|  gk77
 126   |  515      |153|453|  gk77
 126   |  518      |127|256|  ab12

The DataFrame has been sorted by Obj ID, and each block of records belonging to an object has been sorted by time (the Timestamp field, in seconds). Two consecutive rows with the same Obj ID and link ID represent the start and end of that object's visit to that link. An object's trip eventually ends at some link, indicated by a single, unpaired link ID appearing at the end of that object's records.

Here's the desired output: condense the start-record and end-record of one object's visit to one link into a single record. For the final link of each object (which has only a start-record), use the StartTime as the EndTime as well.

Obj ID | StartTime | EndTime | link ID
-------|-----------|---------|---------
 123   |  506      |   517   |  ab12
 123   |  519      |   519   |  cd34
 126   |  501      |   505   |  gk77
 126   |  507      |   508   |  bj88
 126   |  512      |   515   |  gk77
 126   |  518      |   518   |  ab12
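For reference, the sample above can be reproduced as a DataFrame like this (column names assumed to match the table headers exactly, including the embedded spaces):

```python
import pandas as pd

# Sample data matching the tables above
df = pd.DataFrame({
    'Obj ID':    [123, 123, 123, 126, 126, 126, 126, 126, 126, 126],
    'Timestamp': [506, 517, 519, 501, 505, 507, 508, 512, 515, 518],
    'x':         [123, 129, 125, 123, 153, 129, 143, 125, 153, 127],
    'y':         [456, 436, 454, 426, 453, 856, 496, 427, 453, 256],
    'link ID':   ['ab12', 'ab12', 'cd34', 'gk77', 'gk77',
                  'bj88', 'bj88', 'gk77', 'gk77', 'ab12'],
})
```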

Note:

  • I cannot guarantee that an object will not travel on the same link twice, but each visit to a link needs to be a separate record.

  • A naive loop solution that I implemented ran out of memory because the dataset is quite large.

EDIT: I edited the dataset to include the situation mentioned in the first point of my "notes".

Kev W. asked Sep 12 '25
1 Answer

Let's try this:

# New group whenever the link ID differs from the previous row's
g = df['link ID'].ne(df['link ID'].shift().bfill()).cumsum().rename('group')

(df.groupby(['Obj ID', 'link ID', g])['Timestamp']
   .agg(['min', 'max'])
   .reset_index()
   .rename(columns={'min': 'StartTime', 'max': 'EndTime'})
   .drop('group', axis=1))

Output:

   Obj ID link ID  StartTime  EndTime
0     123    ab12        506      517
1     123    cd34        519      519
2     126    ab12        518      518
3     126    bj88        507      508
4     126    gk77        501      505
5     126    gk77        512      515
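Note that groupby sorts rows by its keys (Obj ID, then link ID), so the result is not in chronological order. A minimal end-to-end sketch of the same approach, with an added sort on StartTime to match the order in the question (the column order and variable name `out` are my own choices):

```python
import pandas as pd

# Sample data from the question (x/y columns omitted for brevity)
df = pd.DataFrame({
    'Obj ID':    [123, 123, 123, 126, 126, 126, 126, 126, 126, 126],
    'Timestamp': [506, 517, 519, 501, 505, 507, 508, 512, 515, 518],
    'link ID':   ['ab12', 'ab12', 'cd34', 'gk77', 'gk77',
                  'bj88', 'bj88', 'gk77', 'gk77', 'ab12'],
})

# Start a new group each time the link ID changes between consecutive rows,
# so repeat visits to the same link become separate groups
g = df['link ID'].ne(df['link ID'].shift()).cumsum().rename('group')

out = (df.groupby(['Obj ID', 'link ID', g])['Timestamp']
         .agg(StartTime='min', EndTime='max')
         .reset_index()
         .drop(columns='group')
         .sort_values(['Obj ID', 'StartTime'])   # restore chronological order
         .reset_index(drop=True))
```

With the sample data, `out` matches the desired output table row for row: single-row visits (e.g. obj 123 on cd34) get StartTime == EndTime automatically, because min and max of one value coincide.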
Scott Boston answered Sep 15 '25