
Groupby pandas in different sections

I have a serialized dataset whose content is separated by spaces, like this: #a value1 #b value2 ..., where the token starting with # is the column name and the token after it is the value. My problem occurs in sections of the dataset that contain a sequence like "#% value1 #% value2". This specific marker represents a column with multiple values, so I need a mechanism to collapse these consecutive rows into one. E.g. original data = #a value1 #b value2 #% value3 #% value4 #a value5 #b value6 #% value7 #% value8

After my split process:

Key    value
#a.     Value1
#b.     Value2
#%.    Value3
#%.    Value4
#a.     Value5
#b.     Value6
#%.    Value7
#%.    Value8
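
For reference, the split step can be reproduced roughly as follows. This is a minimal sketch, assuming the tokens strictly alternate between a key and a value and that no value contains a space; the variable names `raw` and `tokens` are illustrative. It yields the keys and values exactly as they appear in the raw string, so the trailing dots and capitalization shown in the table above are not reproduced.

```python
import pandas as pd

# Sample input copied from the question.
raw = "#a value1 #b value2 #% value3 #% value4 #a value5 #b value6 #% value7 #% value8"

# Tokens alternate: even positions are '#'-prefixed keys, odd positions are values.
tokens = raw.split()
df = pd.DataFrame({"Key": tokens[0::2], "value": tokens[1::2]})

print(df)
```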

But I need this:

Key    value
#a.     Value1
#b.     Value2
#%.    Value3,Value4
#a.     Value5
#b.     Value6
#%.    Value7,Value8

How can I perform this local groupby using pandas? One detail: this is a huge dataset (~2 GB), and I'm running all of this on a good but ordinary PC.

Juliano Oliveira asked Dec 07 '25 18:12



1 Answer

First create a helper key using shift and cumsum, so that each run of consecutive identical keys gets its own id. After that it becomes a regular groupby and join problem.

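# Helper key: increases by 1 each time Key differs from the previous row
s = (df.Key != df.Key.shift()).cumsum()
# Join values per (Key, run) group, then sort by run id to restore row order
df.groupby([df.Key, s]).value.apply(','.join).\
     sort_index(level=1).\
       reset_index(level=1, drop=True)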
Out[788]: 
Key
#a.           Value1
#b.           Value2
#%.    Value3,Value4
#a.           Value5
#b.           Value6
#%.    Value7,Value8
Name: value, dtype: object
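
To try the idea end to end, here is a self-contained sketch on the sample data from the question. It is a small variation on the snippet above, not a replacement for it: it passes `sort=False` to `groupby` so the groups come out in order of first appearance, which removes the need for `sort_index`. The names `run_id` and `result` are illustrative, and the sample frame is typed in by hand rather than parsed from a file.

```python
import pandas as pd

# Sample frame matching the table in the question (keys without the trailing dot).
df = pd.DataFrame({
    "Key": ["#a", "#b", "#%", "#%", "#a", "#b", "#%", "#%"],
    "value": ["Value1", "Value2", "Value3", "Value4",
              "Value5", "Value6", "Value7", "Value8"],
})

# A new run starts wherever the key differs from the previous row;
# the cumulative sum of those change flags gives one id per run.
run_id = (df["Key"] != df["Key"].shift()).cumsum().rename("run")

# Group by (Key, run) so that separate "#%" sections are not merged together.
# sort=False keeps the groups in order of first appearance.
result = (
    df.groupby([df["Key"], run_id], sort=False)["value"]
      .agg(",".join)
      .reset_index(level="run", drop=True)
)

print(result)
```

Grouping on the run id as well as the key is what keeps the two `#%` sections apart; grouping on `Key` alone would join all four values into a single row.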
BENY answered Dec 09 '25 15:12



