I have a serialized dataset whose content is separated by spaces, like this: #a value1 #b value2 ..., where each token starting with # is a column name and the token after it is the value. My problem is that some sections of the dataset contain a sequence like "#% value1 #% value2". This specific marker represents a column with multiple values, so I need a mechanism to merge these multiple rows into one. E.g. original data = #a value1 #b value2 #% value3 #% value4 #a value5 #b value6 #% value7 #% value8
After my split process:
Key value
#a. Value1
#b. Value2
#%. Value3
#%. Value4
#a. Value5
#b. Value6
#%. Value7
#%. Value8
But I need this:
Key value
#a. Value1
#b. Value2
#%. Value3,Value4
#a. Value5
#b. Value6
#%. Value7,Value8
How can I perform this local groupby using pandas? One detail: the dataset is huge (~2 GB) and I'm running all of this on a good, but ordinary, PC.
First create a helper key with shift and cumsum, so that each run of consecutive identical keys gets its own ID; then it becomes a regular groupby-and-join problem:
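# s increments whenever Key differs from the previous row, so each run of equal keys shares one ID
s = (df.Key != df.Key.shift()).cumsum()
# join the values within each (Key, run) group, restore the original order, then drop the run ID
df.groupby([df.Key, s]).value.apply(','.join).\
    sort_index(level=1).\
    reset_index(level=1, drop=True)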
Out[788]:
Key
#a. Value1
#b. Value2
#%. Value3,Value4
#a. Value5
#b. Value6
#%. Value7,Value8
Name: value, dtype: object
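For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the same shift/cumsum idea. It builds the frame from the example string in the question, assuming the values never contain spaces. The names raw, tokens, run_id and result are made up for illustration and are not from the original post. It also swaps sort_index for sort=False in the groupby, which should keep the rows in their original order without a separate sorting step.

```python
import pandas as pd

# Sample input copied from the question's example string.
raw = "#a value1 #b value2 #% value3 #% value4 #a value5 #b value6 #% value7 #% value8"

# Split on whitespace and pair the tokens: even positions are keys, odd positions are values.
# Assumption: values never contain spaces.
tokens = raw.split()
df = pd.DataFrame({"Key": tokens[0::2], "value": tokens[1::2]})

# A new run starts wherever the key differs from the previous row's key.
run_id = (df["Key"] != df["Key"].shift()).cumsum().rename("run")

# Group by (run, Key) in order of first appearance and join the values within each run.
result = (
    df.groupby([run_id, "Key"], sort=False)["value"]
      .agg(",".join)
      .reset_index(level="run", drop=True)
)
print(result)
```

Note that this merges any run of consecutive identical keys, not only "#%". If other keys can legitimately repeat back to back and must stay separate, the run ID would need an extra condition on the key.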