Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to group data frame with similar text in python

I have a data frame DF like this:

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def'],  
               'Description':['ABC String', 'ABC String', 'ABC String and sth', 'Only sth else', 'ABC String'],     
               'Value':[10, 20, 30, 40, 100]})  

enter image description here

I need to group it by Code and Description. Grouping by Code is simple:

GR = DF.groupby('Code')

enter image description here

Now I want to continue with grouping by Description, so all values that are equal or similar (have a common part) are grouped together. Can you help me with a formula to get something like this:

enter image description here

It might be there are two questions: 'equal values' and 'similar values'. If there is any hint on at least the 'equal values', that would be great.

like image 405
Rafal Xxx Avatar asked Oct 31 '25 18:10

Rafal Xxx


1 Answers

You could also use fuzzywuzzy to compute the Levensthein distance, even if there are more than two "similar" values

For example

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def', 'def', 'def', 'abc'],  
               'Description':['ABC String', 'ABC String', 
                              'ABC String and sth', 'Only sth else', 
                              'ABC String', 'CDEFGH', 'CDEFGH and sth', 
                              'CDEFGH and sth',],
               'Value':[10, 20, 30, 40, 50, 60, 70, 80]}) 

# for each unique value in Description
for d in DF.Description.unique():
    # compute Levensthein distance
    # and set to True if >= a limit
    # (you may have to play around with it)
    DF[d] = DF['Description'].apply(
        lambda x : fuzz.ratio(x, d) >= 60
    )
    # set a name for the group
    # here, simply the shortest
    m = np.min(DF[DF[d]==True].Description)
    # assign the group
    DF.loc[DF.Description==d, 'group'] = m

print(DF)

  Code         Description  Value  ABC String          group  \
0  abc          ABC String     10        True     ABC String   
1  abc          ABC String     20        True     ABC String   
2  abc  ABC String and sth     30        True     ABC String   
3  abc       Only sth else     40       False  Only sth else   
4  def          ABC String     50        True     ABC String   
5  def              CDEFGH     60       False         CDEFGH   
6  def      CDEFGH and sth     70       False         CDEFGH   
7  abc      CDEFGH and sth     80       False         CDEFGH   

   ABC String and sth  Only sth else  CDEFGH  CDEFGH and sth  
0                True          False   False           False  
1                True          False   False           False  
2                True          False   False           False  
3               False           True   False           False  
4                True          False   False           False  
5               False          False    True            True  
6               False          False    True            True  
7               False          False    True            True  

And now you can groupby the created groups

DF.groupby('group').Value.mean()

group
ABC String       27.5
CDEFGH           70.0
Only sth else    40.0
Name: Value, dtype: float64
like image 121
Max Pierini Avatar answered Nov 02 '25 08:11

Max Pierini



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!