How to group data frame with similar text in python

Question

I have a data frame DF like this:

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def'],  
               'Description':['ABC String', 'ABC String', 'ABC String and sth', 'Only sth else', 'ABC String'],     
               'Value':[10, 20, 30, 40, 100]})

enter image description here

I need to group it by Code and Description. Grouping by Code is simple:

GR = DF.groupby('Code')

enter image description here

Now I want to continue with grouping by Description, so all values that are equal or similar (have a common part) are grouped together. Can you help me with a formula to get something like this:

enter image description here

It might be there are two questions: 'equal values' and 'similar values'. If there is any hint on at least the 'equal values', that would be great.

Max Pierini · Accepted Answer

You could also use fuzzywuzzy to compute the Levensthein distance, even if there are more than two "similar" values

For example

import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz

DF = pd.DataFrame({'Code':['abc', 'abc', 'abc', 'abc', 'def', 'def', 'def', 'abc'],  
               'Description':['ABC String', 'ABC String', 
                              'ABC String and sth', 'Only sth else', 
                              'ABC String', 'CDEFGH', 'CDEFGH and sth', 
                              'CDEFGH and sth',],
               'Value':[10, 20, 30, 40, 50, 60, 70, 80]}) 

# for each unique value in Description
for d in DF.Description.unique():
    # compute Levensthein distance
    # and set to True if >= a limit
    # (you may have to play around with it)
    DF[d] = DF['Description'].apply(
        lambda x : fuzz.ratio(x, d) >= 60
    )
    # set a name for the group
    # here, simply the shortest
    m = np.min(DF[DF[d]==True].Description)
    # assign the group
    DF.loc[DF.Description==d, 'group'] = m

print(DF)

  Code         Description  Value  ABC String          group  \
0  abc          ABC String     10        True     ABC String   
1  abc          ABC String     20        True     ABC String   
2  abc  ABC String and sth     30        True     ABC String   
3  abc       Only sth else     40       False  Only sth else   
4  def          ABC String     50        True     ABC String   
5  def              CDEFGH     60       False         CDEFGH   
6  def      CDEFGH and sth     70       False         CDEFGH   
7  abc      CDEFGH and sth     80       False         CDEFGH   

   ABC String and sth  Only sth else  CDEFGH  CDEFGH and sth  
0                True          False   False           False  
1                True          False   False           False  
2                True          False   False           False  
3               False           True   False           False  
4                True          False   False           False  
5               False          False    True            True  
6               False          False    True            True  
7               False          False    True            True

And now you can groupby the created groups

DF.groupby('group').Value.mean()

group
ABC String       27.5
CDEFGH           70.0
Only sth else    40.0
Name: Value, dtype: float64

How to group data frame with similar text in python

Tags:

python

pandas

pandas-groupby

Rafal Xxx

1 Answers

Max Pierini

Recent Activity

Donate For Us

How to group data frame with similar text in python

Tags:

python

pandas

pandas-groupby

Rafal Xxx

1 Answers

Max Pierini

Related questions

Recent Activity

Donate For Us