How can I replace the values from certain columns in a pandas.DataFrame that occur rarely, i.e. with low frequency (while ignoring NaNs)?
For example, in the following dataframe, suppose I want to replace any value in column A or B that occurs fewer than three times in its respective column. I want to replace these values with "other":
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['ant','ant','cherry', np.nan, 'ant'], 'B':['cat','peach', 'cat', 'cat', 'peach'], 'C':['dog','dog',np.nan, 'emu', 'emu']})
df
   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | peach | dog |
cherry | cat   | NaN |
NaN    | cat   | emu |
ant    | peach | emu |
In other words, in columns A and B, I want to replace those values that occur twice or less (but leave NaNs alone).
So the output I want is:
   A   |   B   |  C  |
----------------------
ant    | cat   | dog |
ant    | other | dog |
other  | cat   | NaN |
NaN    | cat   | emu |
ant    | other | emu |
This is related to a previously posted question, "Remove low frequency values from pandas.dataframe",
but the solution there resulted in "AttributeError: 'NoneType' object has no attribute 'any'" (I think because I have NaN values?).
This is pretty similar to Change values in pandas dataframe according to value_counts(). You can add a condition to the lambda function to exclude column 'C' as follows:
df.apply(lambda x: x.mask(x.map(x.value_counts())<3, 'other') if x.name!='C' else x)
Out: 
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3    NaN    cat  emu
4    ant  other  emu
This iterates over the columns. For each column, it computes the value counts and maps each value to its count using that Series. x.mask then checks whether the count is smaller than 3: if so, it returns 'other'; otherwise it keeps the original value. NaNs are left alone because value_counts() ignores them by default, so they map to NaN, and NaN < 3 is False. Lastly, the if/else checks the column name so that column 'C' is returned unchanged.
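To make the steps concrete, here is a minimal sketch that applies the same logic to column A alone. The intermediate names (counts, per_row, is_rare) are only for illustration and do not appear in the one-liner above.

```python
import numpy as np
import pandas as pd

# Column A from the question
x = pd.Series(['ant', 'ant', 'cherry', np.nan, 'ant'], name='A')

counts = x.value_counts()          # NaN is excluded by default: ant -> 3, cherry -> 1
per_row = x.map(counts)            # each value replaced by its count; the NaN row maps to NaN
is_rare = per_row < 3              # NaN < 3 is False, so the NaN row is never masked
result = x.mask(is_rare, 'other')  # replace only where is_rare is True

print(result)
```

Only 'cherry' is replaced here; the three 'ant' rows and the NaN row keep their original values.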
The lambda's condition can be generalized to multiple columns by changing x.name!='C' to x.name not in ['C', 'D', 'E', 'F']. The shorter x.name not in 'CDEF' also works, but only because it is a substring test on single-character column names.
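As a variation, the columns to process can be listed explicitly instead of listing the ones to skip. This is a sketch under that assumption rather than the approach described above; the names cols and out are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['ant', 'ant', 'cherry', np.nan, 'ant'],
                   'B': ['cat', 'peach', 'cat', 'cat', 'peach'],
                   'C': ['dog', 'dog', np.nan, 'emu', 'emu']})

cols = ['A', 'B']   # columns whose rare values should be replaced (illustrative name)

out = df.copy()     # leave the original frame untouched
# Apply the same per-column mask, but only to the selected columns
out[cols] = df[cols].apply(lambda x: x.mask(x.map(x.value_counts()) < 3, 'other'))

print(out)
```

Selecting the columns up front avoids the name check inside the lambda, and every column not in cols is carried over unchanged.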
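Using a helper function and replace. Note that this version counts values across all of the selected columns together rather than per column; for this example the result is the same.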
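def replace_low_freq(df, threshold=2, replacement='other'):
    s = df.stack()                 # reshape to one long Series
    c = s.value_counts()           # counts pooled over all columns passed in
    # map every value that occurs <= threshold times to the replacement
    m = pd.Series(replacement, c.index[c <= threshold])
    return s.replace(m).unstack()  # reshape back to the original layout
cols = list('AB')
replace_low_freq(df[cols]).join(df.drop(columns=cols))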
       A      B    C
0    ant    cat  dog
1    ant  other  dog
2  other    cat  NaN
3   None    cat  emu
4    ant  other  emu