
Map column values to 'miscellaneous' if the value count is below a threshold - Categorical Column - Pandas DataFrame

Tags:

python

pandas

I have a pandas dataframe of shape ~ [200K, 40]. The dataframe has a categorical column (one of many) with over 1000 unique values. I can view the count of each unique value in that column by using:

df['column_name'].value_counts()

How do I now group together values:

  • with a value count less than a threshold, say 100, and map them to, say, "miscellaneous"?
  • OR based on the cumulative row count % ?
redwolf_cr7 asked Oct 14 '25 17:10


1 Answer

You can extract the values you want to mask from the index of value_counts and then map them to "miscellaneous" using replace:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, (2000, 2)), columns=['A', 'B'])

frequencies = df['A'].value_counts()

condition = frequencies < 200   # you can define it however you want
mask_obs = frequencies[condition].index
mask_dict = dict.fromkeys(mask_obs, 'miscellaneous')

df['A'] = df['A'].replace(mask_dict)  # or you could make a copy not to modify original data
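
As an alternative to building a replacement dictionary, the same grouping can be sketched with Series.map and Series.where. This is only a minimal sketch, not part of the approach above: the helper name lump_rare and its parameters min_count and other are hypothetical, and it uses a small string column so the dtype does not change when the label is written in.

```python
import pandas as pd


def lump_rare(series, min_count, other="miscellaneous"):
    """Hypothetical helper: relabel values seen fewer than min_count times."""
    # Map every row to the frequency of its own value.
    counts = series.map(series.value_counts())
    # Keep rows whose value is frequent enough; relabel the rest.
    # Missing values get no count, so they are relabelled as well.
    return series.where(counts >= min_count, other)


s = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] + ["d"])
lumped = lump_rare(s, min_count=3)
print(lumped.value_counts())
```

Here "c" and "d" each occur once, so both end up under "miscellaneous", while "a" and "b" are kept. The original series is left untouched because where returns a new object.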

Now, using value_counts will group all the values below your threshold as miscellaneous:

df['A'].value_counts()
Out[18]: 
miscellaneous    947
3                226
1                221
0                204
7                201
2                201
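
For the second part of the question (grouping by cumulative row count %), one possible sketch is to sort the categories by frequency, keep the most frequent ones until they cover a chosen share of the rows, and relabel everything else. The helper name lump_by_coverage and the parameters coverage and other are assumptions made for illustration, and "keep categories until the share is reached" is just one reasonable reading of the requirement.

```python
import pandas as pd


def lump_by_coverage(series, coverage=0.9, other="miscellaneous"):
    """Hypothetical helper: keep top categories covering a share of rows."""
    counts = series.value_counts()      # sorted from most to least frequent
    rows_before = counts.cumsum() - counts  # rows covered by earlier categories
    # Keep a category while the categories ahead of it have not yet
    # reached the requested share of (non-missing) rows.
    keep = counts.index[rows_before < coverage * counts.sum()]
    return series.where(series.isin(keep), other)


s = pd.Series(["a"] * 6 + ["b"] * 2 + ["c"] + ["d"])
lumped = lump_by_coverage(s, coverage=0.75)
print(lumped.value_counts())
```

In this toy series "a" alone covers 60% of the rows, so "b" is also kept to reach the 75% target, and "c" and "d" are mapped to "miscellaneous".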
FLab answered Oct 19 '25 01:10