How to remove rows from a categorical variable whose value counts do not satisfy a condition?

I am new to ML and Data Science (I recently graduated with a Master's in Business Analytics) and am learning as much as I can on my own while looking for positions in Data Science / Business Analytics.

I am working on a practice dataset with the goal of predicting which customers are likely to miss their scheduled appointment. One of the columns in my dataset is "Neighbourhood", which contains the names of over 30 different neighborhoods. My dataset has 10,000 observations, and some neighborhood names appear fewer than 50 times. I think that neighborhoods that appear fewer than 50 times are too rare to be analyzed properly by machine learning models. Therefore, I want to remove the rows whose "Neighbourhood" value appears fewer than 50 times in that column.

I have been trying to write code for this for several hours, but I am struggling to get it right. So far, I have gotten to the version below:

my_df = my_df.drop(my_df["Neighbourhood"].value_counts() < 50, axis = 0)

I have also tried other versions of code to get rid of the rows in that categorical column, but I keep getting a similar error:

KeyError: '[False False ...  True  True] not found in axis'

I appreciate your help in advance, and thank you for sharing your knowledge and insights with me!

Arsik36 asked Nov 06 '25 22:11


1 Answer

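Try the code below. It uses the .loc indexer with a boolean mask to keep only the rows whose neighbourhood appears at least 50 times. Your drop call fails because drop expects index labels, whereas my_df["Neighbourhood"].value_counts() < 50 is a boolean Series indexed by neighbourhood name, so pandas tries to look up the True/False values as row labels and raises the KeyError.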

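# Count how many times each neighbourhood appears
counts = my_df['Neighbourhood'].value_counts()
# Keep only the rows whose neighbourhood appears at least 50 times
new_df = my_df.loc[my_df['Neighbourhood'].isin(counts.index[counts >= 50])]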
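
To see the filter in action, here is a minimal self-contained sketch on made-up data. The toy DataFrame, the "NoShow" column, and the `min_count` name are assumptions for illustration only, and the threshold is lowered from 50 to 3 so the effect is visible on nine rows. It also shows a groupby/transform variant that should give the same result.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset (assumption).
my_df = pd.DataFrame({
    "Neighbourhood": ["A", "A", "A", "B", "B", "C", "A", "B", "B"],
    "NoShow": [0, 1, 0, 0, 1, 1, 0, 0, 1],
})

# Illustrative threshold; the question uses 50 on 10,000 rows.
min_count = 3

# Option 1: count each neighbourhood, then keep rows whose
# neighbourhood is among the sufficiently frequent ones.
counts = my_df["Neighbourhood"].value_counts()
frequent = counts.index[counts >= min_count]
filtered = my_df.loc[my_df["Neighbourhood"].isin(frequent)]

# Option 2: attach each row's group size with groupby/transform
# and filter on it directly.
group_sizes = my_df.groupby("Neighbourhood")["Neighbourhood"].transform("size")
filtered_alt = my_df.loc[group_sizes >= min_count]

print(filtered["Neighbourhood"].value_counts())
```

In this toy data "C" appears only once, so its row is dropped while all "A" and "B" rows are kept. Both options keep the original row index; call `reset_index(drop=True)` afterwards if you want a fresh one.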
katardin answered Nov 09 '25 13:11