I am new to ML and Data Science (I recently graduated with a Master's in Business Analytics) and am learning as much as I can on my own while looking for positions in Data Science / Business Analytics.
I am working on a practice dataset with the goal of predicting which customers are likely to miss their scheduled appointment. One of the columns in my dataset is "Neighbourhood", which contains the names of over 30 different neighbourhoods. My dataset has 10,000 observations, and some neighbourhood names appear fewer than 50 times. I think neighbourhoods that appear fewer than 50 times are too rare to be analyzed properly by machine learning models. Therefore, I want to remove the rows whose "Neighbourhood" value appears fewer than 50 times in that column.
I have been trying to write code for this for several hours, but I am struggling to get it right. So far, I have gotten to the version below:
my_df = my_df.drop(my_df["Neighbourhood"].value_counts() < 50, axis = 0)
I have also tried other versions of code to get rid of the rows in that categorical column, but I keep getting a similar error:
KeyError: '[False False ... True True] not found in axis'
I appreciate your help in advance, and thank you for sharing your knowledge and insights with me!
Try the code below. It uses the .loc indexer with a boolean mask to keep only the rows whose neighbourhood appears at least 50 times. The column name follows the spelling used in your own code ("Neighbourhood").
counts = my_df['Neighbourhood'].value_counts()
new_df = my_df.loc[my_df['Neighbourhood'].isin(counts.index[counts >= 50])]
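As for why the original attempt raised a KeyError: drop() expects index labels, but value_counts() < 50 produces a boolean Series indexed by neighbourhood name, so pandas tries to look up True/False as row labels and cannot find them. Here is a minimal, self-contained sketch of the filtering approach on toy data. The neighbourhood names, the "NoShow" column, and the min_count variable are made up for illustration; only the "Neighbourhood" column name and the threshold of 50 come from the question.

```python
import pandas as pd

# Toy stand-in for my_df: names, counts, and the NoShow column are invented.
my_df = pd.DataFrame({
    "Neighbourhood": ["Centro"] * 60 + ["Jardim"] * 55 + ["Ilha"] * 3,
    "NoShow": [0, 1] * 59,
})

min_count = 50  # hypothetical name for the threshold from the question

# Number of rows per neighbourhood, indexed by neighbourhood name.
counts = my_df["Neighbourhood"].value_counts()

# Option 1: keep rows whose neighbourhood is in the set of frequent names.
frequent = counts.index[counts >= min_count]
filtered_df = my_df[my_df["Neighbourhood"].isin(frequent)]

# Option 2: attach each row's group size and filter on it directly.
group_sizes = my_df.groupby("Neighbourhood")["Neighbourhood"].transform("size")
filtered_df2 = my_df[group_sizes >= min_count]
```

Both options should select the same rows; the second avoids the intermediate list of names, which some people find easier to read. If you would rather keep the rare rows, an alternative is to relabel those neighbourhoods with a single catch-all value instead of dropping them.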