what's up?
I have a small problem: I need to use the pandas dropna function to remove rows from my dataframe, but without deleting the rows whose id appears only once.
Let me explain better. I have the following dataframe:
| id | birthday | 
|---|---|
| 0102-2 | 09/03/2020 | 
| 0103-2 | 14/03/2020 | 
| 0104-2 | NaN | 
| 0105-2 | NaN | 
| 0105-2 | 25/03/2020 | 
| 0108-2 | 07/04/2020 | 
In the case above, I need to delete rows from my dataframe based on the NaN values in the birthday column. However, as you can see, the id "0104-2" is unique, unlike the id "0105-2", which has one row with NaN and another with a date. So I would like to keep every NaN row whose id is unique.
Is it feasible to do this with dropna, or would I have to pre-process the information beforehand?
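You could sort by the birthday column and then drop duplicates on the id column, keeping the first row for each id. Because sort_values places NaN values last by default, the row that has a date is the one that survives. The complete code would look like this: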
import pandas as pd
import numpy as np
data = {
    "id": ['102-2','103-2','104-2', '105-2', '105-2', '108-2'],
    "birthday":['09/03/2020', '14/03/2020', np.nan, np.nan, '25/03/2020', '07/04/2020']
}
df = pd.DataFrame(data)
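# Sort by birthday; NaN values are placed last by default.
df.sort_values(['birthday'], inplace=True)
# For each id keep only the first row, i.e. the one with a date when one exists.
df.drop_duplicates(subset="id", keep='first', inplace=True)
# Sort by id again to restore the original ordering.
df.sort_values(['id'], inplace=True)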
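
If you would rather not modify the dataframe in place, the same three steps can be chained. This is only a sketch of the approach above: the function name `drop_nan_duplicates` and its `key` and `col` parameters are made up for illustration, and `na_position="last"` is spelled out even though it is already the default.

```python
import numpy as np
import pandas as pd


def drop_nan_duplicates(df, key="id", col="birthday"):
    """Keep one row per `key`, preferring a row whose `col` is not NaN.

    Hypothetical helper, not part of pandas.
    """
    return (
        df.sort_values(col, na_position="last")      # NaN rows go to the end
          .drop_duplicates(subset=key, keep="first")  # first row per key wins
          .sort_values(key)                           # back to id order
          .reset_index(drop=True)
    )


df = pd.DataFrame({
    "id": ['102-2', '103-2', '104-2', '105-2', '105-2', '108-2'],
    "birthday": ['09/03/2020', '14/03/2020', np.nan, np.nan, '25/03/2020', '07/04/2020'],
})

result = drop_nan_duplicates(df)
print(result)
```

The original `df` keeps all six rows; `result` has one row per id, with NaN left only for '104-2'.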

Code explanation: here is the original dataframe:
import pandas as pd
import numpy as np
data = {
    "id": ['102-2','103-2','104-2', '105-2', '105-2', '108-2'],
    "birthday":['09/03/2020', '14/03/2020', np.nan, np.nan, '25/03/2020', '07/04/2020']
}
df = pd.DataFrame(data)

Now sort the dataframe:
df.sort_values(['birthday'], inplace=True)
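
The sort matters because of where pandas puts missing values. A small illustration on a throwaway three-row frame (the name `demo` is just for this example) shows the default next to `na_position="first"`:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"birthday": ['25/03/2020', np.nan, '09/03/2020']})

# Default behaviour: na_position="last", so the NaN row ends up at the bottom.
nan_last = demo.sort_values("birthday")
# With na_position="first" the NaN row would come first instead.
nan_first = demo.sort_values("birthday", na_position="first")

print(nan_last.index.tolist())   # [2, 0, 1]
print(nan_first.index.tolist())  # [1, 2, 0]
```

Note that the dates here are plain strings, so they are ordered as text rather than chronologically. For this task that is fine, since only the position of the NaN rows matters.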

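Then drop the duplicates based on the id column, keeping only the first row for each id:
df.drop_duplicates(subset="id", keep='first', inplace=True)

Finally, sort by id again to restore the original order:
df.sort_values(['id'], inplace=True)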
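
One caveat: drop_duplicates keeps a single row per id, so it would also discard an id's second row when both rows have dates. If that is a concern, a boolean mask is a possible alternative. This is a different technique from the sort-and-drop approach above, offered as a sketch; the variable names `dates_per_id`, `redundant` and `filtered` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ['102-2', '103-2', '104-2', '105-2', '105-2', '108-2'],
    "birthday": ['09/03/2020', '14/03/2020', np.nan, np.nan, '25/03/2020', '07/04/2020'],
})

# count() ignores NaN, so this is the number of real dates each id has,
# broadcast back to every row of that id.
dates_per_id = df.groupby("id")["birthday"].transform("count")

# A NaN row is redundant only when the same id has a date somewhere else.
redundant = df["birthday"].isna() & (dates_per_id > 0)

filtered = df[~redundant]
print(filtered)
```

This keeps the original row order and index, drops only the NaN row of '105-2', and leaves the NaN row of '104-2' in place because that id has no date at all.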
