I am using Pandas, Jupyter Notebooks and Python. I have a dataset with 4 columns and 10000 records. When I use the following code to pick up duplicates, it seems to pick up incorrect records. FYI: the data types of the columns are as follows:
Initial_Date = int64
Final_Date = int64
Origin = object
sub_location = object
My current code is:
dup = df.duplicated(['Initial_Date','Final_Date','Origin','sub_location'], keep='last')
Here is an example of the dataset that is being picked up using the above code:
00121980,00121980,Australia,Brighton:Queensland
00121980,00121980,Australia,Brisbane:Queensland
17021987,17021987,Bangladesh,Sylhet-Sunamganj
17021987,17021987,Brazil,Sao Paolo suburb
If you look at the first two records, the initial date, final date and Origin match, but the sub_location does not: one is Brighton and the other is Brisbane.
The same applies to the last two records: the dates match but the Origin is not the same.
From this, I understand that df.duplicated is not picking up correct records or I am not using it properly. Do data types matter with df.duplicated?
If I just use df.duplicated then the boolean series that is returned has NO duplicates. Can someone please explain/show me how .duplicated is used?
Please keep in mind that this is not the full dataset however the example that I have presented is exactly the problem I have in the real dataset. I narrowed the df.duplicated criteria and came across this error.
Thanks guys :D
Pay attention to the keep parameter:
In [116]: s = pd.Series([1,1,1,2,3])
In [117]: s
Out[117]:
0 1
1 1
2 1
3 2
4 3
dtype: int64
In [118]: s.duplicated(keep='first')
Out[118]:
0 False
1 True
2 True
3 False
4 False
dtype: bool
In [119]: s.duplicated(keep='last')
Out[119]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
In [120]: s.duplicated(keep=False)
Out[120]:
0 True
1 True
2 True
3 False
4 False
dtype: bool
I guess you want to use keep=False
from docs:
keep : {‘first’, ‘last’, False}, default ‘first’
first : Mark duplicates as True except for the first occurrence.
last : Mark duplicates as True except for the last occurrence.
False : Mark all duplicates as True.
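To see how this plays out on a DataFrame with a column subset, here is a minimal sketch. The data is made up for illustration, loosely modelled on the sample rows in the question, with one extra row added as an exact repeat of the first. Filtering with keep=False and then sorting puts each flagged row next to the row it matches, which makes it easier to check whether a flagged row really has a twin somewhere else in the data.

```python
import pandas as pd

# Hypothetical data: row 4 is an exact repeat of row 0; rows 1-3 are unique.
df = pd.DataFrame({
    "Initial_Date": [121980, 121980, 17021987, 17021987, 121980],
    "Final_Date": [121980, 121980, 17021987, 17021987, 121980],
    "Origin": ["Australia", "Australia", "Bangladesh", "Brazil", "Australia"],
    "sub_location": [
        "Brighton:Queensland",
        "Brisbane:Queensland",
        "Sylhet-Sunamganj",
        "Sao Paolo suburb",
        "Brighton:Queensland",
    ],
})

cols = ["Initial_Date", "Final_Date", "Origin", "sub_location"]

# keep='last' flags only the earlier occurrence (row 0), not its match (row 4).
mask_last = df.duplicated(subset=cols, keep="last")

# keep=False flags every row that belongs to a duplicate group (rows 0 and 4).
mask_all = df.duplicated(subset=cols, keep=False)

# Show the duplicate groups side by side.
dup_rows = df[mask_all].sort_values(cols)
print(dup_rows)
```

With keep='last', a flagged row is displayed without its matching row, so a filtered view can look like a list of unrelated records. That may be what is happening in the question: each flagged row could have an identical twin further down the dataset rather than being matched with its neighbour.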