I have a table looks below.
| lon | lat | output | |
|---|---|---|---|
| 4050 | -47.812224 | -19.043365 | 1890.283215 |
| 5149 | -47.812224 | -19.043365 | 1890.283215 |
| 7316 | -47.812224 | -19.043365 | 1890.283215 |
| 8406 | -47.812224 | -19.043365 | 1890.283215 |
| 511 | -47.812014 | -19.007094 | 1813.785728 |
| 1555 | -47.812014 | -19.007094 | 1813.785728 |
| 3764 | -47.812014 | -19.007094 | 1821.363582 |
| 4846 | -47.812014 | -19.007094 | 1813.785728 |
| 29 | -47.811177 | -19.008053 | 1763.091936 |
| 1114 | -47.811177 | -19.008053 | 1763.091936 |
| 3262 | -47.811177 | -19.008053 | 1763.091936 |
| 4357 | -47.811177 | -19.008053 | 1763.091936 |
| 1436 | -47.774424 | -19.008700 | 2172.781911 |
| 2557 | -47.774424 | -19.008700 | 2174.394848 |
| 4725 | -47.774424 | -19.008700 | 2172.781911 |
| 5840 | -47.774424 | -19.008700 | 2172.781911 |
| 5211 | -47.774166 | -19.043847 | 2897.092502 |
| 6313 | -47.774166 | -19.043847 | 2897.092502 |
| 8460 | -47.774166 | -19.043847 | 2897.092502 |
| 9543 | -47.774166 | -19.043847 | 2897.092502 |
| 1679 | -47.773958 | -19.007574 | 2179.670924 |
| 2770 | -47.773958 | -19.007574 | 2179.670924 |
| 4998 | -47.773958 | -19.007574 | 2179.670924 |
| 6088 | -47.773958 | -19.007574 | 2179.670924 |
| 1937 | -47.773121 | -19.008533 | 2236.769862 |
| 3004 | -47.773121 | -19.008533 | 2236.769862 |
| 5231 | -47.773121 | -19.008533 | 2236.769862 |
| 6332 | -47.773121 | -19.008533 | 2236.769862 |
I would like to drop the duplicates by using groupby on lon and lat but keep the most repeated value on output
for example
| lon | lat | output |
|---|---|---|
| -47.812224 | -19.043365 | 1890.283215 |
| -47.812014 | -19.007094 | 1813.785728 |
| -47.811177 | -19.008053 | 1763.091936 |
| -47.774424 | -19.008700 | 2172.781911 |
| -47.774166 | -19.043847 | 2897.092502 |
| -47.773958 | -19.007574 | 2179.670924 |
| -47.773121 | -19.008533 | 2236.769862 |
Could anyone tell me how to do this?
We can use the .groupby aggregate methods as an alternative to Andrej's method of using .apply and computing for each row.
While it does solve our problem .apply methods tend to become slower for a large dataset due to lack of vectorization.
Also, reset_index works much faster when used along with 'inplace=True'.
%%timeit
df.groupby(['lat', 'lon']).agg(pd.Series.mode).reset_index(inplace=True)
Hope that helps!!
You can combine .groupby with Series.mode:
x = df.groupby(["lon", "lat"])["output"].apply(lambda x: x.mode()[0])
print(x.reset_index())
Prints:
lon lat output
0 -47.812224 -19.043365 1890.283215
1 -47.812014 -19.007094 1813.785728
2 -47.811177 -19.008053 1763.091936
3 -47.774424 -19.008700 2172.781911
4 -47.774166 -19.043847 2897.092502
5 -47.773958 -19.007574 2179.670924
6 -47.773121 -19.008533 2236.769862
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With