pandas groupby based on multi-columns, but keep the most repeated duplicates number on other column

Question

I have a table looks below.

	lon	lat	output
4050	-47.812224	-19.043365	1890.283215
5149	-47.812224	-19.043365	1890.283215
7316	-47.812224	-19.043365	1890.283215
8406	-47.812224	-19.043365	1890.283215
511	-47.812014	-19.007094	1813.785728
1555	-47.812014	-19.007094	1813.785728
3764	-47.812014	-19.007094	1821.363582
4846	-47.812014	-19.007094	1813.785728
29	-47.811177	-19.008053	1763.091936
1114	-47.811177	-19.008053	1763.091936
3262	-47.811177	-19.008053	1763.091936
4357	-47.811177	-19.008053	1763.091936
1436	-47.774424	-19.008700	2172.781911
2557	-47.774424	-19.008700	2174.394848
4725	-47.774424	-19.008700	2172.781911
5840	-47.774424	-19.008700	2172.781911
5211	-47.774166	-19.043847	2897.092502
6313	-47.774166	-19.043847	2897.092502
8460	-47.774166	-19.043847	2897.092502
9543	-47.774166	-19.043847	2897.092502
1679	-47.773958	-19.007574	2179.670924
2770	-47.773958	-19.007574	2179.670924
4998	-47.773958	-19.007574	2179.670924
6088	-47.773958	-19.007574	2179.670924
1937	-47.773121	-19.008533	2236.769862
3004	-47.773121	-19.008533	2236.769862
5231	-47.773121	-19.008533	2236.769862
6332	-47.773121	-19.008533	2236.769862

I would like to drop the duplicates by using groupby on lon and lat but keep the most repeated value on output

for example

lon	lat	output
-47.812224	-19.043365	1890.283215
-47.812014	-19.007094	1813.785728
-47.811177	-19.008053	1763.091936
-47.774424	-19.008700	2172.781911
-47.774166	-19.043847	2897.092502
-47.773958	-19.007574	2179.670924
-47.773121	-19.008533	2236.769862

Could anyone tell me how to do this?

Rushiraj Chavan · Accepted Answer

We can use the .groupby aggregate methods as an alternative to Andrej's method of using .apply and computing for each row.

While it does solve our problem .apply methods tend to become slower for a large dataset due to lack of vectorization.

Also, reset_index works much faster when used along with 'inplace=True'.

%%timeit
df.groupby(['lat', 'lon']).agg(pd.Series.mode).reset_index(inplace=True)

Hope that helps!!

Andrej Kesely · Answer

You can combine .groupby with Series.mode:

x = df.groupby(["lon", "lat"])["output"].apply(lambda x: x.mode()[0])
print(x.reset_index())

Prints:

         lon        lat       output
0 -47.812224 -19.043365  1890.283215
1 -47.812014 -19.007094  1813.785728
2 -47.811177 -19.008053  1763.091936
3 -47.774424 -19.008700  2172.781911
4 -47.774166 -19.043847  2897.092502
5 -47.773958 -19.007574  2179.670924
6 -47.773121 -19.008533  2236.769862

pandas groupby based on multi-columns, but keep the most repeated duplicates number on other column

Tags:

pandas

duplicates

pandas-groupby

Franke Hsu

2 Answers

Rushiraj Chavan

Andrej Kesely

Recent Activity

Donate For Us

pandas groupby based on multi-columns, but keep the most repeated duplicates number on other column

Tags:

pandas

duplicates

pandas-groupby

Franke Hsu

2 Answers

Rushiraj Chavan

Andrej Kesely

Related questions

Recent Activity

Donate For Us