I have a dataframe data with close to 4 million rows. It is a list of cities of the world, and I need to query by city name as fast as possible.
The fastest I have found so far is 346 ms, by indexing on the city name:
d2 = data.set_index("city", inplace=False)
%timeit d2.loc[['PARIS']]
1 loop, best of 3: 346 ms per loop
This is still much too slow. I wonder if a group-by could achieve a faster query (and how to write such a query). Each city has around 10 rows in the dataframe (duplicate city names). I have searched for several days and could not find a clear solution on the internet.
Thank you.
Setup
df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(100000)]*10,columns=['city','value'])
Baseline
df2 = df.set_index('city')
%timeit df2.loc[['Paris9999']]
10 loops, best of 3: 45.6 ms per loop
Solution
Build a lookup dict once, then query with iloc:
idx_dict = df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
%timeit df.iloc[idx_dict['Paris9999']]
1000 loops, best of 3: 432 µs per loop
This approach is roughly 100 times faster than the baseline.
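A self-contained sketch of the dict-lookup approach (same idea as above, with smaller sizes for illustration; the dict is built once up front and then every query is a plain hash lookup):

```python
import pandas as pd

# Sample frame: 1,000 distinct cities, each repeated 10 times
df = pd.DataFrame(
    data=[['Paris' + str(i), i] for i in range(1000)] * 10,
    columns=['city', 'value'],
)

# One-off cost: map each city to the row positions where it occurs
idx_dict = df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()

# Each query is then a dict lookup plus positional indexing
result = df.iloc[idx_dict['Paris999']]
print(len(result))  # 10 rows, one per duplicate
```

Note that this relies on the default RangeIndex, so the stored labels double as positions for iloc.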
Comparing to other approaches:
%timeit df2[df2.index.values=="Paris9999"]
100 loops, best of 3: 16.7 ms per loop
%timeit full_array_based(df2, "Paris9999")
10 loops, best of 3: 19.6 ms per loop
One option when looking for performance is to work with the index's underlying array: compare it against the wanted label and use the resulting boolean mask for indexing. A sample case makes this clear.
1) Input dataframes :
In [591]: df
Out[591]:
city population
0 Delhi 1000
1 Paris 56
2 NY 89
3 Paris 36
4 Delhi 300
5 Paris 52
6 Paris 34
7 Delhi 40
8 NY 89
9 Delhi 450
In [592]: d2 = df.set_index("city",inplace=False)
In [593]: d2
Out[593]:
population
city
Delhi 1000
Paris 56
NY 89
Paris 36
Delhi 300
Paris 52
Paris 34
Delhi 40
NY 89
Delhi 450
2) Indexing with .loc :
In [594]: d2.loc[['Paris']]
Out[594]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
3) Use mask based indexing :
In [595]: d2[d2.index.values=="Paris"]
Out[595]:
population
city
Paris 56
Paris 36
Paris 52
Paris 34
4) Finally timings :
In [596]: %timeit d2.loc[['Paris']]
1000 loops, best of 3: 475 µs per loop
In [597]: %timeit d2[d2.index.values=="Paris"]
10000 loops, best of 3: 156 µs per loop
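For reproducibility, here is a self-contained version of the mask-based lookup on the same sample data as the listings above:

```python
import pandas as pd

# Rebuild the sample frame from the listing above
df = pd.DataFrame({
    'city': ['Delhi', 'Paris', 'NY', 'Paris', 'Delhi',
             'Paris', 'Paris', 'Delhi', 'NY', 'Delhi'],
    'population': [1000, 56, 89, 36, 300, 52, 34, 40, 89, 450],
})
d2 = df.set_index('city')

# Mask-based indexing: compare the raw index array, then boolean-index
paris = d2[d2.index.values == 'Paris']
print(paris['population'].tolist())  # [56, 36, 52, 34]
```

Comparing against `d2.index.values` (a plain NumPy array) skips pandas' label-lookup machinery, which is where the speedup over `.loc` comes from.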
Further boost
Going further with array data, we can extract the entire input dataframe as an array and index into it. An implementation using that approach would look something like this -
def full_array_based(d2, indexval):
    df0 = pd.DataFrame(d2.values[d2.index.values == indexval])
    df0.index = [indexval] * df0.shape[0]
    df0.columns = d2.columns
    return df0
Sample run and timings -
In [635]: full_array_based(d2, "Paris")
Out[635]:
population
Paris 56
Paris 36
Paris 52
Paris 34
In [636]: %timeit full_array_based(d2, "Paris")
10000 loops, best of 3: 146 µs per loop
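As a quick sanity check (self-contained, using the same sample data as above), the array-based helper returns the same rows as plain .loc:

```python
import pandas as pd

def full_array_based(d2, indexval):
    # Work on the underlying NumPy arrays, then rebuild a small DataFrame
    df0 = pd.DataFrame(d2.values[d2.index.values == indexval])
    df0.index = [indexval] * df0.shape[0]
    df0.columns = d2.columns
    return df0

df = pd.DataFrame({
    'city': ['Delhi', 'Paris', 'NY', 'Paris', 'Delhi',
             'Paris', 'Paris', 'Delhi', 'NY', 'Delhi'],
    'population': [1000, 56, 89, 36, 300, 52, 34, 40, 89, 450],
})
d2 = df.set_index('city')

out = full_array_based(d2, 'Paris')
# Same rows as d2.loc[['Paris']], just rebuilt from raw arrays
print(out['population'].tolist())  # [56, 36, 52, 34]
```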