Optimize a string query with pandas on large data

I have a dataframe data with close to 4 million rows. It is a list of cities of the world. I need to query a given city name as fast as possible.

The best I found so far is 346 ms, by indexing on the city name:

d2 = data.set_index("city", inplace=False)

%timeit d2.loc[['PARIS']]

1 loop, best of 3: 346 ms per loop

This is still much too slow. I wonder whether a group-by could give a faster query (and how to write such a query). Each city has around 10 rows in the dataframe (duplicate city names). I have searched for several days and could not find a clear solution on the internet.
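For reproducibility, here is a synthetic stand-in with roughly the same shape as my data (a sketch; the city names and values are made up):

import pandas as pd
import numpy as np

# Stand-in for the real data: 400,000 distinct cities,
# each duplicated 10 times -> 4,000,000 rows
data = pd.DataFrame({
    'city': np.repeat(['CITY%d' % i for i in range(400000)], 10),
    'population': np.random.randint(0, 10**6, size=4000000),
})

d2 = data.set_index("city", inplace=False)
d2.loc[['CITY9999']]  # the lookup that needs to be fast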

thank you

asked by laurentngu

2 Answers

Setup

import pandas as pd

# 100,000 distinct cities, each duplicated 10 times -> 1,000,000 rows
df = pd.DataFrame(data=[['Paris' + str(i), i] for i in range(100000)] * 10,
                  columns=['city', 'value'])

Baseline

df2 = df.set_index('city')
%timeit df2.loc[['Paris9999']]
10 loops, best of 3: 45.6 ms per loop

Solution

Build a lookup dict once, then use iloc:

idx_dict = df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()

%timeit df.iloc[idx_dict['Paris9999']]
1000 loops, best of 3: 432 µs per loop

This approach is almost 100 times faster than the baseline. (iloc works here because the default RangeIndex makes the stored labels coincide with integer positions.)
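A closely related variant (a sketch, not benchmarked here) skips the apply entirely: GroupBy.indices already returns a dict mapping each group key to the positional row indices of its group, which is exactly what iloc needs:

# .indices maps each group key to an array of integer positions,
# so it feeds .iloc directly, even with a non-default index.
idx_dict2 = df.groupby('city').indices

df.iloc[idx_dict2['Paris9999']]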

Comparing with other approaches (full_array_based is defined in the answer below):

%timeit df2[df2.index.values=="Paris9999"]
100 loops, best of 3: 16.7 ms per loop

%timeit full_array_based(df2, "Paris9999")
10 loops, best of 3: 19.6 ms per loop
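One more lever worth noting (a sketch, not timed against the above): sorting the index makes it monotonic, so .loc can use binary search rather than scanning every row:

# On a sorted (monotonic) index, label lookups can use binary
# search, O(log n), instead of an O(n) scan over all rows.
df3 = df.set_index('city').sort_index()
df3.loc[['Paris9999']]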
answered by Allen


One option when looking for performance is to work with the index's underlying array data: compare it against the needed index value, then use the resulting mask for indexing. A sample case should make things clear.

1) Input dataframes:

In [591]: df
Out[591]: 
    city  population
0  Delhi        1000
1  Paris          56
2     NY          89
3  Paris          36
4  Delhi         300
5  Paris          52
6  Paris          34
7  Delhi          40
8     NY          89
9  Delhi         450

In [592]: d2 = df.set_index("city",inplace=False)

In [593]: d2
Out[593]: 
       population
city             
Delhi        1000
Paris          56
NY             89
Paris          36
Delhi         300
Paris          52
Paris          34
Delhi          40
NY             89
Delhi         450

2) Indexing with .loc:

In [594]: d2.loc[['Paris']]
Out[594]: 
       population
city             
Paris          56
Paris          36
Paris          52
Paris          34

3) Mask-based indexing:

In [595]: d2[d2.index.values=="Paris"]
Out[595]: 
       population
city             
Paris          56
Paris          36
Paris          52
Paris          34

4) Finally, timings:

In [596]: %timeit d2.loc[['Paris']]
1000 loops, best of 3: 475 µs per loop

In [597]: %timeit d2[d2.index.values=="Paris"]
10000 loops, best of 3: 156 µs per loop

Further boost

Going further with array data, we can extract the entire input dataframe as an array and index into it. An implementation along those lines would look something like this -

def full_array_based(d2, indexval):
    # Mask the raw values array, then rebuild a small dataframe
    # with the matched label as the index and the original columns.
    df0 = pd.DataFrame(d2.values[d2.index.values == indexval])
    df0.index = [indexval] * df0.shape[0]
    df0.columns = d2.columns
    return df0

Sample run and timings -

In [635]: full_array_based(d2, "Paris")
Out[635]: 
       population
Paris          56
Paris          36
Paris          52
Paris          34

In [636]: %timeit full_array_based(d2, "Paris")
10000 loops, best of 3: 146 µs per loop
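If only the matched values are needed rather than a rebuilt dataframe, dropping the DataFrame construction should be faster still; a sketch under that assumption -

def values_only(d2, indexval):
    # Same boolean-mask lookup, but return the raw NumPy array
    # instead of rebuilding a DataFrame around it.
    return d2.values[d2.index.values == indexval]

values_only(d2, "Paris")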
answered by Divakar


