I am trying to figure out the fastest way to find all neighbors within a fixed distance in another DataFrame.
To illustrate the question, I have made a hypothetical example. Suppose there are two DataFrames, order_df and trade_df, each with two columns: an ID and a datetime.
import datetime
import pandas as pd
order_dict = {
    'Order ID': ['Order 1', 'Order 2', 'Order 3', 'Order 4'],
    'Order Time': [datetime.datetime(2023, 3, 8, 7, 5, 0), datetime.datetime(2023, 3, 8, 7, 10, 0), datetime.datetime(2023, 3, 8, 7, 15, 0), datetime.datetime(2023, 3, 8, 7, 20, 0)]
}
trade_dict = {
    'Trade ID': ['Trade 1001', 'Trade 1002', 'Trade 1003', 'Trade 1004', 'Trade 1005'],
    'Trade Time': [datetime.datetime(2023, 3, 8, 7, 8, 0), datetime.datetime(2023, 3, 8, 7, 9, 0), datetime.datetime(2023, 3, 8, 7, 17, 0), datetime.datetime(2023, 3, 8, 7, 18, 0), datetime.datetime(2023, 3, 8, 7, 30, 0)]
}
order_df = pd.DataFrame(order_dict)
trade_df = pd.DataFrame(trade_dict)
What I expect is: for each row in order_df, look up all Trade IDs in trade_df whose Trade Time is within an arbitrary number of minutes (e.g. 3 mins) before or after the Order Time.
E.g. when the fixed distance = 3 mins, I would expect the output to be:
Order ID Order Time Nearest Trade ID
0 Order 1 2023-03-08 07:05:00 Trade 1001
1 Order 2 2023-03-08 07:10:00 Trade 1001, Trade 1002
2 Order 3 2023-03-08 07:15:00 Trade 1003, Trade 1004
3 Order 4 2023-03-08 07:20:00 Trade 1003, Trade 1004
I have made the following attempt:
def fixed_distance_nearest_neighbors(x, fixed_distance, lookup_df, lookup_field, return_field):
    lower = x - datetime.timedelta(minutes=fixed_distance)
    upper = x + datetime.timedelta(minutes=fixed_distance)
    return ', '.join(lookup_df[lookup_df[lookup_field].between(lower, upper)][return_field])

order_df['Nearest Trade ID'] = order_df['Order Time'].apply(lambda x: fixed_distance_nearest_neighbors(x, 3, trade_df, 'Trade Time', 'Trade ID'))
However, I believe this is not an efficient approach. Specifically, when order_df and trade_df grow large, the above method becomes extremely slow, e.g.
import random
import time
def generate_random_date(n):
    random_date = []
    for i in range(0, n):
        random_date.append(datetime.datetime(2023, 3, 8, random.randint(0, 23), random.randint(0, 59), random.randint(0, 59)))
    return random_date
n = 100000
order_dict = {
    'Order ID': ['Order ' + str(x) for x in range(1, n + 1)],
    'Order Time': generate_random_date(n)
}
trade_dict = {
    'Trade ID': ['Trade ' + str(x) for x in range(1, n + 1)],
    'Trade Time': generate_random_date(n)
}
order_df = pd.DataFrame(order_dict)
trade_df = pd.DataFrame(trade_dict)
start = time.time()
order_df['Nearest Trade ID'] = order_df['Order Time'].apply(lambda x: fixed_distance_nearest_neighbors(x, 3, trade_df, 'Trade Time', 'Trade ID'))
end = time.time()
print(end - start)
#Output 214.34553575515747
It takes about 3.5 minutes to complete the execution. I believe there is a more pythonic way to boost the speed. Thank you very much.
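As an aside, the pure-Python loop in generate_random_date is itself a minor bottleneck at n = 100000. A vectorized sketch (assuming numpy is available; generate_random_date_fast is a hypothetical name) that draws a random second-of-day per row, matching the loop version's uniform distribution over the day:

```python
import numpy as np
import pandas as pd

def generate_random_date_fast(n, seed=None):
    # Draw a random second-of-day per row and add it to the fixed base date.
    # rng.integers has an exclusive upper bound, so results stay within the day.
    rng = np.random.default_rng(seed)
    seconds = rng.integers(0, 24 * 60 * 60, size=n)
    return pd.Timestamp("2023-03-08") + pd.to_timedelta(seconds, unit="s")

dates = generate_random_date_fast(100_000, seed=0)
print(dates.min() >= pd.Timestamp("2023-03-08"))  # True
print(dates.max() < pd.Timestamp("2023-03-09"))   # True
```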
The following makes use of a BallTree.
The 10,000 generated order/trade example is processed in about 5 seconds. The reason it is 'slow' is that many trade/order pairs collide. If the date ranges are more spread out, and only a handful of trades are associated with an order, it is faster.
I show the code on the toy example; the exact same code was used to test the larger instance.
Starting from order_dict and trade_dict, I used the following to convert to unix time:
order = pd.DataFrame(order_dict)
trade = pd.DataFrame(trade_dict)
order["unix_time"] = (order["Order Time"] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
trade["unix_time"] = (trade["Trade Time"] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
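As a quick sanity check on this conversion (a minimal standalone sketch, independent of the answer's frames), the toy example's first order time maps to the unix_time value shown in the result table further down:

```python
import pandas as pd

# 2023-03-08 07:05:00 is the toy example's first Order Time.
ts = pd.Series(pd.to_datetime(["2023-03-08 07:05:00"]))

# Integer-divide the offset from the epoch by one second.
unix = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
print(int(unix.iloc[0]))  # 1678259100
```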
We prepare the data for the tree with:
from sklearn.neighbors import BallTree
import numpy as np
order_times = np.expand_dims(order.unix_time.values, axis=1)
trade_times = np.expand_dims(trade.unix_time.values, axis=1)
...and create and query our tree with:
tree = BallTree(trade_times, leaf_size=15, metric="manhattan")
idx, distance_in_seconds = tree.query_radius(order_times, r=60*3, sort_results=True, return_distance=True)
The above operation is the actual trade/order linking, and took about 5 seconds for 10,000 trades/orders. Building the tree also takes time, so if you know you are going to repeat this operation, try not to rebuild the tree when it isn't needed.
sort_results=True, return_distance=True makes sure the output is ordered, i.e. the closest trade/order comes first. The radius is 60*3 seconds. Then
order["Nearest Trades"] = idx
will give us:
Order ID Order Time unix_time Nearest Trades
0 Order 1 2023-03-08 07:05:00 1678259100 [0]
1 Order 2 2023-03-08 07:10:00 1678259400 [1, 0]
2 Order 3 2023-03-08 07:15:00 1678259700 [2, 3]
3 Order 4 2023-03-08 07:20:00 1678260000 [3, 2]
Notice that for Order 2, the closest trade is trade 1, followed by trade 0. Getting the exact names becomes a heavy operation for larger instances, so I would use indices instead of names.
You also have distance_in_seconds, which has the same shape as idx but, instead of indices, contains the distance in seconds for each corresponding index.
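If you do need the names, a minimal sketch of mapping the index arrays back to Trade IDs with numpy fancy indexing (the trade_ids array and idx list below are stand-ins for the real trade["Trade ID"] column and the query_radius result on the toy example):

```python
import numpy as np

# Stand-ins for trade["Trade ID"].to_numpy() and the idx returned
# by tree.query_radius for the toy example.
trade_ids = np.array(["Trade 1001", "Trade 1002", "Trade 1003", "Trade 1004", "Trade 1005"])
idx = [np.array([0]), np.array([1, 0]), np.array([2, 3]), np.array([3, 2])]

# Fancy indexing turns each per-order index array into the matching names,
# preserving the closest-first ordering.
nearest = [", ".join(trade_ids[ix]) for ix in idx]
print(nearest[1])  # Trade 1002, Trade 1001
```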
You can use how='cross' as a parameter of merge (if your DataFrames are not too large):
trades = (order_df.merge(trade_df, how='cross')
                  .loc[lambda x: x['Order Time'].sub(x['Trade Time']).abs().lt('3min')]
                  .groupby('Order ID')['Trade ID'].agg(', '.join))

order_df['Nearest Trade'] = order_df['Order ID'].map(trades).fillna('')
Output:
>>> order_df
Order ID Order Time Nearest Trade
0 Order 1 2023-03-08 07:05:00
1 Order 2 2023-03-08 07:10:00 Trade 1001, Trade 1002
2 Order 3 2023-03-08 07:15:00 Trade 1003
3 Order 4 2023-03-08 07:20:00 Trade 1004
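Note that lt('3min') is a strict comparison, which is why trades exactly 3 minutes away (e.g. Trade 1001 for Order 1) are missing here compared with the question's expected table; swapping in le('3min') makes the boundary inclusive. A minimal sketch of the difference:

```python
import pandas as pd

# Timedeltas of exactly 2 and exactly 3 minutes.
td = pd.Series([pd.Timedelta(minutes=2), pd.Timedelta(minutes=3)])

print(td.lt("3min").tolist())  # [True, False] -- strict, excludes the boundary
print(td.le("3min").tolist())  # [True, True]  -- inclusive, matches the expected output
```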