I am trying to figure out the fastest way to find all neighbors within a fixed distance in another DataFrame.
To illustrate the question, I have made a hypothetical example. Suppose there are two DataFrames, order_df and trade_df, each with two columns: an ID and a datetime.
import datetime
import pandas as pd
order_dict = {
    'Order ID': ['Order 1', 'Order 2', 'Order 3', 'Order 4'],
    'Order Time': [datetime.datetime(2023, 3, 8, 7, 5, 0), datetime.datetime(2023, 3, 8, 7, 10, 0), datetime.datetime(2023, 3, 8, 7, 15, 0), datetime.datetime(2023, 3, 8, 7, 20, 0)]
}
trade_dict = {
    'Trade ID': ['Trade 1001', 'Trade 1002', 'Trade 1003', 'Trade 1004', 'Trade 1005'],
    'Trade Time': [datetime.datetime(2023, 3, 8, 7, 8, 0), datetime.datetime(2023, 3, 8, 7, 9, 0), datetime.datetime(2023, 3, 8, 7, 17, 0), datetime.datetime(2023, 3, 8, 7, 18, 0), datetime.datetime(2023, 3, 8, 7, 30, 0)]
}
order_df = pd.DataFrame(order_dict)
trade_df = pd.DataFrame(trade_dict)
What I expect is: for each row in order_df, look up all Trade IDs in trade_df whose Trade Time is within an arbitrary number of minutes (e.g. 3 mins) before or after the Order Time.
E.g. when the fixed distance = 3 mins, I would expect the output to be:
Order ID Order Time Nearest Trade ID
0 Order 1 2023-03-08 07:05:00 Trade 1001
1 Order 2 2023-03-08 07:10:00 Trade 1001, Trade 1002
2 Order 3 2023-03-08 07:15:00 Trade 1003, Trade 1004
3 Order 4 2023-03-08 07:20:00 Trade 1003, Trade 1004
I have made the following attempt:
def fixed_distance_nearest_neighbors(x, fixed_distance, lookup_df, lookup_field, return_field):
    lower = x - datetime.timedelta(minutes=fixed_distance)
    upper = x + datetime.timedelta(minutes=fixed_distance)
    return ', '.join(lookup_df[lookup_df[lookup_field].between(lower, upper)][return_field])

order_df['Nearest Trade ID'] = order_df['Order Time'].apply(lambda x: fixed_distance_nearest_neighbors(x, 3, trade_df, 'Trade Time', 'Trade ID'))
However, I believe this is not an efficient approach. Specifically, when order_df and trade_df grow large, the above method becomes extremely slow, e.g.
import random
import time
def generate_random_date(n):
    random_date = []
    for i in range(0, n):
        random_date.append(datetime.datetime(2023, 3, 8, random.randint(0, 23), random.randint(0, 59), random.randint(0, 59)))
    return random_date
n = 100000
order_dict = {
    'Order ID': ['Order ' + str(x) for x in range(1, n + 1)],
    'Order Time': generate_random_date(n)
}
trade_dict = {
    'Trade ID': ['Trade ' + str(x) for x in range(1, n + 1)],
    'Trade Time': generate_random_date(n)
}
order_df = pd.DataFrame(order_dict)
trade_df = pd.DataFrame(trade_dict)
start = time.time()
order_df['Nearest Trade ID'] = order_df['Order Time'].apply(lambda x: fixed_distance_nearest_neighbors(x, 3, trade_df, 'Trade Time', 'Trade ID'))
end = time.time()
print(end - start)
#Output 214.34553575515747
It takes about 3.5 minutes to complete the execution. I believe there is a more pythonic way to boost the speed. Thank you very much.
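As an aside, the pure-Python loop in generate_random_date is itself a minor bottleneck at n = 100000. A vectorized sketch (assuming numpy is available; generate_random_date_fast is a hypothetical name) that draws a random second-of-day per row, matching the loop version's uniform distribution over the day:

```python
import numpy as np
import pandas as pd

def generate_random_date_fast(n, seed=None):
    # Draw a random second-of-day per row and add it to the fixed base date.
    # rng.integers has an exclusive upper bound, so results stay within the day.
    rng = np.random.default_rng(seed)
    seconds = rng.integers(0, 24 * 60 * 60, size=n)
    return pd.Timestamp("2023-03-08") + pd.to_timedelta(seconds, unit="s")

dates = generate_random_date_fast(100_000, seed=0)
print(dates.min() >= pd.Timestamp("2023-03-08"))  # True
print(dates.max() < pd.Timestamp("2023-03-09"))   # True
```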
The following makes use of a BallTree.
The 10,000 generated order/trade example is processed in about 5 seconds. The reason it is 'slow' is that many trade/order pairs collide. If the date ranges are more spread out, and only a handful of trades are associated with an order, it is faster.
I show the code on the toy example; the exact same code was used to test the larger instance.
Starting from order_dict and trade_dict, I used the following to convert to unix time:
order = pd.DataFrame(order_dict)
trade = pd.DataFrame(trade_dict)
order["unix_time"] = (order["Order Time"] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
trade["unix_time"] = (trade["Trade Time"] - pd.Timestamp("1970-01-01")) // pd.Timedelta('1s')
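As a quick sanity check on this conversion (a minimal standalone sketch, independent of the answer's frames), the toy example's first order time maps to the unix_time value shown in the result table further down:

```python
import pandas as pd

# 2023-03-08 07:05:00 is the toy example's first Order Time.
ts = pd.Series(pd.to_datetime(["2023-03-08 07:05:00"]))

# Integer-divide the offset from the epoch by one second.
unix = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
print(int(unix.iloc[0]))  # 1678259100
```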
We prepare the data for the tree with:
from sklearn.neighbors import BallTree
import numpy as np
order_times = np.expand_dims(order.unix_time.values, axis=1)
trade_times = np.expand_dims(trade.unix_time.values, axis=1)
...and create and query our tree with:
tree = BallTree(trade_times, leaf_size=15, metric="manhattan")
idx, distance_in_seconds = tree.query_radius(order_times, r=60*3, sort_results=True, return_distance=True)
The above operation is the actual trade/order linking, and took about 5 seconds for 10,000 trades/orders. Building the tree also takes time, so if you know you are going to repeat this operation, try not to rebuild the tree when it isn't needed.
sort_results=True, return_distance=True makes sure the output is ordered, i.e. the closest trade/order comes first. The radius is 60*3 seconds. Then
order["Nearest Trades"] = idx
will give us:
Order ID Order Time unix_time Nearest Trades
0 Order 1 2023-03-08 07:05:00 1678259100 [0]
1 Order 2 2023-03-08 07:10:00 1678259400 [1, 0]
2 Order 3 2023-03-08 07:15:00 1678259700 [2, 3]
3 Order 4 2023-03-08 07:20:00 1678260000 [3, 2]
Notice that for Order 2, the closest trade is trade 1, followed by trade 0. Getting the exact names becomes a heavy operation for larger instances, so I would use indices instead of names.
You also have distance_in_seconds, which has the same shape as idx but, instead of indices, contains the distance in seconds for each corresponding index.
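If you do need the names, a minimal sketch of mapping the index arrays back to Trade IDs with numpy fancy indexing (the trade_ids array and idx list below are stand-ins for the real trade["Trade ID"] column and the query_radius result on the toy example):

```python
import numpy as np

# Stand-ins for trade["Trade ID"].to_numpy() and the idx returned
# by tree.query_radius for the toy example.
trade_ids = np.array(["Trade 1001", "Trade 1002", "Trade 1003", "Trade 1004", "Trade 1005"])
idx = [np.array([0]), np.array([1, 0]), np.array([2, 3]), np.array([3, 2])]

# Fancy indexing turns each per-order index array into the matching names,
# preserving the closest-first ordering.
nearest = [", ".join(trade_ids[ix]) for ix in idx]
print(nearest[1])  # Trade 1002, Trade 1001
```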
You can use how='cross' as a parameter of merge (if your DataFrames are not too large):
trades = (order_df.merge(trade_df, how='cross')
                  .loc[lambda x: x['Order Time'].sub(x['Trade Time']).abs().lt('3min')]
                  .groupby('Order ID')['Trade ID'].agg(', '.join))

order_df['Nearest Trade'] = order_df['Order ID'].map(trades).fillna('')
Output:
>>> order_df
Order ID Order Time Nearest Trade
0 Order 1 2023-03-08 07:05:00
1 Order 2 2023-03-08 07:10:00 Trade 1001, Trade 1002
2 Order 3 2023-03-08 07:15:00 Trade 1003
3 Order 4 2023-03-08 07:20:00 Trade 1004
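Note that lt('3min') is a strict comparison, which is why trades exactly 3 minutes away (e.g. Trade 1001 for Order 1) are missing here compared with the question's expected table; swapping in le('3min') makes the boundary inclusive. A minimal sketch of the difference:

```python
import pandas as pd

# Timedeltas of exactly 2 and exactly 3 minutes.
td = pd.Series([pd.Timedelta(minutes=2), pd.Timedelta(minutes=3)])

print(td.lt("3min").tolist())  # [True, False] -- strict, excludes the boundary
print(td.le("3min").tolist())  # [True, True]  -- inclusive, matches the expected output
```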