
Fastest way to find all indexes of matching values between two 1D arrays (with duplicates)

Question description

Let's say we have two simple arrays:

import numpy as np

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])

I want to find the indexes of all matching values between the query and match. So in this case the result would be:

value   query   match
100        0    1
100        0    3
100        5    1
100        5    3
4000       1    2

In reality, these arrays will contain millions of items.


"Stupid" loop solution

qs = []
query_locs = []
match_locs = []

for i, q in enumerate(query):
    # Indexes in "match" whose value equals the current query value
    match_loc = np.where(match == q)[0]
    n = match_loc.size
    # Record one (query index, match index) pair per hit
    match_locs.extend(match_loc)
    query_locs.extend(np.repeat(i, n))
    # Store the matching value once per hit
    qs.extend(np.repeat(q, n))

result = np.vstack((qs, query_locs, match_locs)).T
print(result)
[[ 100    0    1]
 [ 100    0    3]
 [4000    1    2]
 [ 100    5    1]
 [ 100    5    3]]

(Maybe numba could make this loop fast. However, when I tried it I got errors about the signatures, so I am not sure about that.)


NumPy built-ins

There are several built-in NumPy functions that solve this problem for unique values, such as searchsorted and intersect1d. However, as described in the docs, they "Return the sorted, unique values" and thus do not take duplicates into account. Some examples on StackOverflow for this problem with unique values:

  • NumPy: Comparing Elements in Two Arrays
  • Efficient way to compute intersecting values between two numpy arrays
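
To illustrate the limitation with the arrays above, here is a minimal sketch using np.intersect1d with return_indices=True. It reports each shared value once, together with the index of its first occurrence in each array, so the repeated 100s are lost:

```python
import numpy as np

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])

# intersect1d de-duplicates both inputs before intersecting them
common, query_idx, match_idx = np.intersect1d(query, match, return_indices=True)

print(common)     # shared values, sorted and de-duplicated
print(query_idx)  # index of the first occurrence of each shared value in query
print(match_idx)  # index of the first occurrence of each shared value in match
```

Only two pairs come back here, while the desired result has five rows.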

I could imagine there would be a faster way to do this with numpy instead of a loop, so curious to see an answer!

CodeNoob asked Oct 28 '25 07:10


1 Answer

You can convert the 1D arrays to DataFrames, using the values as the index and the positions as the column, and then join them on the index, like this:

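import numpy as np
import pandas as pd

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])
# Values become the index; original positions become the single column
dfquery = pd.DataFrame(range(len(query)), index=query, columns=['query'])
dfmatch = pd.DataFrame(range(len(match)), index=match, columns=['match'])
# The inner join pairs every query position with every match position of the same value
dfquery.join(dfmatch, how='inner')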

Result:

    query   match
100     0       1
100     0       3
100     5       1
100     5       3
4000    1       2
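
If pandas is not available, the same join can probably be done with NumPy alone. The sketch below is not part of the answer above; it assumes a sort plus np.searchsorted approach, and the function name match_indices is hypothetical. It sorts match once, finds the run of equal values for each query element, and expands those runs into index pairs without a Python loop:

```python
import numpy as np

def match_indices(query, match):
    """Return rows of (value, query index, match index) for every equal pair."""
    # Sort match once; order maps sorted positions back to original positions
    order = np.argsort(match, kind="stable")
    sorted_match = match[order]
    # Half-open range [left, right) of entries in sorted_match equal to each query value
    left = np.searchsorted(sorted_match, query, side="left")
    right = np.searchsorted(sorted_match, query, side="right")
    counts = right - left  # number of hits per query element
    # Repeat each query index once per hit
    query_locs = np.repeat(np.arange(query.size), counts)
    # Offsets 0..count-1 inside each run of hits
    starts = np.cumsum(counts) - counts
    within = np.arange(counts.sum()) - np.repeat(starts, counts)
    # Map sorted positions back to positions in the original match array
    match_locs = order[np.repeat(left, counts) + within]
    return np.column_stack((query[query_locs], query_locs, match_locs))

query = np.array([100, 4000, 500, 700, 400, 100])
match = np.array([6, 100, 4000, 100, 10, 8, 10])
result = match_indices(query, match)
print(result)
```

For the example arrays this gives the same five rows as the loop in the question, in query order. The cost is dominated by the sort and the two binary searches rather than a full scan of match per query element, which may matter for arrays with millions of items; actual timings have not been measured here.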

washolive answered Oct 30 '25 22:10


