Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find closest float in array for all floats in another array

I have a performance issue while "filtering" an array according to the closest float found in another array.

This is a MWE of the problem:

import numpy as np

def random_data(N):
    # Generate some random data.
    return np.random.uniform(0., 10., N).tolist()

# Data lists.
N1 = 1500
list1 = [random_data(N1), random_data(N1), random_data(N1)]
list2 = random_data(1000)

# Define list1's range.
min_1, max_1 = min(list1[2]), max(list1[2])

# This list will contain the "filtered" list1.
list4 = [[], [], []]

# Go through each element in list2.
for elem2 in list2:

    # If it is located within the list1 range.
    if min_1 <= elem2 <= max_1:

        # Find the closest float in sub-list list1[2] to this float
        # in list2.
        indx, elem1 = min(enumerate(list1[2]), key=lambda x:abs(x[1]-elem2))

        # Store the values in list1 that are associated with the closest float
        # found above.
        list4[0].append(list1[0][indx])
        list4[1].append(list1[1][indx])
        list4[2].append(elem1)

(note that list2 contains fewer elements than list1[2], which is the sub-list I compare it to)

This block works as expected but it is terribly inefficient. I'm certain that the answer lies in the correct application of broadcasting and numpy arrays, but I still haven't managed to get the hang of the former enough to apply it to my problem.

Since I'm after enhancing the performance of this code any solution will do (ie: I'm not bound by an answer making use of broadcasting necessarily)


Add

As a reference, in this similar question I made some time ago Fast weighted euclidean distance between points in arrays, user ali_m used broadcasting to achieve an amazing improvement in performance.

The question is not exactly the same (euclidean distance there instead of absolute value and also the distances in that question had to be weighted) but this question seems to me even simpler that that one.

Couldn't the broadcasting solution ali_m applied to that issue be applied to this one?


Add 2

The answer given by user2357112 with the correction by Eelco Hoogendoorn works very good for my initially defined code. I just realized that I over-simplified it, in my actual code the lists list1[2] and list2 are not necessarily defined in the same range. This would be a more accurate representation of that (this should replace the first lines in the MWE above):

def random_data(N, xi, xf):
    # Generate some random data.
    return np.random.uniform(xi, xf, N).tolist()

# Data lists.
N1 = 1500
list1 = [random_data(N1, 13., 20.), random_data(N1, -1., 4.), random_data(N1, 2., 7.)]
list2 = random_data(1000, 0., 10.)

Now the range of list1[2] is not equal to the range for list2 and thus the answer given fails to reject those points i for which list2[i] > max(list1[2]) or list2[i] < min(list1[2]).

Could the answer be modified to have consider this possibility? I'm very sorry for changing the original code like this, it truly slipped by me.

like image 828
Gabriel Avatar asked Jan 17 '26 18:01

Gabriel


2 Answers

Kd-tree is really overkill here, all you need to do is sort the array and use binary search to find the closest value in the sorted array. I wrote an answer a while back about how to use searchsorted to find the closet value to a target in an array. You can use the same idea here:

import numpy as np

def find_closest(A, target):
    #A must be sorted
    idx = A.searchsorted(target)
    idx = np.clip(idx, 1, len(A)-1)
    left = A[idx-1]
    right = A[idx]
    idx -= target - left < right - target
    return idx

def random_data(shape):
    # Generate some random data.
    return np.random.uniform(0., 10., shape)

def main(data, target):
    order = data[2, :].argsort()
    key = data[2, order]
    target = target[(target >= key[0]) & (target <= key[-1])]
    closest = find_closest(key, target)
    return data[:, order[closest]]

N1 = 1500
array1 = random_data((3, N1))
array2 = random_data(1000)
array2[[10, 20]] = [-1., 100]

array4 = main(array1, array2)
like image 136
Bi Rico Avatar answered Jan 19 '26 18:01

Bi Rico


If you have SciPy, a scipy.spatial.cKDTree can do the job:

import numpy
import scipy.spatial

array1 = numpy.array(list1)
array2 = numpy.array(list2)

# A tree optimized for nearest-neighbor lookup
tree = scipy.spatial.cKDTree(array1[2, ..., numpy.newaxis])

# The distances from the elements of array2 to their nearest neighbors in
# array1, and the indices of those neighbors.
distances, indices = tree.query(array2[..., numpy.newaxis])

array4 = array1[:, indices]

k-d trees are designed for multidimensional data, so this might not be the fastest solution, but it should be pretty darn fast compared to what you have. The k-d tree expects input in the form of a 2D array of points, where data[i] is a 1D array representing the ith point, so the slicing expressions with newaxis are used to put the data into that format. If you need it to be even faster, you could probably do something with numpy.sort and numpy.searchsorted.

If you need to reject data from list2 that falls outside the range of values given by list1[2], that can be accomplished by a preprocessing step:

lowbound = array1[2].min()
highbound = array1[2].max()

querypoints = array2[(array2 >= lowbound) & (array2 <= highbound)]
distances, indices = tree.query(querypoints[..., numpy.newaxis])
like image 29
user2357112 supports Monica Avatar answered Jan 19 '26 19:01

user2357112 supports Monica