Euclidean distance calculation using Python and Dask

Question

I'm trying to identify elements of a Euclidean distance matrix that fall below a certain threshold. I then take the indices returned by this search and use them to compare elements of a second array (for the sake of demonstration this array is the first eigenvector of a PCA, but the search is the most relevant part of my question). The approach needs to work for an unknown number of observations, and should run efficiently on several million.

#

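import numpy as np
from scipy.spatial.distance import cdist

threshold = 10
# cdist needs 2-D input; the shape here (5000 observations, 3 features) is illustrative
data = np.random.uniform(1, 100, (5000, 3))

# Row and column indices of every pair closer than the threshold
searchValues = np.where(cdist(data, data) < threshold)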

#

My problem is twofold.

First, the Euclidean distance matrix quickly becomes too large to build with a single call to scipy.spatial.distance.cdist(). To work around this, I apply cdist in batches over the dataset and run the search iteratively.

#

cdist(data, data) 

Traceback (most recent call last):
  File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-fb93ae543712>", line 1, in <module>
    cdist(data, data)
  File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py", line 2142, in cdist
    dm = np.zeros((mA, mB), dtype=np.double)
MemoryError

#

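The second problem is a runtime issue that results from constructing the distance matrix iteratively. When I use my iterative approach, the runtime grows rapidly with the number of observations, since every batch has to be compared against every other batch and the number of batch pairs grows quadratically. This isn't unexpected given the nature of the iterative approach.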
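To make the growth concrete, here is a small sketch that counts the block comparisons, assuming 200,000 observations split into batches of 10,000 (the sizes used in this question). The variable names are illustrative only.

```python
from math import comb

n_obs = 200_000
batch_size = 10_000
n_batches = -(-n_obs // batch_size)  # ceiling division

# Pairs of distinct batches, which is what itertools.combinations(batches, 2) yields
off_diagonal_pairs = comb(n_batches, 2)
# Adding each batch compared with itself covers the full upper triangle of blocks
all_block_pairs = off_diagonal_pairs + n_batches

print(n_batches, off_diagonal_pairs, all_block_pairs)  # 20 190 210
```

Halving the batch size roughly quadruples the number of block pairs while each block becomes four times smaller, so the total number of distances stays on the order of the square of the number of observations.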

#

import numpy as np
import dask.array as da
from scipy.spatial.distance import cdist
import itertools
import timeit

threshold = 10
data = np.random.uniform(1, 100, (200000, 40))  # Build random data
data = da.asarray(data)

# Split the observations into batches of 10,000 rows
it = round(data.shape[0] / 10000)
dataArrays = [data[i*10000:(i+1)*10000] for i in range(0, it)]

# Every unordered pair of distinct batches (a batch is never compared with itself)
comparisons = itertools.combinations(dataArrays, 2)

start = timeit.default_timer()
searchvalues = []
for comparison in comparisons:
    # The indices returned here are local to the two batches being compared
    searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
elapsed = timeit.default_timer() - start
print(elapsed)

#
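One detail of the batched search is that the indices from each block are local to that block, and pairs inside a single batch are never checked. A minimal sketch of a variant that shifts indices back to global positions and also covers the within-batch blocks could look like the following. The function name `pairs_below_threshold` and the small sample sizes are assumptions made for illustration, not part of the original code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def pairs_below_threshold(data, threshold, batch_size):
    """Return global (row, col) indices with row < col and distance < threshold."""
    n = data.shape[0]
    rows, cols = [], []
    for i in range(0, n, batch_size):
        block_i = data[i:i + batch_size]
        # Distances are symmetric, so only blocks on or above the diagonal are needed
        for j in range(i, n, batch_size):
            block_j = data[j:j + batch_size]
            mask = cdist(block_i, block_j) < threshold
            if i == j:
                # Within a batch, keep each pair once and drop self-pairs
                mask = np.triu(mask, k=1)
            r, c = np.nonzero(mask)
            rows.append(r + i)  # shift block-local indices to global positions
            cols.append(c + j)
    return np.concatenate(rows), np.concatenate(cols)

rng = np.random.default_rng(0)
sample = rng.uniform(1, 100, (500, 3))  # small hypothetical dataset
rows, cols = pairs_below_threshold(sample, threshold=10, batch_size=128)
```

Only one block of distances is held in memory at a time, so peak memory is governed by `batch_size` rather than by the total number of observations.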

Neither of these issues is unexpected given the nature of the problem. To offset both, I've tried using Dask, both as a framework for handling large data in Python and as a way to parallelize the batch process. However, this hasn't significantly improved the computation time, and I have a pretty strict memory limitation with this iterative method in Dask (it requires taking in batches of 1000 observations at a time).

import dask
from dask.diagnostics import ProgressBar

@dask.delayed
def eucDist(comparison):
    # cdist already returns a NumPy array
    return cdist(comparison[0], comparison[1])

@dask.delayed
def findValues(euclideanMatrix):
    return np.where(euclideanMatrix < threshold)

# combinations() is a one-shot iterator, so rebuild it before looping again
comparisons = itertools.combinations(dataArrays, 2)

start = timeit.default_timer()
test = []
for comparison in comparisons:
    # eucDist is already delayed by its decorator; calling it only builds a task
    comp = eucDist(comparison)
    test.append(comp)

look = []

with ProgressBar():
    for element in test:
        # Each compute() call blocks until that single comparison has finished
        look.append(findValues(element).compute())

I'm hoping that I can parallelize the comparisons to increase my speed, but I'm not sure how to implement that in Python. Any help with that, or any recommendations for improving the initial comparison code, would be appreciated.

Rahul Sharma · Accepted Answer

You can calculate the Euclidean distance in Dask by using dask_distance.euclidean(x, y) from the separate dask-distance package.
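Separately from dask_distance, the batched comparison in the question can be parallelized with plain dask.delayed by building all the tasks first and calling dask.compute once, instead of calling .compute() on each task inside a loop. The following is only a sketch under stated assumptions: the helper `block_matches`, the batch size, and the small random dataset are hypothetical and chosen so the example runs quickly.

```python
import itertools
import numpy as np
import dask
from scipy.spatial.distance import cdist

threshold = 10
batch_size = 250  # hypothetical value, kept small so the sketch runs quickly

rng = np.random.default_rng(0)
data = rng.uniform(1, 100, (1000, 3))
starts = list(range(0, data.shape[0], batch_size))

@dask.delayed
def block_matches(block_a, block_b, offset_a, offset_b):
    """Threshold one block pair and return global row and column indices."""
    r, c = np.nonzero(cdist(block_a, block_b) < threshold)
    return r + offset_a, c + offset_b

# Calling the delayed function only records a task; nothing is computed yet
tasks = [
    block_matches(data[i:i + batch_size], data[j:j + batch_size], i, j)
    for i, j in itertools.combinations(starts, 2)
]

# A single compute call hands every task to the scheduler at once
results = dask.compute(*tasks)
```

Each task returns only the matching indices, so the full distance block is discarded as soon as the task finishes. Like the loop in the question, this covers pairs of distinct batches only. How much speedup this gives depends on the scheduler and on how many blocks fit in memory simultaneously.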

Tags:

python

numpy

euclidean-distance

dask

dask-delayed

Quinn Anderson
