I'm attempting to identify elements in the euclidean distance matrix that fall under a certain threshold. I then take the positional arguments for this search and use them to compare elements in a second array (for sake of demonstration this array is the first eigenvector of PCA, but the sort is the most relevant part for my question). The application needs to be applicable for an unknown number of observations, but should run effectively on several million.
#import numpy as np
from scipy.spatial.distance import cdist
threshold = 10
data = np.random.uniform((1, 2, 3), 5000)
searchValues = np.where(cdist(data, data) < threshold)
#
My problem is two fold.
Firstly the euclidean distance matrix quickly becomes too large for simply applying scipy.spatial.distance.cdist(). To solve this issue I apply the cdist function in batches over the dataset and implement the search iteratively.
#cdist(data, data)
Traceback (most recent call last):
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2862, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-10-fb93ae543712>", line 1, in <module>
cdist(data, data)
File "C:\Users\tl928yx\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\spatial\distance.py", line 2142, in cdist
dm = np.zeros((mA, mB), dtype=np.double)
MemoryError
#
The second problem is a runtime issue that results from constructing distance matrix iteratively. When I institute my iterative approach the runtime increases exponentially. This isn't unexpected due to the nature of the iterative approach.
#import numpy as np
import dask.array as da
from scipy.spatial.distance import cdist
import itertools
import timeit
threshold = 10
data = np.random.uniform(1, 100, (200000,40)) #Build random data
data = da.asarray(data)
it = round(data.shape[0]/10000)
dataArrays = [data[i*10000:(i+1)*10000] for i in range(0, it)]
comparisons = itertools.combinations(dataArrays, 2)
start = timeit.default_timer()
searchvalues = []
for comparison in comparisons:
searchvalues.append(np.where(cdist(comparison[0], comparison[1]) < threshold))
time = timeit.default_timer() - start
print(time)
#
Neither of these issues are unexpected due to the nature of the problem. To try and offset both problems I've tried using dask to implement both a large data framework in python, and insert parallelization in the batch process. However, this hasn't resulted in a significant improvement in the time calculation, and I have a pretty strict memory limitation with this iterative method in dask (requiring taking in batches of 1000 obs at a time.
from dask.diagnostics import ProgressBar
import dask.delayed
import dask.bag
@dask.delayed
def eucDist(comparison):
return da.asarray(cdist(comparison[0], comparison[1]))
@dask.delayed
def findValues(euclideanMatrix):
return np.where(euclideanMatrix < threshold)
start = timeit.default_timer()
searchvalues = []
test = []
for comparison in comparisons:
comp = dask.delayed(eucDist)(comparison)
test.append(comp)
look = []
with ProgressBar():
for element in test:
look.append(dask.delayed(findValues)(element).compute())
I'm hoping that I can parallelize the comparisons to increase my speed, but I'm not sure how to implement that in python. Any help with that, or any recommendations for how I can improve the initial comparison code would be appreciated.
You can calculate the Euclidean distance in Dask by using dask_distance.euclidean(x,y).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With