I am trying to compute nearest neighbour clustering on a Scipy sparse matrix returned from scikit-learn's DictVectorizer. However, when I try to compute the distance matrix with scikit-learn I get an error message using 'euclidean' distance through both pairwise.euclidean_distances and pairwise.pairwise_distances. I was under the impression that scikit-learn could calculate these distance matrices.
My matrix is highly sparse with a shape of: <364402x223209 sparse matrix of type <class 'numpy.float64'>
    with 728804 stored elements in Compressed Sparse Row format>.
I have also tried methods such as pdist and kdtree in Scipy but have received other errors of not being able to process the result.
Can anyone please point me to a solution that would effectively allow me calculate the distance matrix and/or the nearest neighbour result?
Some example code:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import pairwise
import scipy.spatial
file = 'FileLocation'
data = []
FILE = open(file,'r')
for line in FILE:
    templine = line.strip().split(',')
    data.append({'user':str(int(templine[0])),str(int(templine[1])):int(templine[2])})
FILE.close()
vec = DictVectorizer()
X = vec.fit_transform(data)
result = scipy.spatial.KDTree(X)
Error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/kdtree.py", line 227, in __init__
    self.n, self.m = np.shape(self.data)
ValueError: need more than 0 values to unpack
Similarly, if I run:
scipy.spatial.distance.pdist(X,'euclidean')
I get the following:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 1169, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/scipy/spatial/distance.py", line 113, in _convert_to_double
    X = X.astype(np.double)
ValueError: setting an array element with a sequence.
Finally, running NearestNeighbor in scikit-learn results in a memory error using:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='brute')
First, you can't use KDTree and pdist with sparse matrix, you have to convert it to dense (your choice whether it's your option):
>>> X
<2x3 sparse matrix of type '<type 'numpy.float64'>'
        with 4 stored elements in Compressed Sparse Row format>
>>> scipy.spatial.KDTree(X.todense())
<scipy.spatial.kdtree.KDTree object at 0x34d1e10>
>>> scipy.spatial.distance.pdist(X.todense(),'euclidean')
array([ 6.55743852])
Second, from the docs:
Efficient brute-force neighbors searches can be very competitive for small data samples. However, as the number of samples N grows, the brute-force approach quickly becomes infeasible.
You might want to try 'ball_tree' algorithm and see if it can handle your data.
From your comment:
Since it is a sparse matrix, I would expect there to be solutions to intelligently calculate the distances and store the result in a similarly sparse matrix.
Basic math shows that this is only possible in the case that your input matrix contains a massive number of duplicates, because Euclidean distance is only zero for two exactly equal points (this is actually one of the axioms of distance). So if you remove duplicates this might work.
Otherwise, depending on your problem, you might be able to use sklearn.metrics.pairwise_distances_argmin_min or cosine similarity, X * X.T, which has the reverse ordering compared to Euclidean distance.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With