I want to study a population of 47532 individuals with 16230 features, so I created a matrix with 16230 rows and 47532 columns:
>>> import numpy as np
>>> import scipy.cluster.hierarchy as hcluster
>>> from scipy.spatial import distance
>>> from sklearn.cluster import AgglomerativeClustering
>>> matrix.shape
(16230, 47532)
# remove all duplicate vectors in order to not waste computation time
>>> uniq_vectors, row_index = np.unique(matrix, return_index=True, axis=0)
>>> uniq_vectors.shape
(22957, 16230)
# compute the pairwise distances between observations
>>> distance_matrix = distance.pdist(uniq_vectors, metric='jaccard')
>>> distance_matrix_2d = distance.squareform(distance_matrix, force='tomatrix')
>>> distance_matrix_2d.shape
(22957, 22957)
# Perform linkage
>>> linkage = hcluster.linkage(distance_matrix, method='complete')
Now I can use scikit-learn to perform the clustering:
>>> model = AgglomerativeClustering(n_clusters=40, affinity='precomputed', linkage='complete')
>>> cluster_label = model.fit_predict(distance_matrix_2d)
How can I predict cluster labels for future observations using this model? AgglomerativeClustering does not have a predict method, and it would take too long to recompute the distances for 16230 x (47532 + 1).
Is it possible to compute the distance between new observations and all precomputed clusters? pdist from scipy computes the full n x n distances. In my case I would like to compute the distances from one observation o against the n samples, i.e. o x n.
Thanks for your insight.
The answer is simple: you cannot. Hierarchical clustering is not designed to predict cluster labels for new observations. The reason is that it only links data points according to their distances; it does not define "regions" for each cluster.
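On the narrower question of computing an o x n block of distances instead of the full n x n matrix, scipy.spatial.distance.cdist does that. The sketch below is only an illustration under stated assumptions: the arrays fitted, fitted_labels and new_obs are small hypothetical stand-ins for the real data, and giving a new observation the label of its nearest fitted point is a heuristic, not something the hierarchical clustering itself defines.

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)

# Hypothetical stand-ins: binary feature vectors that were already clustered,
# and the labels the hierarchical clustering gave them.
fitted = rng.integers(0, 2, size=(6, 8)).astype(bool)
fitted_labels = np.array([0, 0, 1, 1, 2, 2])

# Hypothetical new observations with the same number of features.
new_obs = rng.integers(0, 2, size=(2, 8)).astype(bool)

# cdist computes only the (o x n) block of distances, not the n x n matrix.
d = distance.cdist(new_obs, fitted, metric='jaccard')

# Heuristic (an assumption, not part of the clustering): give each new
# observation the label of its nearest already-clustered point.
nearest = d.argmin(axis=1)
new_labels = fitted_labels[nearest]
```

Here d has one row per new observation and one column per fitted observation, so the cost grows with o x n rather than with (n + o) squared.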
There are two solutions for you at this stage I believe:
KMeans could be a good choice, as it can explicitly assign new data points to the closest cluster.
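A minimal sketch of what that looks like with scikit-learn, assuming small random binary data in place of the real matrix (X, X_new and n_clusters=4 are illustrative choices, not values from the question). Note that KMeans works on the raw feature vectors with Euclidean distance, so it does not reuse the precomputed Jaccard distance matrix.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 60 observations with 8 binary features, stored as floats.
X = rng.integers(0, 2, size=(60, 8)).astype(float)
# Hypothetical new observations that arrive after fitting.
X_new = rng.integers(0, 2, size=(3, 8)).astype(float)

# Fit once on the existing observations.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
train_labels = model.fit_predict(X)

# Later, assign new observations to the closest learned centroid
# without refitting or recomputing any pairwise distances.
new_labels = model.predict(X_new)
```

The fitted centroids are kept in model.cluster_centers_, which is what makes the later predict call cheap.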