how to predict the cluster label of a new observation using a hierarchical clustering?

Question

I want to study a population of 47532 individuals with 16230 features. Thus I created a matrix with 16230 lines and 47532 columns

>>> import scipy.cluster.hierarchy as hcluster
>>> from scipy.spatial import distance
>>> import sklearn.cluster import AgglomerativeClustering
>>> matrix.shape
(16230, 47532)
# remove all duplicate vectors in order to not waste computation time
>>> uniq_vectors, row_index = np.unique(matrix, return_index=True, axis=0)
>>> uniq_vectors.shape
(22957, 16230)
# compute distance between each observations
>>> distance_matrix = distance.pdist(uniq_vectors, metric='jaccard')
>>> distance_matrix_2d = distance.squareform(distance_matrix, force='tomatrix')
>>> distance_matrix_2d.shape
(22957, 22957)
# Perform linkage
>>> linkage = hcluster.linkage(distance_matrix, method='complete')

So now I can use scikit-learn to perform a clustering

>>> model = AgglomerativeClustering(n_clusters=40, affinity='precomputed', linkage='complete')
>>> cluster_label = model.fit_predict(distance_matrix_2d)

How to predict future observations using this model ?

Indeed AgglomerativeClustering do not own a predict method and it will be too long to compute again the distance for 16230 x (47532 + 1)

Is it possible to compute a distance between new observations and all pre-computed cluster ?

Indeed the use of pdist from scipy will compute the distance n x n In my case I would like compute the distance from one observation o vs n samples o x n

Thanks for your highlight

MaximeKan · Accepted Answer

The answer is simple: you cannot. Hierarchical clustering is not designed to predict cluster labels for new observations. The reason why this is happening is because it just links data points according to their distances and it is not defining "regions" for each cluster.

There are two solutions for you at this stage I believe:

For new data points, find the nearest observation in your data set (using the same distance function as during the training) and assign the same cluster label. This requires a bit more coding, and obviously, it is a bit of a hack. But keep in mind that the results might not make a lot of sense as you will be extrapolating cluster labels using a different methodology than the training procedure.
Use another clustering algorithm! It seems like you are using hierarchical clustering when your use case does not match the model. KMeans could be a good choice, as it explicitly can assign new data points to the closest cluster.

how to predict the cluster label of a new observation using a hierarchical clustering?

Tags:

python-3.x

scipy

scikit-learn

bioinfornatics

1 Answers

MaximeKan

Recent Activity

Donate For Us

how to predict the cluster label of a new observation using a hierarchical clustering?

Tags:

python-3.x

scipy

scikit-learn

bioinfornatics

1 Answers

MaximeKan

Related questions

Recent Activity

Donate For Us