I'm trying to run a custom kmeans clustering algorithm and am having trouble getting the document frequency for each column(term) of a 2-d numpy array by cluster. My current algorithm has two numpy arrays, a raw dataset that lists the documents by terms [2000L,9500L] and one that is the clustering assignment [2000L,]. There are 5 clusters. What I need to do is create an array that lists the document frequency for each cluster - basically a count in each column where the column number matches a row number in a different array. The output will be a [5L, 9500L] array (clusters x terms). I'm having trouble finding a way to do the equivalent of a countif and group by. Here is some sample data and the output I would like if I ran it with only 2 clusters:
import numpy as np
dataset = np.array[[1,2,0,3,0],[0,2,0,0,3],[4,5,2,3,0],[0,0,2,3,0]]
clusters = np.array[0,1,1,0]
#run code here to get documentFrequency
print documentFrequency
>> [1,1,1,2,0],[1,2,1,1,1]
my thoughts would be to select out the specific rows that match each cluster, because then counting should be easy. For example, if I could split the data into the following arrays:
cluster0 = np.array[[1,2,0,3,0],[0,0,2,3,0]]
cluster1 = np.array[[0,2,0,0,3],[4,5,2,3,0]]
Any direction or pointers would be much appreciated!
I don't think there is any easy way to vectorize your code, but if you have only a few clusters you could do the obvious:
>>> cluster_count = np.max(clusters)+1
>>> doc_freq = np.zeros((cluster_count, dataset.shape[1]), dtype=dataset.dtype)
>>> for j in xrange(cluster_count):
... doc_freq[j] = np.sum(dataset[clusters == j], axis=0)
...
>>> doc_freq
array([[1, 2, 2, 6, 0],
[4, 7, 2, 3, 3]])
As @Jaime says, if you have only a few clusters it makes sense to use the usual trick of manually looping over the smallest axis length. Often that gets you most of the benefits of full vectorization with a lot less of the headache that comes with being clever.
That said, when you find yourself wanting groupby, you're often in a domain in which a higher-level tool like pandas comes in very handy:
>>> pd.DataFrame(dataset).groupby(clusters).sum()
0 1 2 3 4
0 1 2 2 6 0
1 4 7 2 3 3
And you can easily fall back to an ndarray if needed:
>>> pd.DataFrame(dataset).groupby(clusters).sum().values
array([[1, 2, 2, 6, 0],
[4, 7, 2, 3, 3]])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With