
Finding unique columns and their indices in NumPy matrix efficiently

I have a very large binary matrix (for example, shape = (210000, 5000)) and want to find its unique columns and their indices in the original matrix. Memory matters to me, so I am looking for a method that is more memory-efficient than np.unique().

I found the code below in the question "Find unique rows in numpy.array". This method is indeed memory-efficient (though slower, which matters less in my case); however, it finds unique rows (I need columns) and does not return the indices of those rows in the original matrix.

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)
# This method finds unique rows without indices (but I need columns with indices of unique columns).
ncols = data.shape[1]
dtype = data.dtype.descr * ncols 
struct = data.view(dtype)
uniq = np.unique(struct)
uniq = uniq.view(data.dtype).reshape(-1, ncols)

Could someone help me with this?

The desired output for the example above is as follows.

Matrix with unique columns:

[[1, 1, 0, 0, 0],
 [0, 1, 1, 0, 0],
 [0, 1, 1, 0, 0],
 [1, 1, 0, 0, 0],
 [1, 1, 1, 1, 0]]

Indices of unique columns:

[0, 1, 3, 4, 5]
farid_musa asked Feb 03 '26 15:02


1 Answer

I'm not sure what performance you are aiming for, but you could shorten your code by using pandas. The tolist conversion slows things down, but it is a simple solution:

import numpy as np
import pandas as pd

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)

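# Lists are unhashable, so turn each column into a tuple before checking for duplicates.
s = pd.Series([tuple(col) for col in data.T.tolist()])
# np.flatnonzero returns a plain 1-D index array (np.where would return a tuple).
indices = np.flatnonzero(~s.duplicated(keep='first').to_numpy())
output = data[:, indices]

print(indices)
print(output)

Output:

[0 1 3 4 5]
[[1 1 0 0 0]
 [0 1 1 0 0]
 [0 1 1 0 0]
 [1 1 0 0 0]
 [1 1 1 1 0]]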
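
If memory is the main concern, one possible alternative is to stay in NumPy and shrink the data before deduplicating. The sketch below is only an illustration and assumes the matrix really contains nothing but 0s and 1s: it packs every 8 rows into one byte with np.packbits, views each packed column as a single opaque item (the same view trick as in the question, applied to columns), and asks np.unique for the first-occurrence indices. The function name first_unique_columns is made up for this example.

```python
import numpy as np


def first_unique_columns(data):
    """Return sorted indices of the first occurrence of each distinct column.

    Hypothetical helper for illustration; assumes `data` is a 2-D integer or
    boolean array that holds only 0s and 1s.
    """
    # Pack 8 binary entries into one byte along the row axis:
    # shape (n_rows, n_cols) -> (ceil(n_rows / 8), n_cols), dtype uint8.
    packed = np.packbits(data, axis=0)
    # Lay out each original column as one contiguous run of bytes.
    keys = np.ascontiguousarray(packed.T)
    # View every packed column as a single void item so that np.unique
    # compares whole columns instead of individual bytes.
    void_keys = keys.view(np.dtype((np.void, keys.shape[1]))).ravel()
    # return_index gives the first occurrence of each distinct item.
    _, first_idx = np.unique(void_keys, return_index=True)
    # np.unique orders results by value; sort to restore column order.
    return np.sort(first_idx)


data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)

idx = first_unique_columns(data)
unique_cols = data[:, idx]

print(idx)  # [0 1 3 4 5]
print(unique_cols)
```

For a (210000, 5000) matrix the packed array takes about 130 MB (26250 x 5000 bytes), plus one transposed copy of the same size, so no Python list of the full matrix is ever built. I have not benchmarked this against the pandas version, so treat it as a starting point.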
U12-Forward answered Feb 06 '26 03:02
