I have a very large binary matrix (for example, shape = (210000, 5000)) and want to find the unique columns of this matrix along with their indices. Memory is a constraint, which is why I am looking for a method that is more memory-efficient than np.unique().
I found the code below in Find unique rows in numpy.array. This method is indeed memory-efficient (though slower, which matters less in my case); however, it finds unique rows (I need columns) and does not return the indices of those rows in the original matrix.
import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)

# This finds unique rows, without indices (but I need unique columns *with* their indices).
ncols = data.shape[1]
dtype = data.dtype.descr * ncols        # one structured field per column
struct = data.view(dtype)               # each row becomes a single structured element
uniq = np.unique(struct)                # unique whole rows
uniq = uniq.view(data.dtype).reshape(-1, ncols)
Could someone help me with this?
The desired output for the example above:

Matrix with unique columns:

[[1, 1, 0, 0, 0],
 [0, 1, 1, 0, 0],
 [0, 1, 1, 0, 0],
 [1, 1, 0, 0, 0],
 [1, 1, 1, 1, 0]]

Indices of unique columns:

[0, 1, 3, 4, 5]
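For reference, the structured-view trick from the question can be adapted to columns almost unchanged: apply it to a contiguous copy of the transpose, so each original column becomes one structured element, and ask np.unique for return_index=True. This is only a sketch; note that np.ascontiguousarray(data.T) materializes a full transposed copy, which may or may not be acceptable at shape (210000, 5000).

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)

dataT = np.ascontiguousarray(data.T)    # view() below requires contiguous rows
nrows = data.shape[0]

# Each row of dataT (i.e. each original column) becomes one structured element.
struct = dataT.view(dataT.dtype.descr * nrows)
_, idx = np.unique(struct, return_index=True)
idx.sort()                              # restore first-occurrence order

print(idx)           # [0 1 3 4 5]
print(data[:, idx])  # the matrix restricted to its unique columns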
I'm not sure exactly what your performance target is, but you can reduce the amount of processing in your code by using pandas. I know the tolist() part slows things down, but it's a simple solution:
import numpy as np
import pandas as pd

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)

# Convert each column to a tuple: unlike lists, tuples are hashable,
# which pandas' duplicated() needs.
s = pd.Series([tuple(col) for col in data.T.tolist()])
# np.where returns a tuple of arrays; take its first element so that
# data[:, indices] keeps a clean 2-D shape.
indices = np.where(~s.duplicated(keep='first'))[0]
output = data[:, indices]
print(indices)
print(output)
Output:

[0 1 3 4 5]
[[1 1 0 0 0]
 [0 1 1 0 0]
 [0 1 1 0 0]
 [1 1 0 0 0]
 [1 1 1 1 0]]
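If memory is the real bottleneck, one further option (a sketch only, not taken from either snippet above) is to stream over the columns and hash each column's raw bytes, so that neither a transposed copy nor a list of tuples ever exists in memory, only a set of fixed-size digests and the index list. The helper name unique_column_indices is made up for illustration, and hash collisions, while astronomically unlikely with blake2b, are theoretically possible.

import hashlib
import numpy as np

def unique_column_indices(data):
    """Indices of the first occurrence of each distinct column,
    scanning one column at a time to keep peak memory low."""
    seen = set()
    keep = []
    for j in range(data.shape[1]):
        # tobytes() copies just one column (~210 kB for 210000 byte-sized rows)
        digest = hashlib.blake2b(data[:, j].tobytes()).digest()
        if digest not in seen:
            seen.add(digest)
            keep.append(j)
    return np.array(keep)

idx = unique_column_indices(data)
print(idx)           # [0 1 3 4 5]
print(data[:, idx])

This trades a tiny collision risk for storing one small digest per column instead of materializing every column as a Python object.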