I have a very large binary matrix (for example, shape = (210000, 5000)) and want to find the unique columns of this matrix along with their indices. Memory is a constraint, which is why I am looking for a method that is more memory-efficient than np.unique().
I found the code below in Find unique rows in numpy.array. This method is indeed memory-efficient (though slower, which matters less in my case); however, it finds unique rows (I need columns) and does not return the indices of those rows in the original matrix.
import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)

# This finds unique rows, without indices (but I need unique columns *with* their indices).
ncols = data.shape[1]
dtype = data.dtype.descr * ncols        # one structured field per column
struct = data.view(dtype)               # each row becomes a single structured element
uniq = np.unique(struct)                # unique whole rows
uniq = uniq.view(data.dtype).reshape(-1, ncols)
Could someone help me with this?
The desired output for the example above:

Matrix with unique columns:

[[1, 1, 0, 0, 0],
 [0, 1, 1, 0, 0],
 [0, 1, 1, 0, 0],
 [1, 1, 0, 0, 0],
 [1, 1, 1, 1, 0]]

Indices of unique columns:

[0, 1, 3, 4, 5]
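For reference, the structured-view trick from the question can be adapted to columns almost unchanged: apply it to a contiguous copy of the transpose, so each original column becomes one structured element, and ask np.unique for return_index=True. This is only a sketch; note that np.ascontiguousarray(data.T) materializes a full transposed copy, which may or may not be acceptable at shape (210000, 5000).

import numpy as np

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)

dataT = np.ascontiguousarray(data.T)    # view() below requires contiguous rows
nrows = data.shape[0]

# Each row of dataT (i.e. each original column) becomes one structured element.
struct = dataT.view(dataT.dtype.descr * nrows)
_, idx = np.unique(struct, return_index=True)
idx.sort()                              # restore first-occurrence order

print(idx)           # [0 1 3 4 5]
print(data[:, idx])  # the matrix restricted to its unique columns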
I'm not sure exactly what your performance target is, but you can reduce the amount of processing in your code by using pandas. I know the tolist() part slows things down, but it's a simple solution:
import numpy as np
import pandas as pd

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 1, 1, 0]], dtype=np.byte)

# Convert each column to a tuple: unlike lists, tuples are hashable,
# which pandas' duplicated() needs.
s = pd.Series([tuple(col) for col in data.T.tolist()])
# np.where returns a tuple of arrays; take its first element so that
# data[:, indices] keeps a clean 2-D shape.
indices = np.where(~s.duplicated(keep='first'))[0]
output = data[:, indices]
print(indices)
print(output)
Output:

[0 1 3 4 5]
[[1 1 0 0 0]
 [0 1 1 0 0]
 [0 1 1 0 0]
 [1 1 0 0 0]
 [1 1 1 1 0]]
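If memory is the real bottleneck, one further option (a sketch only, not taken from either snippet above) is to stream over the columns and hash each column's raw bytes, so that neither a transposed copy nor a list of tuples ever exists in memory, only a set of fixed-size digests and the index list. The helper name unique_column_indices is made up for illustration, and hash collisions, while astronomically unlikely with blake2b, are theoretically possible.

import hashlib
import numpy as np

def unique_column_indices(data):
    """Indices of the first occurrence of each distinct column,
    scanning one column at a time to keep peak memory low."""
    seen = set()
    keep = []
    for j in range(data.shape[1]):
        # tobytes() copies just one column (~210 kB for 210000 byte-sized rows)
        digest = hashlib.blake2b(data[:, j].tobytes()).digest()
        if digest not in seen:
            seen.add(digest)
            keep.append(j)
    return np.array(keep)

idx = unique_column_indices(data)
print(idx)           # [0 1 3 4 5]
print(data[:, idx])

This trades a tiny collision risk for storing one small digest per column instead of materializing every column as a Python object.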