I have two arrays like this:
A = [[111, ...],          B = [[222, ...],
     [222, ...],               [111, ...],
     [333, ...],               [333, ...],
     [555, ...]]               [444, ...],
                               [555, ...]]
Where the first column contains identifiers and the remaining columns some data, where the number of columns of B is much larger than the number of columns of A. The identifiers are unique. The number of rows in A can be less than in B, so that in some cases empty spacer rows would be necessary.
I am looking for an efficient way to match the rows of matrix A to matrix B, so that that the result would look like that:
A = [[222, ...],
     [111, ...],
     [333, ...],
     [nan, nan], #could be any unused value
     [555, ...]]
I could just sort both matrices or write a for loop, but both approaches seem clumsy... Are there better implementations?
Here's a vectorized approach using np.searchsorted -
# Store the sorted indices of A
sidx = A[:,0].argsort()
# Find the indices of col-0 of B in col-0 of sorted A
l_idx = np.searchsorted(A[:,0],B[:,0],sorter = sidx)
# Create a mask corresponding to all those indices that indicates which indices
# corresponding to B's col-0 match up with A's col-0
valid_mask = l_idx != np.searchsorted(A[:,0],B[:,0],sorter = sidx,side='right')
# Initialize output array with NaNs. 
# Use l_idx to set rows from A into output array. Use valid_mask to select 
# indices from l_idx and output rows that are to be set.
out = np.full((B.shape[0],A.shape[1]),np.nan)
out[valid_mask] = A[sidx[l_idx[valid_mask]]]
Please note that valid_mask could also be created using np.in1d : np.in1d(B[:,0],A[:,0]) for a more intuitive answer. But, we are using np.searchsorted as that's better in terms of performance as also disscused in greater detail in this other solution.
Sample run -
In [184]: A
Out[184]: 
array([[45, 11, 86],
       [18, 74, 59],
       [30, 68, 13],
       [55, 47, 78]])
In [185]: B
Out[185]: 
array([[45, 11, 88],
       [55, 83, 46],
       [95, 87, 77],
       [30,  9, 37],
       [14, 97, 98],
       [18, 48, 53]])
In [186]: out
Out[186]: 
array([[ 45.,  11.,  86.],
       [ 55.,  47.,  78.],
       [ nan,  nan,  nan],
       [ 30.,  68.,  13.],
       [ nan,  nan,  nan],
       [ 18.,  74.,  59.]])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With