I have a NumPy structured array that is sorted by the first column:
x = array([(2, 3), (2, 8), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])
I need to merge records (sum the values of the second column) where
x[n][0] == x[n + 1][0]
In this case, the desired output would be:
x = array([(2, 11), (4, 1)], dtype=[('recod', '<u8'), ('count', '<u4')])
What's the best way to achieve this?
You can use np.unique to get an ID for each element of the first column, then use np.bincount to sum the second-column values that share an ID (shown here on a plain 2D array A):
In [140]: A
Out[140]:
array([[25,  1],
       [37,  3],
       [37,  2],
       [47,  1],
       [59,  2]])
In [141]: unqA,idx = np.unique(A[:,0],return_inverse=True)
In [142]: np.column_stack((unqA,np.bincount(idx,A[:,1])))
Out[142]:
array([[ 25.,   1.],
       [ 37.,   5.],
       [ 47.,   1.],
       [ 59.,   2.]])
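Note that the result above is floating point, because np.bincount returns float64 whenever weights are supplied. A minimal sketch of the same approach that casts the sums back to the input's integer dtype (this assumes the sums stay small enough to be represented exactly as float64, and the name merged is only illustrative):

```python
import numpy as np

A = np.array([[25, 1], [37, 3], [37, 2], [47, 1], [59, 2]])

# One ID per row: rows with the same first-column value share an ID.
keys, idx = np.unique(A[:, 0], return_inverse=True)

# bincount with weights yields float64, so cast back to the input dtype.
sums = np.bincount(idx, weights=A[:, 1]).astype(A.dtype)

merged = np.column_stack((keys, sums))
print(merged)
```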
You can avoid np.unique by combining np.diff and np.cumsum. This may help because np.unique sorts internally, which is unnecessary here since the input data is already sorted. The implementation would look something like this:
In [201]: A
Out[201]:
array([[25,  1],
       [37,  3],
       [37,  2],
       [47,  1],
       [59,  2]])
In [202]: unq1 = np.append(True,np.diff(A[:,0])!=0)
In [203]: np.column_stack((A[:,0][unq1],np.bincount(unq1.cumsum()-1,A[:,1])))
Out[203]:
array([[ 25.,   1.],
       [ 37.,   5.],
       [ 47.,   1.],
       [ 59.,   2.]])
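As a variation on the sorted-input idea (not the method shown above, but a swapped-in alternative), np.add.reduceat can sum each run of equal keys directly and keeps the integer dtype. This is a sketch that assumes a non-empty, already sorted input; the variable names are illustrative:

```python
import numpy as np

A = np.array([[25, 1], [37, 3], [37, 2], [47, 1], [59, 2]])
keys = A[:, 0]

# True wherever a new run of equal keys begins (the first row always does).
is_start = np.append(True, keys[1:] != keys[:-1])
starts = np.flatnonzero(is_start)

# Sum the second column over each run [starts[i], starts[i+1]).
sums = np.add.reduceat(A[:, 1], starts)

merged = np.column_stack((keys[starts], sums))
print(merged)
```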
Divakar's answer cast in structured array form:
In [500]: x=np.array([(25, 1), (37, 3), (37, 2), (47, 1), (59, 2)], dtype=[('recod', '<u8'), ('count', '<u4')])
Find the unique values and sum the counts of the duplicates:
In [501]: unqA, idx=np.unique(x['recod'], return_inverse=True)
In [502]: cnt = np.bincount(idx, x['count'])
Make a new structured array and fill the fields:
In [503]: x1 = np.empty(unqA.shape, dtype=x.dtype)
In [504]: x1['recod'] = unqA
In [505]: x1['count'] = cnt
In [506]: x1
Out[506]:
array([(25, 1), (37, 5), (47, 1), (59, 2)],
      dtype=[('recod', '<u8'), ('count', '<u4')])
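The steps above can be wrapped in a small helper and applied to the array from the question. This is only a sketch: merge_records is a hypothetical name, and it assumes the field names 'recod' and 'count' from the question's dtype.

```python
import numpy as np

def merge_records(x):
    """Sum 'count' over rows that share the same 'recod' (hypothetical helper)."""
    keys, idx = np.unique(x['recod'], return_inverse=True)
    sums = np.bincount(idx, weights=x['count'])  # float64 sums
    out = np.empty(keys.shape, dtype=x.dtype)
    out['recod'] = keys
    out['count'] = sums  # cast to the field's dtype on assignment
    return out

x = np.array([(2, 3), (2, 8), (4, 1)],
             dtype=[('recod', '<u8'), ('count', '<u4')])
result = merge_records(x)
print(result)
```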
There is a recarray function that builds an array from a list of arrays:
In [507]: np.rec.fromarrays([unqA,cnt],dtype=x.dtype)
Out[507]:
rec.array([(25, 1), (37, 5), (47, 1), (59, 2)],
          dtype=[('recod', '<u8'), ('count', '<u4')])
Internally it does the same thing: it builds an empty array of the right size and dtype, and then loops over the dtype fields. A recarray is just a structured array in a specialized array subclass wrapper.
There are two ways of populating a structured array (especially with a diverse dtype) - with a list of tuples as you did with x, and field by field.
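A short sketch of the two ways side by side, using the dtype from the question (the names a, b and dt are illustrative):

```python
import numpy as np

dt = np.dtype([('recod', '<u8'), ('count', '<u4')])

# 1) From a list of tuples, one tuple per record.
a = np.array([(2, 11), (4, 1)], dtype=dt)

# 2) Field by field, starting from a preallocated array.
b = np.zeros(2, dtype=dt)
b['recod'] = [2, 4]
b['count'] = [11, 1]

# Both routes produce the same records.
same = bool((a == b).all())
print(same)
```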