I am trying to compute the mean of two datasets, each identified by the value in a certain column; here it is the column AA2. The trivial solution is to first identify the dataset, then compute the mean over it. However, this doesn't look nice in Python. Is there a way NumPy could do this for me?
My dataset:
Number AA1 AA2 AA3 Atom amou mean_shift stddev
187    ALA GLU LEU C    1    119.47     0.00
187    ALA GLU LEU O    1    8.42       0.00
188    ALA GLU LYS C    1    120.67     0.00
188    ALA GLU LYS O    1    9.11       0.00
777    ARG GLN ARG C    1    117.13     0.00
777    ARG GLN ARG O    1    8.48       0.00
What I want:
187 GLU C 1 (119.47+120.67+117.13)/3 0.00
187 GLU O 1 (8.42+9.11+8.48)/3 0.00
Edit: I cleared up the example. The mean is computed over the column mean_shift, but only over those rows where the atom is the same. My (not so nice) version of this is:
i, j = 0, 0
# iterate over all keys
for j in range(1, len(data_one)):
    key = data_two[j][3]
    aminoacid = data_two[j][5]
    print key, aminoacid
    stop
    keyeddata = []
    for i in range(1, len(data_one)):
        if (data_one[i][2] == key):
            keyeddata.append(data_one[i])
    print mean(keyeddata[6])
cheers, and thanks
You can do it easily with structured arrays, like this:
import numpy as np
# Test data
data = [
    (187, "ALA", "GLU", "LEU", "C", 1, 119.47, 0.00),
    (187, "ALA", "GLU", "LEU", "O", 1, 8.42, 0.00),
    (188, "ALA", "GLU", "LYS", "C", 1, 120.67, 0.00),
    (188, "ALA", "GLU", "LYS", "O", 1, 9.11, 0.00),
    (777, "ARG", "GLN", "ARG", "C", 1, 117.13, 0.00),
    (777, "ARG", "GLN", "ARG", "O", 1, 8.48, 0.00),
]
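# Structure definition
# Note: 'a3' / 'a1' are fixed-width byte strings. On Python 3, use 'U3' / 'U1'
# instead so that comparisons against str values such as 'GLN' work.
my_dtype = [
    ('Number', 'i4'),
    ('AA1', 'a3'),
    ('AA2', 'a3'),
    ('AA3', 'a3'),
    ('Atom', 'a1'),
    ('amou', 'i4'),
    ('mean', 'f4'),
    ('stddev', 'f4')
]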
a = np.array(data, dtype = my_dtype)
Now, with that a array, you can easily extract groups. First, let's find out the unique elements for a certain attribute:
>>> np.unique(a['AA2'])
array(['GLN', 'GLU'],
dtype='|S3')
Now you can group the data by matching on the attribute. E.g.:
# This gives you a mask
>>> a['AA2'] == 'GLN'
array([False, False, False, False, True, True], dtype=bool)
# that you can apply to the array itself
>>> a[a['AA2'] == 'GLN']
array([(777, 'ARG', 'GLN', 'ARG', 'C', 1, 117.12999725341797, 0.0),
(777, 'ARG', 'GLN', 'ARG', 'O', 1, 8.4799995422363281, 0.0)],
dtype=[('Number', '<i4'), ('AA1', '|S3'), ('AA2', '|S3'), ('AA3', '|S3'),
('Atom', '|S1'), ('amou', '<i4'), ('mean', '<f4'), ('stddev', '<f4')])
From there you can apply any calculation to an arbitrary attribute. Say, a mean of means:
>>> gln = a[a['AA2'] == 'GLN']
>>> gln['mean'].mean()
62.805000305175781
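To get closer to what the question asks for (one mean per atom), the same masking idea can be applied once per unique value. This is only a minimal sketch: it assumes Python 3, so the string fields are declared as 'U3' / 'U1' rather than the 'a3' / 'a1' used above, and the name atom_means is made up for illustration.

```python
import numpy as np

# Same rows as the test data above.
data = [
    (187, "ALA", "GLU", "LEU", "C", 1, 119.47, 0.00),
    (187, "ALA", "GLU", "LEU", "O", 1, 8.42, 0.00),
    (188, "ALA", "GLU", "LYS", "C", 1, 120.67, 0.00),
    (188, "ALA", "GLU", "LYS", "O", 1, 9.11, 0.00),
    (777, "ARG", "GLN", "ARG", "C", 1, 117.13, 0.00),
    (777, "ARG", "GLN", "ARG", "O", 1, 8.48, 0.00),
]

# Assumption: unicode string fields ('U3', 'U1') so that comparisons
# with plain str values work on Python 3.
my_dtype = [
    ('Number', 'i4'), ('AA1', 'U3'), ('AA2', 'U3'), ('AA3', 'U3'),
    ('Atom', 'U1'), ('amou', 'i4'), ('mean', 'f4'), ('stddev', 'f4'),
]
a = np.array(data, dtype=my_dtype)

# Build one boolean mask per distinct atom and average the 'mean'
# field over the rows selected by that mask.
atom_means = {}
for atom in np.unique(a['Atom']):
    mask = a['Atom'] == atom
    atom_means[str(atom)] = float(a['mean'][mask].mean())

for atom, value in atom_means.items():
    print(atom, round(value, 2))
```

With the six test rows this should give roughly 119.09 for C and 8.67 for O, up to float32 rounding.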
Edit:
Now, to select data matching more than one criterion, keep in mind the previous a['AA2'] == 'GLN' example:
>>> a['Atom'] == 'C'
array([ True, False, True, False, True, False], dtype=bool)
>>> np.logical_and(a['Atom'] == 'C', a['AA2'] == 'GLN')
array([False, False, False, False, True, False], dtype=bool)
# Which of course would give us the only row that fits:
>>> a[np.logical_and(a['Atom'] == 'C', a['AA2'] == 'GLN')]
array([(777, 'ARG', 'GLN', 'ARG', 'C', 1, 117.12999725341797, 0.0)], ...)
You will probably want to do some combinatorics on the criteria (using itertools or similar) to automate the process, and you may also want to have a look at the NumPy documentation to see the available logic functions.
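One possible way to automate the combinations is sketched below. It is not the only approach. It assumes Python 3 (hence 'U3' / 'U1' string fields), and the names fields and group_means are hypothetical. itertools.product enumerates every combination of unique values, and np.logical_and.reduce combines the per-field masks.

```python
import itertools
import numpy as np

# Same rows as the test data above.
data = [
    (187, "ALA", "GLU", "LEU", "C", 1, 119.47, 0.00),
    (187, "ALA", "GLU", "LEU", "O", 1, 8.42, 0.00),
    (188, "ALA", "GLU", "LYS", "C", 1, 120.67, 0.00),
    (188, "ALA", "GLU", "LYS", "O", 1, 9.11, 0.00),
    (777, "ARG", "GLN", "ARG", "C", 1, 117.13, 0.00),
    (777, "ARG", "GLN", "ARG", "O", 1, 8.48, 0.00),
]

# Assumption: unicode string fields so comparisons with str work on Python 3.
my_dtype = [
    ('Number', 'i4'), ('AA1', 'U3'), ('AA2', 'U3'), ('AA3', 'U3'),
    ('Atom', 'U1'), ('amou', 'i4'), ('mean', 'f4'), ('stddev', 'f4'),
]
a = np.array(data, dtype=my_dtype)

# Hypothetical choice of grouping fields; any subset of field names works.
fields = ('AA2', 'Atom')

group_means = {}
# Try every combination of the unique values found in each grouping field.
for values in itertools.product(*(np.unique(a[f]) for f in fields)):
    # AND together one mask per (field, value) pair.
    mask = np.logical_and.reduce([a[f] == v for f, v in zip(fields, values)])
    if mask.any():  # skip combinations that match no rows
        key = tuple(str(v) for v in values)
        group_means[key] = float(a['mean'][mask].mean())

for key, value in sorted(group_means.items()):
    print(key, round(value, 3))
```

For many groups or large arrays, looping over every combination becomes wasteful, since most combinations may be empty; the sketch only aims to show how the masks compose.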