Nicer way to compute means of a set with numpy

Question

I am trying to compute the mean of two datasets that are identified by a certain column, here the column AA2. The trivial solution is to first identify each dataset and then compute the mean over it. However, this doesn't look nice in Python. Is there a way NumPy could do this for me?

My dataset:

   Number       AA1   AA2 AA3   Atom     amou   mean_shift          stddev
   187            ALA GLU LEU   C             1         119.47           0.00
   187            ALA GLU LEU   O             1           8.42           0.00
   188            ALA GLU LYS   C             1         120.67           0.00
   188            ALA GLU LYS   O             1           9.11           0.00
   777            ARG GLN ARG   C             1         117.13           0.00
   777            ARG GLN ARG   O             1           8.48           0.00

What I want:

   187             GLU    C             1        (119.47+120.67+117.13)/3     0.00
   187             GLU    O             1        (8.42+9.11+8.48)/3           0.00

Edit: I cleared up the example. The mean is computed over the column mean_shift, but only over those rows where the atom is the same. My (not so nice) version of this is:

i,j = 0,0
# iterate over all keys
for j in range(1, len(data_one)):
        key = data_two[j][3]
        aminoacid = data_two[j][5]
        print key, aminoacid
        stop
        keyeddata=[]
        for i in range(1, len(data_one)):
                if (data_one[i][2]==key):
                        keyeddata.append(data_one[i])
                print mean(keyeddata[6])

cheers, and thanks

Ricardo Cárdenes · Accepted Answer

You can do it easily with structured arrays, like this:

import numpy as np

# Test data
data = [
   (187, "ALA","GLU", "LEU", "C", 1, 119.47, 0.00),
   (187, "ALA","GLU", "LEU", "O", 1, 8.42, 0.00),
   (188, "ALA","GLU", "LYS", "C", 1, 120.67, 0.00),
   (188, "ALA","GLU", "LYS", "O", 1, 9.11, 0.00),
   (777, "ARG","GLN", "ARG", "C", 1, 117.13, 0.00),
   (777, "ARG","GLN", "ARG", "O", 1, 8.48, 0.00),
   ]

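# Structure definition
# 'S3' is a 3-byte string; the older alias 'a3' was removed in NumPy 2.0.
# On Python 3, use 'U3'/'U1' instead so that comparisons against
# str literals such as 'GLN' work.
my_dtype = [
    ('Number',  'i4'),
    (  'AA1',   'S3'),
    (  'AA2',   'S3'),
    (  'AA3',   'S3'),
    ( 'Atom',   'S1'),
    ( 'amou',   'i4'),
    ( 'mean',   'f4'),
    ( 'stddev', 'f4')
           ]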

a = np.array(data, dtype = my_dtype)

Now, with the array a, you can easily extract groups. First, let's find the unique values of a certain attribute:

>>> np.unique(a['AA2'])
array(['GLN', 'GLU'], 
      dtype='|S3')

Now you can group the data by matching on that attribute. For example:

# This gives you a mask
>>> a['AA2'] == 'GLN'
array([False, False, False, False,  True,  True], dtype=bool)
# that you can apply to the array itself
>>> a[a['AA2'] == 'GLN']
array([(777, 'ARG', 'GLN', 'ARG', 'C', 1, 117.12999725341797, 0.0),
       (777, 'ARG', 'GLN', 'ARG', 'O', 1, 8.4799995422363281, 0.0)], 
      dtype=[('Number', '<i4'), ('AA1', '|S3'), ('AA2', '|S3'), ('AA3', '|S3'),
             ('Atom', '|S1'), ('amou', '<i4'), ('mean', '<f4'), ('stddev', '<f4')])

From there you can apply any calculation to an arbitrary attribute. Say, a mean of means:

>>> gln = a[a['AA2'] == 'GLN']
>>> gln['mean'].mean()
62.805000305175781
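
To get one mean per group rather than for a single hand-picked value, the mask can be built inside a loop over np.unique. The following is only a sketch, assuming Python 3 (hence the 'U3'/'U1' unicode dtypes), 'f8' floats for tidier output, and a grouping on the Atom column as described in the question's edit; the name group_means is made up for illustration.

```python
import numpy as np

# Same test data as above
data = [
    (187, "ALA", "GLU", "LEU", "C", 1, 119.47, 0.00),
    (187, "ALA", "GLU", "LEU", "O", 1, 8.42, 0.00),
    (188, "ALA", "GLU", "LYS", "C", 1, 120.67, 0.00),
    (188, "ALA", "GLU", "LYS", "O", 1, 9.11, 0.00),
    (777, "ARG", "GLN", "ARG", "C", 1, 117.13, 0.00),
    (777, "ARG", "GLN", "ARG", "O", 1, 8.48, 0.00),
]

# Assumption: unicode string fields so that == against str works on Python 3
my_dtype = [
    ('Number', 'i4'), ('AA1', 'U3'), ('AA2', 'U3'), ('AA3', 'U3'),
    ('Atom', 'U1'), ('amou', 'i4'), ('mean', 'f8'), ('stddev', 'f8'),
]
a = np.array(data, dtype=my_dtype)

# One mean of the 'mean' column per distinct atom
group_means = {}
for atom in np.unique(a['Atom']):
    mask = a['Atom'] == atom          # boolean mask for this group
    group_means[str(atom)] = float(a['mean'][mask].mean())

for atom, value in group_means.items():
    print(atom, round(value, 2))
```

Swapping 'Atom' for 'AA2' in the loop groups by the middle residue instead.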

Edit: Now, to select data by more than one criterion, keep in mind the previous a['AA2'] == 'GLN' example:

>>> a['Atom'] == 'C'
array([ True, False,  True, False,  True, False], dtype=bool)
>>> np.logical_and(a['Atom'] == 'C', a['AA2'] == 'GLN')
array([False, False, False, False,  True, False], dtype=bool)

# Which of course would give us the only row that fits:
>>> a[np.logical_and(a['Atom'] == 'C', a['AA2'] == 'GLN')]
array([(777, 'ARG', 'GLN', 'ARG', 'C', 1, 117.12999725341797, 0.0)], ...)
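
For boolean arrays, the & operator is an element-wise equivalent of np.logical_and. A small sketch, using plain arrays (atom, aa2 and mean are stand-ins for the columns of a) to keep it short; note that each comparison needs its own parentheses, because & binds more tightly than ==.

```python
import numpy as np

# Stand-ins for the 'Atom', 'AA2' and 'mean' columns of the array above
atom = np.array(['C', 'O', 'C', 'O', 'C', 'O'])
aa2 = np.array(['GLU', 'GLU', 'GLU', 'GLU', 'GLN', 'GLN'])
mean = np.array([119.47, 8.42, 120.67, 9.11, 117.13, 8.48])

# Function form, as used in the answer
mask_func = np.logical_and(atom == 'C', aa2 == 'GLN')

# Operator form; the parentheses around each comparison are required
mask_op = (atom == 'C') & (aa2 == 'GLN')

# Both masks select the same single row
selected = mean[mask_op]
print(selected)
```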

You will probably want to do some combinatorics on the criteria (using itertools or similar) to automate the process, and you may also want to look at the logic functions available in NumPy.
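
One way the combinatorics could look is sketched below: itertools.product enumerates every (AA2, Atom) pair, and a combined mask picks out the matching rows. This assumes Python 3 with unicode dtypes ('U3'/'U1') and 'f8' floats; the name combined_means is made up for illustration, and pairs that match no rows are skipped.

```python
import itertools
import numpy as np

data = [
    (187, "ALA", "GLU", "LEU", "C", 1, 119.47, 0.00),
    (187, "ALA", "GLU", "LEU", "O", 1, 8.42, 0.00),
    (188, "ALA", "GLU", "LYS", "C", 1, 120.67, 0.00),
    (188, "ALA", "GLU", "LYS", "O", 1, 9.11, 0.00),
    (777, "ARG", "GLN", "ARG", "C", 1, 117.13, 0.00),
    (777, "ARG", "GLN", "ARG", "O", 1, 8.48, 0.00),
]

# Assumption: unicode string fields so that == against str works on Python 3
my_dtype = [
    ('Number', 'i4'), ('AA1', 'U3'), ('AA2', 'U3'), ('AA3', 'U3'),
    ('Atom', 'U1'), ('amou', 'i4'), ('mean', 'f8'), ('stddev', 'f8'),
]
a = np.array(data, dtype=my_dtype)

# Mean of the 'mean' column for every (AA2, Atom) combination that occurs
combined_means = {}
pairs = itertools.product(np.unique(a['AA2']), np.unique(a['Atom']))
for aa2, atom in pairs:
    mask = (a['AA2'] == aa2) & (a['Atom'] == atom)
    if mask.any():                    # skip combinations with no rows
        combined_means[(str(aa2), str(atom))] = float(a['mean'][mask].mean())

for key, value in sorted(combined_means.items()):
    print(key, round(value, 3))
```

For large tables, a loop like this makes one pass over the array per combination, so a dedicated group-by tool may be a better fit.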
