
Inexpensive way to add time series intensity in python pandas dataframe

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:

Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:

My brute-force attempt:

import pandas as ps
import math
import numpy as np

person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)

starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]    
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1:#numeric place holder for non-contributing
                return 0.0*row[0]    
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates

Plots of weight being held over time might look like the following:

[image not available]

And the sum of the intensities over time might look like the black line in the following:

[image not available]

with the black line defined by the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with: print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort(), but I can't come up with a slick way of getting the corresponding intensity values.

I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO link that I might have missed, I'd appreciate the help!

In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.

Edit--Implementation of mgab's provided solution:

import pandas as ps
import math
import numpy as np

person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))

TypeError: unsupported operand type(s) for -: 'str' and 'int'

End Edit


2 Answers

Going with the piano keys example, let's assume you have three keys, with 30 levels of intensity.

I would try to keep the data in this format:

import pandas as pd
df = pd.DataFrame([[10,'A',5],
                   [10,'B',7],
                   [13,'C',10],
                   [15,'A',15],
                   [20,'A',7],
                   [23,'C',0]], columns=["time", "key", "intensity"])

   time   key  intensity
0    10     A          5
1    10     B          7
2    13     C         10
3    15     A         15
4    20     A          7
5    23     C          0

where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs

df[df.key=="A"].drop('key', axis=1)

   time  intensity
0    10          5
3    15         15
4    20          7

Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)

df["increment"]=df.groupby("key")["intensity"].transform(
                             lambda x: x.sub(x.shift(), fill_value= 0 ))
df

   time key  intensity  increment
0    10   A          5          5
1    10   B          7          7
2    13   C         10         10
3    15   A         15         10
4    20   A          7         -8
5    23   C          0        -10
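
The same per-key change can also be written with `groupby(...).diff()`. This is only a sketch of an equivalent form, not part of the original answer: `diff()` leaves NaN on each key's first row, and filling those with the row's own intensity assumes every key starts from 0 before its first recorded event.

```python
import pandas as pd

df = pd.DataFrame([[10, 'A', 5],
                   [10, 'B', 7],
                   [13, 'C', 10],
                   [15, 'A', 15],
                   [20, 'A', 7],
                   [23, 'C', 0]], columns=["time", "key", "intensity"])

# diff() within each key gives the change from that key's previous row;
# the first row of each key has no predecessor, so it comes back as NaN.
increment = df.groupby("key")["intensity"].diff()

# Assumption: every key is at intensity 0 before its first recorded event,
# so its first change equals its first recorded intensity.
df["increment"] = increment.fillna(df["intensity"]).astype(int)
```

Both forms rely on the rows of each key being in time order, as they are in the table above.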

And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates

df.groupby("time")["increment"].sum().cumsum()

time
10      12
13      22
15      32
20      24
23      14
dtype: int64
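
If the staircase corner points from the question are wanted rather than one value per time, the (time, total) pairs can be expanded so that each jump contributes a point before and after it. A minimal sketch in plain Python; `step_points` is a hypothetical helper name, and the starting level of 0 is an assumption:

```python
def step_points(totals, start_level=0):
    """Turn (time, total) pairs into the corner points of a step plot."""
    points = []
    level = start_level  # assumed total before the first event
    for time, total in totals:
        if total != level:
            points.append((time, level))  # corner just before the jump
            points.append((time, total))  # corner just after the jump
            level = total
    return points

# The (time, total) pairs from the cumulative sum above.
totals = [(10, 12), (13, 22), (15, 32), (20, 24), (23, 14)]
corners = step_points(totals)
```

With a pandas Series, the pairs can be taken from `series.items()`. For plotting only, passing `drawstyle="steps-post"` to the Series `plot` call should draw the same staircase without building the corner points by hand.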

EDIT: applying the specific data presented in question

I am assuming each record comes as a list of values: first the element id (person/piano key), then a factor that multiplies the measured weights/intensities for that element, and then pairs of time values marking the start and end of each known state (weight being carried/intensity being emitted). I'm not sure I got the data format right. From your question:

data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]

And if we know the weight/intensity of each one of the states, we can define:

known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]

Then, the easiest way I came up with to load the data uses this function:

import pandas as pd

def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in range(len(states)):
        j = 2+2*i  # index of this state's start time; its end time is at j+1
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)

Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.

Then, you load the data:

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([read_data(data1, known_states, DF_columns),
                read_data(data2, known_states, DF_columns),
                read_data(data3, known_states, DF_columns)],  # and so on...
               ignore_index=True)

And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
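
One reading of the loader above is that each row it emits is already a signed change (plus at a state's start, minus at its end) rather than a new intensity level. Under that assumption the rows can be summed per time and accumulated directly, without recomputing increments. A self-contained sketch with the question's data; `to_events` and the `change` column name are hypothetical, and the record layout is the one assumed in this answer:

```python
import pandas as pd

def to_events(record, states):
    """Hypothetical helper: one (time, id, change) row per state start and end."""
    ident, factor = record[0], record[1]
    rows = []
    for i, state in enumerate(states):
        start, end = record[2 + 2 * i], record[3 + 2 * i]
        if start != end:  # skip zero-length states
            rows.append((start, ident, factor * state))   # state switches on
            rows.append((end, ident, -factor * state))    # state switches off
    return rows

records = [['person1', 3, 0.0, 10.0, 10.0, 10.0, 10.0, 10.0],
           ['person2', 4, 0, 20, 20, 25, 25, 40],
           ['person3', 5, 0, 5, 5, 15, 15, 40]]
known_states = [5, 10, 15]

events = pd.DataFrame(
    [row for rec in records for row in to_events(rec, known_states)],
    columns=["time", "id", "change"])

# Net change at each distinct time, then the running total over time.
total = events.groupby("time")["change"].sum().cumsum()
```

Because every state that switches on also switches off, the running total should return to 0 at the last time point, which is a quick sanity check on the input.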

answered by mgab, Dec 13 '25 20:12


Appears to be what .sum() is for:

In [10]:

allPeopleDf.sum()
Out[10]:
aStart     0
aEnd      35
bStart    35
bEnd      50
cStart    50
cEnd      90
dtype: int32

answered by CT Zhu, Dec 13 '25 20:12


