I'm looking for solutions to speed up a function I have written to loop through a pandas dataframe and compare column values between the current row and the previous row.
As an example, this is a simplified version of my problem:
   User  Time                 Col1  newcol1  newcol2  newcol3  newcol4
0     1     6     [cat, dog, goat]        0        0        0        0
1     1     6         [cat, sheep]        0        0        0        0
2     1    12        [sheep, goat]        0        0        0        0
3     2     3          [cat, lion]        0        0        0        0
4     2     5  [fish, goat, lemur]        0        0        0        0
5     3     9           [cat, dog]        0        0        0        0
6     4     4          [dog, goat]        0        0        0        0
7     4    11                [cat]        0        0        0        0
At the moment I have a function which loops through and calculates values for 'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1. It also looks at the first value in the arrays stored in 'Col1' and 'Col2' and updates 'newcol3' and 'newcol4' if these values have changed since the previous row.
Here's the pseudo-code for what I'm doing currently (since I've simplified the problem I haven't tested this but it's pretty similar to what I'm actually doing in ipython notebook):
 def myJFunc(df):
...     #initialize jnum counter
...     jnum = 0;
...     #loop through each row of dataframe (not including the first/zeroeth)
...     for i in range(1,len(df)):
...             #has user changed?
...             if df.User.loc[i] == df.User.loc[i-1]:
...                     #has time increased by more than 1 (hour)?
...                     if abs(df.Time.loc[i]-df.Time.loc[i-1])>1:
...                             #update new columns
...                             df['newcol2'].loc[i-1] = 1;
...                             df['newcol1'].loc[i] = 1;
...                             #increase jnum
...                             jnum += 1;
...                     #has content changed?
...                     if df.Col1.loc[i][0] != df.Col1.loc[i-1][0]:
...                             #record this change
...                             df['newcol4'].loc[i-1] = [df.Col1.loc[i-1][0], df.Col2.loc[i][0]];
...             #different user?
...             elif df.User.loc[i] != df.User.loc[i-1]:
...                     #update new columns
...                     df['newcol1'].loc[i] = 1; 
...                     df['newcol2'].loc[i-1] = 1;
...                     #store jnum elsewhere (code not included here) and reset jnum
...                     jnum = 1;
I now need to apply this function to several million rows and it's impossibly slow so I'm trying to figure out the best way to speed it up. I've heard that Cython can increase the speed of functions but I have no experience with it (and I'm new to both pandas and python). Is it possible to pass two rows of a dataframe as arguments to the function and then use Cython to speed it up or would it be necessary to create new columns with "diff" values in them so that the function only reads from and writes to one row of the dataframe at a time, in order to benefit from using Cython?  Any other speed tricks would be greatly appreciated!
(As regards using .loc, I compared .loc, .iloc and .ix and this one was marginally faster so that's the only reason I'm using that currently)
(Also, my User column in reality is unicode not int, which could be problematic for speedy comparisons)
The compare method in pandas shows the differences between two DataFrames. It compares two data frames, row-wise and column-wise, and presents the differences side by side. The compare method can only compare DataFrames of the same shape, with exact dimensions and identical row and column labels.
Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.
The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.
During data analysis, one might need to compute the difference between two rows for comparison purposes. This can be done using pandas. DataFrame. diff() function.
I was thinking along the same lines as Andy, just with groupby added, and I think this is complementary to Andy's answer.  Adding groupby is just going to have the effect of putting a NaN in the first row whenever you do a diff or shift.  (Note that this is not an attempt at an exact answer, just to sketch out some basic techniques.)
df['time_diff'] = df.groupby('User')['Time'].diff()
df['Col1_0'] = df['Col1'].apply( lambda x: x[0] )
df['Col1_0_prev'] = df.groupby('User')['Col1_0'].shift()
   User  Time                 Col1  time_diff Col1_0 Col1_0_prev
0     1     6     [cat, dog, goat]        NaN    cat         NaN
1     1     6         [cat, sheep]          0    cat         cat
2     1    12        [sheep, goat]          6  sheep         cat
3     2     3          [cat, lion]        NaN    cat         NaN
4     2     5  [fish, goat, lemur]          2   fish         cat
5     3     9           [cat, dog]        NaN    cat         NaN
6     4     4          [dog, goat]        NaN    dog         NaN
7     4    11                [cat]          7    cat         dog
As a followup to Andy's point about storing objects, note that what I did here was to extract the first element of the list column (and add a shifted version also). Doing it like this you only have to do an expensive extraction once and after that can stick to standard pandas methods.
Use pandas (constructs) and vectorize your code i.e. don't use for loops, instead use pandas/numpy functions.
'newcol1' and 'newcol2' based on whether the 'User' has changed since the previous row and also whether the difference in the 'Time' values is greater than 1.
Calculate these separately:
df['newcol1'] = df['User'].shift() == df['User']
df.ix[0, 'newcol1'] = True # possibly tweak the first row??
df['newcol1'] = (df['Time'].shift() - df['Time']).abs() > 1
It's unclear to me the purpose of Col1, but general python objects in columns doesn't scale well (you can't use fast path and the contents are scattered in memory). Most of the time you can get away with using something else...
Cython is the very last option, and not needed in 99% of use-cases, but see enhancing performance section of the docs for tips.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With