Suppose we have a dataframe of 100,000 rows and 3 columns, structured like so:
|   | visitorId | timestamp | pageUrl |
|---|-----------|-----------|---------|
| 1 | 1         | 11        | A       |
| 2 | 1         | 12        | B       |
| 3 | 2         | 21        | A       |
| 4 | 3         | 31        | A       |
| 5 | 3         | 32        | C       |
| … | …         | …         | …       |
| n | z         | Z1        | A       |
This dataframe is always sorted and stored in a variable called `sortedData`. First, we extract all the unique visitor IDs into a list called `visitors`, and create a `paths` list to hold each visitor's visited URLs.
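For reproducibility, a toy stand-in with the same shape can be built like this (the values are placeholders rather than my real data, and the URL column is named `pageUrl` to match the code below):

import pandas as pd

# tiny stand-in for the real, already-sorted dataframe
sortedData = pd.DataFrame({
    'visitorId': [1, 1, 2, 3, 3],
    'timestamp': [11, 12, 21, 31, 32],
    'pageUrl':   ['A', 'B', 'A', 'A', 'C'],
})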
visitors = sortedData.visitorId.unique()
visitors = visitors.tolist()
paths = []                      # will hold one (visitor, timestamps, path) tuple per visitor
visitor_length = len(visitors)
Now what I did was loop over each visitor in order to find and store that visitor's traversed path, their timestamps and ID, and use them as input to a secondary algorithm (not relevant here). I tried two approaches:
A:
for visitor in visitors:
    # all rows belonging to the current visitor
    dataResult = sortedData.loc[sortedData['visitorId'] == visitor]
    timestamps = dataResult.timestamp.tolist()
    path = dataResult.pageUrl.tolist()
    paths.append((visitor, timestamps, path))
    # drop the rows that were just processed so they are not searched again
    sortedData = sortedData.iloc[len(dataResult):]
This uses the built-in pandas `loc` indexer to find the rows with the matching `visitorId`, extracts the timestamps and URLs into lists, and finally appends them together. It then deletes the first x rows (equal to the length of the query result, i.e. the number of matches) so that they are not traversed again in later lookups. The process is repeated for every unique visitor in the `visitors` list.
By timing this, I found that it took about 6.31 seconds to complete. In my case this was an 11.7 MB file of 100,000 rows; on a 1.2 GB file this scaled up to 14 hours. As such, I tried way B hoping for a speedup.
Way B relies on the data always being sorted: visitor 1 in `visitors` will always be the 1st visitor in `sortedData`, visitor 2 the 2nd, and so on. So I could use the pandas `value_counts()` function to count the occurrences `x` of the current visitor, and simply extract the data from the first `x` rows using `head(x)`, since they will always match. That way it would not have to iterate and search through the whole dataframe every time. Then, as before, I remove those rows from the dataframe and repeat the loop for the next visitor.
B:
for visitor in visitors:
    # number of rows belonging to the current visitor
    x = sortedData.visitorId.value_counts()[visitor]
    timestamps = sortedData.timestamp.head(x).tolist()
    path = sortedData.pageUrl.head(x).tolist()
    paths.append((visitor, timestamps, path))
    # drop the rows that were just processed
    sortedData = sortedData.iloc[x:]
To my surprise, this made it almost twice as slow: a 10.89-second runtime compared to the 6.31 seconds of A.
Timing `loc` and `value_counts()` individually, it appeared that the latter was faster; however, when used in the loop the opposite is true.
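One way to reproduce that standalone comparison, run against the full frame before any rows are dropped (this is just an illustration of the kind of measurement I mean, not the exact code I used to time them):

import timeit

# one-off cost of each operation on the full dataframe, outside the loop
t_loc = timeit.timeit(lambda: sortedData.loc[sortedData['visitorId'] == visitors[0]], number=100)
t_vc = timeit.timeit(lambda: sortedData.visitorId.value_counts()[visitors[0]], number=100)
print(t_loc, t_vc)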
Considering that in B we know the positions of the visitors and only have to look at the first x rows of the dataframe, while in A we have to search the whole dataframe every time, what causes this difference in performance?
In a previous optimization (deleting the already traversed rows) the speedup was considerable, roughly doubling each time the dataframe size was halved, compared to leaving the dataframe whole. This made me suspect that way A iterates through the whole dataframe every time, unless I am missing something?
I am using a 2014 MacBook Pro with Python 3.6.4 (Anaconda), running in PyCharm 2018.
Creating your own list of visitors, then iterating over it and searching the dataframe again and again, is not ideal.
Have a look at `groupby`, which can be used in your situation, if I understand your question correctly.
To get code similar to what you have now, you can start like this:
grouped = sortedData.groupby('visitorId')
for visitorId, group in grouped:
    print(visitorId)
    # your code here
    custom_url_algorithm(group)
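If the end result you need is the same list of (visitor, timestamps, path) tuples, the whole thing becomes a single pass over the groups. A sketch along these lines, assuming the URL column is called `pageUrl` as in your code (`sort=False` keeps the visitors in their order of appearance, the same order `unique()` gives you):

paths = []
for visitorId, group in sortedData.groupby('visitorId', sort=False):
    # each group already contains only this visitor's rows, in their original order
    paths.append((visitorId, group.timestamp.tolist(), group.pageUrl.tolist()))

This splits the frame once, so neither the repeated boolean search of way A nor the repeated `value_counts()` of way B is needed.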