I am trying to measure message activity on a forum given the data for each individual message.
To do this, I want to know how many unique/different authors have posted messages in the last 24 hours, at every half-hour (9:00, 9:30, 10:00, ...).
I have a pandas DataFrame to record message information. A message has a post number, the time it was posted, and who wrote it. This is some of my data:
>>> # import pandas as pd
>>> # here df is a pd.DataFrame
>>> print df.loc[:, ['Message Timestamp','Message Author']]
Post# Message Timestamp Message Author
239257 2017-06-09 14:45:46-04:00 JTTLJTTLFBVTNJDF
239258 2017-06-09 14:09:51-04:00 Tvpfrnpvb22
239259 2017-06-09 13:54:13-04:00 Hpzb Tbxb
239260 2017-06-09 13:45:37-04:00 TbnFrbnTrbdfr
239261 2017-06-09 13:28:55-04:00 JTTLJTTLFBVTNJDF
239262 2017-06-09 13:20:23-04:00 njlftlj84
239263 2017-06-09 13:19:59-04:00 TbnFrbnTrbdfr
239264 2017-06-09 13:19:23-04:00 Vjtb Npvb Ttpdlt
239265 2017-06-09 13:15:03-04:00 njlftlj84
239266 2017-06-09 13:06:07-04:00 vndpnnpndfntt
239267 2017-06-09 12:48:54-04:00 TbnFrbnTrbdfr
239268 2017-06-09 12:16:59-04:00 Hrffn n Hpld
239269 2017-06-09 12:06:12-04:00 Xbllfr
239270 2017-06-09 11:27:33-04:00 TbttppfdTrbdfrFrz
239271 2017-06-09 11:21:46-04:00 ND`jn`BjhD
239272 2017-06-09 11:19:34-04:00 TbnFrbnTrbdfr
239273 2017-06-09 10:55:01-04:00 bbndpntfbdfll
239274 2017-06-09 10:55:01-04:00 JTTLJTTLFBVTNJDF
................(continued for years).................
For instance, using the above data, we see that user "JTTLJTTLFBVTNJDF" has posted at least three times in the last twenty four hours; that person only contributes 1 to the "number of unique authors in the last twenty four hours".
The desired output would look like the following (depending on the prior 24 hours of data):
>>> print some_function(df, past='24 hours', every='30 mins')
Index Number_of_unique_authors_in_the_last_24_hours
2017-06-09 15:00:00-04:00 12
2017-06-09 14:30:00-04:00 11
2017-06-09 14:00:00-04:00 13
...(and so forth)......
For clarity, this is saying at 15:00, there were 12 different people who had posted messages in the last twenty-four hours, based on the data.
I have tried different combinations of pd.Timestamp.ceil, groupby, and rolling, but nothing close to what I want. An experienced person might know the right combination of things to do.
Also, let me know if someone has a better title idea.
edit: I'm a bit surprised something like below doesn't work.
series = df.set_index('Message Timestamp')['Message Author']
series.resample('30 min').rolling('1D').nunique() #not supported
I know how to get a for-loop to do what I want, but it would be nice to find a pandas-powered way of doing things.
Consider resample to aggregate to 30 minute intervals and then run nunique. And then run a transform to conditionally aggregate the new unique count column for every 24 hours. You would need to first set the timestamp as index for resample and then back as regular column to aggregate to 24 hours.
import datetime
import pandas as pd
...
df = df[['Message Timestamp', 'Message Author']]
df['24-HourCount'] = df.transform(lambda x:\
len(df[(df['Message Timestamp'].between(x['Message Timestamp'] - datetime.timedelta(days=1),
x['Message Timestamp']))]['Message Author'].unique()), axis=1)
df = df.set_index('Message Timestamp')
df = df[['24-HourCount']].resample('30T').max()
print(df)
# 24-HourCount
# Message Timestamp
# 2017-06-09 14:30:00 2.0
# 2017-06-09 15:00:00 5.0
# 2017-06-09 15:30:00 NaN
# 2017-06-09 16:00:00 7.0
# 2017-06-09 16:30:00 7.0
# 2017-06-09 17:00:00 10.0
# 2017-06-09 17:30:00 11.0
# 2017-06-09 18:00:00 12.0
# 2017-06-09 18:30:00 12.0
I wasn't able to come up with some elegant solution, but a brutal force one by looping through the dataframe, hopefully that will work if your dataset is not huge:
time, unique_count = [], []
for i in range(len(df)):
time.append(t)
t = df.ix[i, 'Time']
#get the datetime of 24 hours ago
yesterday = t - timedelta(days=1)
#filter the original dataframe and count unique authors
count = len(df.ix[(df['Time']<=t) &
(df['Time']>=yesterday),'Author'].unique())
unique_count.append(count)
result = pd.DataFrame({'Time': time,
'Number_of_unique_authors':unique_count})
Waiting for someone to come up with a more elegant solution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With