
Calculate time difference between consecutive rows in pairs per group in pyspark

I want to calculate the time spent per SeqID for each user. I have a dataframe like the one below, but the time for each user is split between two actions, Action_A and Action_B. The total time per user, per SeqID, should be the sum across all such Action_A/Action_B pairs.

For the first user, it is 5 + 3 [(2019-12-10 10:05:00 - 2019-12-10 10:00:00) + (2019-12-10 10:23:00 - 2019-12-10 10:20:00)]

So the first user has effectively spent 8 mins for SeqID 15 (and not 23 mins, the span from the first to the last timestamp).

Similarly, user 2 has spent 1 + 5 = 6 mins.

How can I calculate this using pyspark?

data = [(("ID1", 15, "2019-12-10 10:00:00", "Action_A")), 
        (("ID1", 15, "2019-12-10 10:05:00", "Action_B")),
        (("ID1", 15, "2019-12-10 10:20:00", "Action_A")),
        (("ID1", 15, "2019-12-10 10:23:00", "Action_B")),
        (("ID2", 23, "2019-12-10 11:10:00", "Action_A")),
        (("ID2", 23, "2019-12-10 11:11:00", "Action_B")),
        (("ID2", 23, "2019-12-10 11:30:00", "Action_A")),
        (("ID2", 23, "2019-12-10 11:35:00", "Action_B"))]
df = spark.createDataFrame(data, ["ID", "SeqID", "Timestamp", "Action"])
df.show()

+---+-----+-------------------+--------+
| ID|SeqID|          Timestamp|  Action|
+---+-----+-------------------+--------+
|ID1|   15|2019-12-10 10:00:00|Action_A|
|ID1|   15|2019-12-10 10:05:00|Action_B|
|ID1|   15|2019-12-10 10:20:00|Action_A|
|ID1|   15|2019-12-10 10:23:00|Action_B|
|ID2|   23|2019-12-10 11:10:00|Action_A|
|ID2|   23|2019-12-10 11:11:00|Action_B|
|ID2|   23|2019-12-10 11:30:00|Action_A|
|ID2|   23|2019-12-10 11:35:00|Action_B|
+---+-----+-------------------+--------+

Once I have the data for each pair, I can sum across the group (ID, SeqID)

Expected output (the duration could also be in seconds):

+---+-----+--------+
| ID|SeqID|Dur_Mins|
+---+-----+--------+
|ID1|   15|       8|
|ID2|   23|       6|
+---+-----+--------+
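
To make the pairing arithmetic concrete, here is the intended calculation in plain Python over the sample data list above (purely for illustration; the actual computation should be done in PySpark):

from datetime import datetime

def pair_minutes(rows):
    """Sum the (Action_B - Action_A) durations, in minutes, over consecutive pairs."""
    fmt = "%Y-%m-%d %H:%M:%S"
    times = sorted(datetime.strptime(ts, fmt) for _, _, ts, _ in rows)
    # pair element 0 with 1, element 2 with 3, ...
    return sum((b - a).total_seconds() / 60 for a, b in zip(times[0::2], times[1::2]))

print(pair_minutes([r for r in data if r[0] == "ID1"]))  # 8.0
print(pair_minutes([r for r in data if r[0] == "ID2"]))  # 6.0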

1 Answer

Here is a possible solution using Higher-Order Functions (Spark >=2.4):

transform_expr = "transform(ts_array, (x,i) -> (unix_timestamp(ts_array[i+1]) - unix_timestamp(x))/60 * ((i+1)%2))"

df.groupBy("ID", "SeqID").agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
    .withColumn("transformed_ts_array", expr(transform_expr)) \
    .withColumn("Dur_Mins", expr("aggregate(transformed_ts_array, 0D, (acc, x) -> acc + coalesce(x, 0D))")) \
    .drop("transformed_ts_array", "ts_array") \
    .show(truncate=False)

Steps:

  1. Collect all timestamps into an array per group (ID, SeqID) and sort it in ascending order.
  2. Apply transform to the array with a lambda function (x, i) => Double, where x is the element and i its index. For each timestamp we compute the difference to the next one and multiply it by (i+1)%2, so that only every other difference survives, pairing elements 2 by 2 (first with second, third with fourth, ...), since each pair is always an Action_A followed by its Action_B. For ID1 this gives [5.0, 0.0, 3.0, null] (the last element has no successor, hence null); see the inspection sketch after this list.
  3. Finally, aggregate the transformed array to sum its elements, coalescing null to 0.
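
To see what step 2 produces, keep the intermediate columns instead of dropping them (a minimal inspection sketch reusing the same transform_expr and imports as above):

# Keep the intermediate columns for inspection (same expressions as in the answer).
df.groupBy("ID", "SeqID") \
    .agg(array_sort(collect_list(col("Timestamp"))).alias("ts_array")) \
    .withColumn("transformed_ts_array", expr(transform_expr)) \
    .show(truncate=False)
# For ID1 / SeqID 15, transformed_ts_array is [5.0, 0.0, 3.0, null]:
# 5.0 and 3.0 are the Action_A -> Action_B durations, 0.0 is a masked
# between-pairs diff, and the trailing null comes from the out-of-range ts_array[i+1].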

Output:

+---+-----+--------+
|ID |SeqID|Dur_Mins|
+---+-----+--------+
|ID1|15   |8.0     |
|ID2|23   |6.0     |
+---+-----+--------+
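
If higher-order functions are not available (Spark < 2.4), a window-based variant can produce the same result. This is only a sketch, and it assumes every Action_A row is immediately followed by its matching Action_B row within the same (ID, SeqID) group:

from pyspark.sql import Window
from pyspark.sql import functions as F

w = Window.partitionBy("ID", "SeqID").orderBy("Timestamp")

(df.withColumn("ts", F.unix_timestamp("Timestamp"))
   .withColumn("prev_ts", F.lag("ts").over(w))          # timestamp of the previous row in the group
   .withColumn("prev_action", F.lag("Action").over(w))  # action of the previous row
   .where((F.col("Action") == "Action_B") & (F.col("prev_action") == "Action_A"))
   .withColumn("pair_mins", (F.col("ts") - F.col("prev_ts")) / 60)
   .groupBy("ID", "SeqID")
   .agg(F.sum("pair_mins").alias("Dur_Mins"))
   .show())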

