Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to split data into groups in pyspark

I need to find groups in time series data.

Data sample

enter image description here

I need to output column group based on value and day.

I've tried using lag, lead and row_number but it ended up to nothing.

like image 815
paveltr Avatar asked Dec 18 '25 19:12

paveltr


1 Answers

It seems like you want to increment the group everytime the value changes. If so, this is a kind of gaps-and-islands problem.

Here is one approach that uses lag() and a cumulative sum():

select
    value,
    day,
    sum(case when value = lag_value then 0 else 1 end) over(order by day) grp
from (
    select t.*, lag(value) over(order by day) lag_value
    from mytable t
) t
like image 93
GMB Avatar answered Dec 20 '25 13:12

GMB



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!