I have contiguous periods of NaN values. I want to count NaN values from periods of contiguous NaN values, and also i want the start and end date of the contiguos period of NaN values.
df :
TMIN
2017-01-01 00:00:00 2.5
2017-01-02 00:00:00 NaN
2017-01-03 00:00:00 NaN
2017-01-04 00:00:00 2.2
2018-01-01 00:00:00 NaN
2018-01-02 00:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-04 00:00:00 5.0
2019-01-01 00:00:00 9.0
2019-01-02 00:00:00 8.0
2019-01-03 00:00:00 2.0
2019-01-04 00:00:00 NaN
2020-01-01 00:00:00 NaN
2020-01-02 00:00:00 NaN
2020-01-03 00:00:00 1.0
2020-01-04 00:00:00 NaN
Expected results :
Start_Date End date number of contiguous missing values
2017-01-02 00:00:00 2017-01-03 00:00:00 2
2018-01-01 00:00:00 2018-01-03 00:00:00 3
2019-01-04 00:00:00 2019-01-04 00:00:00 1
2020-01-01 00:00:00 2020-01-02 00:00:00 2
2020-01-04 00:00:00 2020-01-04 00:00:00 1
How can i solve this? Thanks in advance.
CODE:
s = df.index.to_series()
m1, m2 = s.diff().dt.days.eq(1), df['TMIN'].isna()
out = s[m2].groupby([(~m1).cumsum(), (~m2).cumsum()])\
.agg(['first', 'last', 'count']).reset_index(drop=True)
DETAILS:
Create boolean masks m1 and m2 such that m1 represents the condition where the difference between successive dates is 1 days and m2 represent the condition where the column TMIN contains NaN values:
>>> m1
2017-01-01 False
2017-01-02 True
2017-01-03 True
2017-01-04 True
2018-01-01 False
2018-01-02 True
2018-01-03 True
2018-01-04 True
2019-01-01 False
2019-01-02 True
2019-01-03 True
2019-01-04 True
2020-01-01 False
2020-01-02 True
2020-01-03 True
2020-01-04 True
dtype: bool
>>> m2
2017-01-01 False
2017-01-02 True
2017-01-03 True
2017-01-04 False
2018-01-01 True
2018-01-02 True
2018-01-03 True
2018-01-04 False
2019-01-01 False
2019-01-02 False
2019-01-03 False
2019-01-04 True
2020-01-01 True
2020-01-02 True
2020-01-03 False
2020-01-04 True
Name: TMIN, dtype: bool
Use cumsum with the above boolean masks to identify the blocks of contiguous dates and NaN values:
>>> (~m1).cumsum()
2017-01-01 1
2017-01-02 1
2017-01-03 1
2017-01-04 1
2018-01-01 2
2018-01-02 2
2018-01-03 2
2018-01-04 2
2019-01-01 3
2019-01-02 3
2019-01-03 3
2019-01-04 3
2020-01-01 4
2020-01-02 4
2020-01-03 4
2020-01-04 4
dtype: int64
>>> (~m2).cumsum()
2017-01-01 1
2017-01-02 1
2017-01-03 1
2017-01-04 2
2018-01-01 2
2018-01-02 2
2018-01-03 2
2018-01-04 3
2019-01-01 4
2019-01-02 5
2019-01-03 6
2019-01-04 6
2020-01-01 6
2020-01-02 6
2020-01-03 7
2020-01-04 7
Name: TMIN, dtype: int64
Finally group the index of the datframe on the above blocks and aggregate using first, last and count to get the result:
>>> out
first last count
0 2017-01-02 2017-01-03 2
1 2018-01-01 2018-01-03 3
2 2019-01-04 2019-01-04 1
3 2020-01-01 2020-01-02 2
4 2020-01-04 2020-01-04 1
trying something different with more_iterools (standard python lib)
from more_itertools import consecutive_groups
m = df[df['TMIN'].isna()]
l = [list(i) for i in consecutive_groups(m.index.map(pd.Timestamp.toordinal))]
d = {pd.Timestamp.fromordinal(ele):e for e,item in enumerate(l) for ele in item}
out = (pd.Series(m.index,index=m.index.map(d))
.groupby(level=0).agg(['min','max','count']))
#out.columns = ['Start_Date','End date','number of contiguous missing values']
print(out)
min max count
0 2017-01-02 2017-01-03 2
1 2018-01-01 2018-01-03 3
2 2019-01-04 2019-01-04 1
3 2020-01-01 2020-01-02 2
4 2020-01-04 2020-01-04 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With