
What's the best way to find a missing row in a pandas DataFrame?

Tags: python, pandas

I'm new to Python and pandas, and I'm trying to use pandas to check whether the data in a DataFrame is continuous. For example:

    thread  sequence      start      end
14       1       114    1647143  1672244
15       1       115    1672244  1689707
16       1       116    1689707  1713090
17       1       118    1735352  1760283
18       1       119    1760283  1788062
19       1       120    1788062  1789885
20       1       121    1789885  1790728

Each row has 4 columns. In general, sequence should increase in steps of 1, so if everything is correct it would read 116, 117, 118, ... like the output of range(). In this example, however, the row with sequence == 117 is missing.

I tried to find it, but I don't know how. Checking the sequence values one by one would be inefficient. The desired output would either report the missing row or fill it in with NaN.

Any good tips or suggestions would be helpful.

Castor asked Jan 20 '26 18:01


2 Answers

A faster method using RangeIndex:

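import pandas as pd

# Expected values; the stop is exclusive, which is fine because max() is always present in df
seq = pd.RangeIndex(df.sequence.min(), df.sequence.max())
seq[~seq.isin(df.sequence)].values
# array([117])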
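
The question also asks about filling the missing row with NaN. A minimal sketch of one way to do that, assuming the sample data from the question is rebuilt by hand (the names `full`, `filled`, and `missing` are illustrative, not from the original answer): reindex the frame on the complete range of sequence values.

```python
import pandas as pd

# Sample rows from the question, with sequence 117 missing.
df = pd.DataFrame(
    {
        "thread": [1, 1, 1, 1, 1, 1, 1],
        "sequence": [114, 115, 116, 118, 119, 120, 121],
        "start": [1647143, 1672244, 1689707, 1735352, 1760283, 1788062, 1789885],
        "end": [1672244, 1689707, 1713090, 1760283, 1788062, 1789885, 1790728],
    },
    index=range(14, 21),
)

# Full expected range; +1 because the stop of a RangeIndex is exclusive.
full = pd.RangeIndex(df["sequence"].min(), df["sequence"].max() + 1, name="sequence")

# Reindexing on the full range adds an all-NaN row for each absent sequence value.
filled = df.set_index("sequence").reindex(full).reset_index()

# The rows that were added are the ones where the other columns are NaN.
missing = filled.loc[filled["thread"].isna(), "sequence"].tolist()
print(missing)
```

Note that the other columns become float after reindexing, since NaN cannot be stored in an integer column.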
cs95 answered Jan 22 '26 07:01


If you just want to get the missing sequence values you can do something like this:

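>>> import numpy as np
>>> import pandas as pd
>>> # assumes df is sorted by sequence, so the first and last rows hold the min and max
>>> seq = pd.DataFrame(np.arange(df.iloc[0].sequence, df.iloc[-1].sequence))
>>> seq[~seq[0].isin(df.sequence)]
     0
3  117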
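
If the rows are not guaranteed to be sorted, the first and last rows may not hold the smallest and largest sequence values. A small sketch of a variant that does not depend on row order, using `np.setdiff1d` on a hypothetical shuffled copy of the question's sequence column:

```python
import numpy as np
import pandas as pd

# Hypothetical unsorted sample: the question's sequence values, shuffled.
df = pd.DataFrame({"sequence": [118, 114, 121, 116, 115, 120, 119]})

# Every value between the smallest and largest sequence, inclusive.
expected = np.arange(df["sequence"].min(), df["sequence"].max() + 1)

# Values that are expected but absent from the column.
missing = np.setdiff1d(expected, df["sequence"].to_numpy())
print(missing.tolist())
```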
Cory Madden answered Jan 22 '26 08:01