Conditional column selection in pandas

I want to select columns from a DataFrame according to a particular condition. I know it can be done with a loop, but my df is very large, so efficiency is crucial. A column qualifies if it contains either only non-NaN entries, or a run of NaNs followed by a run of only non-NaN entries.

Here is an example. Consider the following DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan], [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])

   0    1    2    3
0  1  NaN  2.0  NaN
1  2  NaN  5.0  NaN
2  4  8.0  NaN  1.0
3  3  2.0  NaN  2.0
4  3  2.0  5.0  NaN

From it, I would like to select only columns 0 and 1. Any advice on how to do this efficiently without looping?

asked by splinter
2 Answers

Logic

  • Count the nulls in each column. If the only nulls are at the beginning of the column, then the number of nulls equals the position of the first valid index.
  • Get the first valid index of each column.
  • Slice the index by the null counts and compare against the first valid indices. If they are equal, that's a good column.

cnull = df.isnull().sum()                      # NaNs per column
fvald = df.apply(pd.Series.first_valid_index)  # label of first non-NaN per column
cols = df.index[cnull] == fvald                # equal only if NaNs form a leading block
df.loc[:, cols]

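On the sample frame from the question, this keeps columns 0 and 1; a quick check:

print(df.loc[:, cols])
#    0    1
# 0  1  NaN
# 1  2  NaN
# 2  4  8.0
# 3  3  2.0
# 4  3  2.0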


Edited with speed improvements

Old answer:

def pir1(df):
    cnull = df.isnull().sum()
    fvald = df.apply(pd.Series.first_valid_index)
    cols = df.index[cnull] == fvald
    return df.loc[:, cols]

A much faster answer using the same logic:

def pir2(df):
    nulls = np.isnan(df.values)               # boolean mask of NaN positions
    null_count = nulls.sum(0)                 # NaNs per column
    first_valid = nulls.argmin(0)             # position of first non-NaN per column
    null_on_top = null_count == first_valid   # True if NaNs form a leading block
    filtered_data = df.values[:, null_on_top]
    filtered_columns = df.columns.values[null_on_top]
    return pd.DataFrame(filtered_data, df.index, filtered_columns)

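A rough way to compare the two; the test frame below is an assumption for illustration (not from the original post), and absolute timings are machine-dependent:

from timeit import timeit

# hypothetical test frame: leading-NaN blocks in every other column
big = pd.DataFrame(np.random.rand(100000, 40))
big.iloc[:50, ::2] = np.nan

print(timeit(lambda: pir1(big), number=10))   # pandas apply per column
print(timeit(lambda: pir2(big), number=10))   # pure numpy, much faster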

answered by piRSquared


Consider a DF which has NaNs in various possible locations.

1. NaNs present on both sides:

Create a mask by replacing all NaNs with 0s and finite values with 1s:

mask = np.where(np.isnan(df), 0, 1)

Take its element-wise difference down each column, then the absolute value of those differences. The logic here is that whenever a column's differences contain all three unique values (namely -1, 1, 0), that column must be discarded, as there would be a break in its sequence.

The idea is to take the sum of the absolute differences and keep the columns wherever that sum is less than 2 (after taking the absolute value, an unbroken column yields at most a single 1). So, for the extreme case, we get a sum of 2, and those columns are certainly disjoint and must be discarded.

criteria = pd.DataFrame(mask, columns=df.columns).diff(1).abs().sum().lt(2)
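To see why the threshold of 2 works, here is a single-column illustration (col and m are hypothetical names introduced here, not taken from the DF above):

col = pd.Series([1, np.nan, np.nan, 1, 1])    # finite run broken by interior NaNs
m = pd.Series(np.where(col.isna(), 0, 1))     # mask: 1 0 0 1 1
print(m.diff(1).abs().sum())                  # 2.0 -> not < 2, column discarded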

Finally, apply this condition to the columns to get the desired result, with NaNs confined to one contiguous portion and finite values in the other:

df.loc[:, criteria]


2. NaNs present on top:

Here a column qualifies only if its mask never steps from 1 down to 0, i.e. its differences never contain -1:

mask = np.where(np.isnan(df), 0, 1)
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).ne(-1).all()
df.loc[:, criteria]
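Applied to the sample frame from the question (reusing the df defined there), this keeps exactly columns 0 and 1:

print(df.loc[:, criteria].columns.tolist())   # [0, 1]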


answered by Nickil Maveli


