inplace=True doesn't work for subset data

Question

I am trying to fill missing values in a subset of rows. I am using inplace=True in fillna(), but it has no effect in my Jupyter notebook. The attached picture still shows NaN in the first two rows of the Surface column, and I am not sure why.

I have to write it like this to make it work. Why? Thank you for your help.

data.loc[mark,'Surface']=data.loc[mark,'Surface'].fillna(value='TEST')

Here is my code:

mark=(data['Pad']==51) | (data['Pad']==52) | (data['Pad']==53) | (data['Pad']==54) | (data['Pad']==55)

data.loc[mark,'Surface'].fillna(value='TEST',inplace=True)

This one works:

data.loc[mark,'Surface']=data.loc[mark,'Surface'].fillna(value='TEST')


Cameron Riddell · Accepted Answer

The main issue you're bumping into here is that pandas does not have very explicit view-versus-copy rules. Your result suggests that .loc is returning a copy instead of a view. While pandas does try to return a view from .loc, there are quite a few caveats.

After playing around a little, it seems that indexing with a boolean or positional mask returns a copy. You can verify this with the private _is_view attribute:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Pad": range(40, 60), "Surface": np.nan})

print(df)
   Pad  Surface
0   40      NaN
1   41      NaN
2   42      NaN
.  ...      ...
19  59      NaN


# Create masks
bool_mask = df["Pad"].isin(range(51, 56))
positional_mask = np.where(bool_mask)[0]

# Check `_is_view` after simple .loc:
>>> df.loc[bool_mask, "Surface"]._is_view
False

>>> df.loc[positional_mask, "Surface"]._is_view
False

So neither of the approaches above returns a "view" of the original data, which is why performing an inplace operation does not change the original dataframe. To get a view from .loc, you need to use a slice as your row index.
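As a minimal sketch of what "copy" means in practice (the frame below is made up for illustration, with an object-dtype "Surface" column so that dtype plays no role), filling the mask-selected subset inplace changes only the temporary Series and leaves the dataframe untouched:

```python
import numpy as np
import pandas as pd

# Hypothetical data: object-dtype "Surface" so the fill value fits the column.
df = pd.DataFrame({
    "Pad": range(50, 56),
    "Surface": pd.Series([np.nan, np.nan, "A", np.nan, "B", np.nan], dtype=object),
})

# Selects the same Pad values (51 to 55) as the chained == comparisons in the question.
mask = df["Pad"].isin(range(51, 56))

subset = df.loc[mask, "Surface"]      # boolean mask -> a new, independent Series
subset.fillna("TEST", inplace=True)   # fills that new Series only

missing_in_subset = int(subset.isna().sum())
missing_in_df = int(df.loc[mask, "Surface"].isna().sum())
print(missing_in_subset, missing_in_df)  # 0 3
```

The subset has no NaN left, while the same rows of df still contain all three of their missing values.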

>>> df.loc[10:15, "Surface"]._is_view
True

This still won't resolve your issue, though, because the value you fill NaN with may change the dtype of the "Surface" column. In the example I have set up, "Surface" has a float64 dtype, and filling NaN with the value "Test" forces a dtype change that is incompatible with the original dataframe. If your "Surface" column has an object dtype, you don't need to worry about this.

>>> df.dtypes
Pad          int64
Surface    float64

# this does not work because "Test" is incompatible with float64 dtype
>>> df.loc[10:15, "Surface"].fillna("Test", inplace=True)

# this works because 0.9 is an appropriate value for a float64 dtype
>>> df.loc[10:15, "Surface"].fillna(0.9, inplace=True)
>>> print(df)
    Pad  Surface
..  ...      ...
8    48      NaN
9    49      NaN
10   50      0.9
11   51      0.9
12   52      0.9
13   53      0.9
14   54      0.9
15   55      0.9
16   56      NaN
17   57      NaN
..  ...      ...
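How this plays out depends on the pandas version: with Copy-on-Write enabled (the default from pandas 3.0), an inplace method called on a .loc selection never updates the parent dataframe, and recent 2.x releases warn when a value of an incompatible dtype is set into a column. A version-independent sketch, assuming the column is meant to end up holding strings, is to cast it to object explicitly and then assign the filled values back:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Pad": range(40, 60), "Surface": np.nan})

# Make the dtype change explicit instead of relying on implicit upcasting.
df["Surface"] = df["Surface"].astype(object)

# With .loc, the label slice 10:15 includes both endpoints (six rows).
df.loc[10:15, "Surface"] = df.loc[10:15, "Surface"].fillna("Test")

filled = int((df["Surface"] == "Test").sum())
print(df["Surface"].dtype, filled)
```

Casting first keeps the dtype decision visible in the code rather than leaving it as a side effect of the fill value.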

TL;DR: don't rely on inplace in pandas in general. In the bulk of its operations it still creates a copy of the underlying data and then attempts to replace the original source with the new copy. Pandas is not memory efficient, so if you're worried about memory performance you may want to switch to something designed to be zero-copy from the ground up, like Vaex, instead of trying to work around pandas.

Your approach of assigning to the slice of the dataframe is the most appropriate and ensures the dataframe is updated correctly, as "inplace" as possible:

>>> df.loc[bool_mask, "Surface"] = df.loc[bool_mask, "Surface"].fillna("Test")
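An equivalent way to write this, sketched here on made-up data, folds the NaN check into the mask so that no fillna call is needed at all. The names in_pads and to_fill are only illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical data with an object-dtype "Surface" column.
df = pd.DataFrame({
    "Pad": [50, 51, 52, 53, 56],
    "Surface": pd.Series([np.nan, np.nan, "A", np.nan, np.nan], dtype=object),
})

in_pads = df["Pad"].isin(range(51, 56))    # rows with Pad 51..55
to_fill = in_pads & df["Surface"].isna()   # ...that are also missing

# Direct assignment through .loc writes into df itself.
df.loc[to_fill, "Surface"] = "TEST"
result = df["Surface"].tolist()
print(result)
```

Rows outside the Pad range keep their NaN, and existing values such as "A" are not overwritten.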

Tags:

python

roudan
