
How to effectively update a dataframe's column without getting a SettingWithCopyWarning?

I have a dataframe with multiple columns and I simply want to update a column with new values: df['Z'] = df['A'] % df['C']/2. However, I keep getting a SettingWithCopyWarning message, even when I use the .loc[] method or when I drop() the column and add it again.

:75: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

The warning disappears with the .assign() method, but that is painfully slower. Here is a comparison:

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

%timeit df['Z'] = df['A'] % df['C']/2
119 ms ± 2.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.loc[:, 'Z'] = df['A'] % df['C']/2
118 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.assign(Z=df['A'] % df['C']/2)
857 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So what's the optimal way to update a column in the dataframe? Note that I don't have the option to create multiple copies of the same dataframe because of its huge size.

exan asked Oct 19 '25 14:10


1 Answer

tl;dr - make a copy of the slice using .copy(), or suppress the warning with pd.set_option('mode.chained_assignment', None).

There are some great posts about SettingWithCopyWarning. First off, this is just a warning and not an error. Most of the time it is warning you about behavior that you either didn't intend or don't really care about.

Now, let's avoid this warning. Using your data, I am first going to reproduce the warning on purpose.

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

# Executing df['Z'] = df['A'] % df['C']/2 on the original dataframe raises no warning.
df['Z'] = df['A'] % df['C']/2

# However, let's slice this dataframe, removing just the last row, using this syntax:
df_slice = df.loc[:1999998]
df_slice['Z'] = df_slice['A'] % df_slice['C']/2

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

In this case, the warning is letting you know that the assignment on the slice may also be changing the original df object.

df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
df_slice = df.loc[:1999998]
df_slice['Z'] = df_slice['A'] % df_slice['C']/2
all(df.loc[:1999998, 'Z'] == df_slice['Z'])

This returns the above warning and True: modifying the slice did change the original df object.
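If changing the original is actually what you want, one option is to follow the warning's own advice and write through .loc on the original dataframe, so that no intermediate slice object is involved. The following is only a minimal sketch on a small frame; the 1000-row size and the three-column layout are assumptions made to keep the example quick.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame in the question (size is an assumption).
df = pd.DataFrame(np.random.randn(1000, 3), columns=list("ACZ"))
last_z = df.loc[999, "Z"]  # remember the value in the row we do not touch

# Assign directly on the original frame; .loc label slices are inclusive,
# so :998 covers every row except the last one.
df.loc[:998, "Z"] = df.loc[:998, "A"] % df.loc[:998, "C"] / 2
```

Because the assignment targets df itself, only the new column values are allocated and no second dataframe is kept around.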

Now, to avoid the warning and leave the original object unchanged, use copy:

df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

df_slice = df.loc[:1999998].copy()
df_slice['Z'] = df_slice['A'] % df_slice['C']/2
all(df.loc[:1999998, 'Z'] == df_slice['Z'])

Returns no warning and False.

So, one way to retain the performance of the first and second methods is to call .copy() when creating your slice/view of the dataframe. However, you are correct that this takes extra memory. To limit that, overwrite your dataframe variable with the result of .copy().
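Overwriting the dataframe with its own copy might look like the sketch below. It assumes nothing else holds a reference to the original frame, in which case Python can free the original once the name is rebound; the small frame size is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame in the question (size is an assumption).
df = pd.DataFrame(np.random.randn(1000, 3), columns=list("ACZ"))

# Rebind the same name to an independent copy of the slice. The slice and
# the full frame coexist only briefly, while the copy is being made.
df = df.loc[:998].copy()

# The copy is not tied to any parent frame, so this is a plain column update.
df["Z"] = df["A"] % df["C"] / 2
```

Peak memory still rises while the copy is created, so this reduces how long two frames coexist rather than avoiding the copy entirely.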

Or you can turn this warning off using:

pd.set_option('mode.chained_assignment', None)
df = pd.DataFrame(data=np.random.randn(2000000, 26), 
                  columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))

df_slice = df.loc[:1999998]
df_slice['Z'] = df_slice['A'] % df_slice['C']/2
all(df.loc[:1999998, 'Z'] == df_slice['Z'])

Returns no warning and True.

In short, pandas sometimes creates a new object for a slice of a dataframe and sometimes returns a view of the original dataframe. Exactly when pandas does which is understood by few and, as far as I could find, not very well documented.

There is a strong hint as to when this warning will appear: check the _is_view attribute.

df_slice = df.loc[:1999998]
df_slice._is_view

Returns True, hence the SettingWithCopyWarning might appear.

df_slice = df.loc[:1999998].copy()
df_slice._is_view

Returns False.
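Since _is_view is a private attribute, a similar check can be sketched with the public NumPy function np.shares_memory. This is an illustration rather than an official pandas diagnostic: whether the plain slice shares memory with the original depends on the pandas version and settings, so only the result for the explicit copy can be relied on. The variable names and the small frame size are assumptions.

```python
import numpy as np
import pandas as pd

# Small stand-in for the large frame in the question (size is an assumption).
df = pd.DataFrame(np.random.randn(1000, 3), columns=list("ACZ"))

plain_slice = df.loc[:998]          # may or may not share memory with df
copied_slice = df.loc[:998].copy()  # an explicit copy owns its own data

# Compare the underlying buffers of column A in each frame.
slice_shares = np.shares_memory(df["A"].to_numpy(), plain_slice["A"].to_numpy())
copy_shares = np.shares_memory(df["A"].to_numpy(), copied_slice["A"].to_numpy())

print("plain slice shares memory:", slice_shares)
print("copied slice shares memory:", copy_shares)
```

If the plain slice reports shared memory, assignments on it are the ones that can trigger the warning discussed above.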

Scott Boston answered Oct 22 '25 03:10


