
Assignment to containers in Pandas

Tags:

python

pandas

I want to replace None entries in a specific column in Pandas with an empty list.

Note that some entries in this column may already have an empty list in them, and I don't want to touch those.

I have tried:

indices = np.equal(df[col],None)
df[col][indices] = []

and

indices = np.equal(df[col],None)
df[col][indices] = list()

but both solutions fail with:

ValueError: Length of replacements must equal series length

Why? How can I update those specific rows with an empty list?

Josh asked Sep 05 '25 23:09


2 Answers

Assigning a list as a cell value is not allowed through indexed assignment, and storing lists inside a pandas object is not recommended at all.

You can, however, get such a column if you create the frame from scratch:

In [50]: DataFrame({ 'A' : [[],[],1]})
Out[50]: 
    A
0  []
1  []
2   1

[3 rows x 1 columns]

The reason this is not allowed is that without indices (e.g. as in numpy), you can do something like this:

In [51]: df = DataFrame({ 'A' : [1,2,3] })

In [52]: df.loc[df['A'] == 2] = [ 5 ]

In [53]: df
Out[53]: 
   A
0  1
1  5
2  3

[3 rows x 1 columns]

You can do an assignment where the number of True values in the mask equals the length of the list/tuple/ndarray on the rhs (i.e. the value you are setting). Pandas allows this, as well as a length that is exactly equal to the lhs, and a scalar. Anything else is expressly disallowed because it's ambiguous (e.g. do you mean to align it or not?).

For example, imagine:

In [54]: df = DataFrame({ 'A' : [1,2,3] })

In [55]: df.loc[df['A']<3] = [5]
ValueError: cannot set using a list-like indexer with a different length than the value

A 0-length list/tuple/ndarray is considered an error not because it can't be done, but because it is usually a user error and it's unclear what should happen.
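For contrast, here is a minimal sketch of the two forms described above that pandas does accept: a scalar, and a list whose length matches the number of True values in the mask. The frame and the values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]})
mask = df["A"] < 3  # two rows match

# A scalar is broadcast to every selected row.
df.loc[mask, "A"] = 0

# A list is accepted when its length equals the number of True values in the mask.
df.loc[mask, "A"] = [10, 20]

print(df["A"].tolist())  # [10, 20, 3]
```

An empty list fits neither form, which is why the assignment in the question is rejected.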

Bottom line: don't use lists inside of a pandas object. It's not efficient, and it makes interpretation difficult or impossible.

Jeff answered Sep 08 '25 22:09


Edit: Preserved my original answer below, but I put it up without testing it and it actually doesn't work for me.

import pandas as pd
import numpy as np
ser1 = pd.Series(['hi',None,np.nan])
ser2 = pd.Series([5,7,9])
df = pd.DataFrame([ser1,ser2]).T

This is janky, I know. Also, apparently the DataFrame constructor (but not the Series constructor) coerces None to np.nan. No idea why.

df.loc[1,0] = None

So now we have

    0     1
0   'hi'  5
1   None  7
2   NaN   9

df.columns = ['col1','col2']
mask = np.equal(df['col1'], None)
df.loc[mask, 'col1'] = []

But this doesn't assign anything. The dataframe looks the same as before. I'm following the recommended usage from the docs and assigning base types (strings and numbers) works. So for me the problem is assigning objects to dataframe entries. No idea what's up.
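One possible workaround, offered as a sketch and not as a recommended pattern, is to skip masked assignment altogether and rebuild the column element by element with Series.apply. The sample data and the name `existing` are assumptions for illustration; the check `v is None` leaves NaN and lists already in the column untouched.

```python
import pandas as pd

existing = []  # an empty list already present in the column
df = pd.DataFrame(
    {"col1": pd.Series(["hi", None, existing, None], dtype=object)}
)

# Build a fresh list for each None; every other value is passed through as-is.
df["col1"] = df["col1"].apply(lambda v: [] if v is None else v)

print(df["col1"].tolist())  # ['hi', [], [], []]
```

Because the lambda creates a new list on every call, the replaced cells do not share one list object. The column still holds Python objects, so the efficiency concerns raised in the other answer apply.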


(Original answer)

Two things:

  1. I'm not familiar with np.equal, but pandas.isnull() should also work if you want to capture all null values.
  2. You are doing what is called "chained assignment". I don't understand the problem fully, but I know it doesn't work. It is covered in the docs.

Try this:

mask = pandas.isnull(df[col])
df.loc[mask, col] = list()

Or, if you only want to catch None and not np.nan:

mask = np.equal(df[col].values, None) 
df.loc[mask, col] = list()

Note: While pandas.isnull works with None on dataframes, series, and arrays as expected, numpy.equal only works as expected with dataframes and arrays. For a pandas Series of all None, it will not return True for any of them. This is due to None only selectively behaving as np.nan. See BUG: None is not equal to None #20442.
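To sidestep the np.equal behaviour noted above, a None-only mask can be built with a plain identity check. This is a sketch under the assumption that the column has object dtype, so that None is stored as None and not converted to NaN.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["hi", None, np.nan], dtype=object)

# pd.isnull flags both None and NaN.
null_mask = pd.isnull(ser)

# An identity check flags None only and leaves NaN alone.
none_mask = pd.Series([v is None for v in ser], index=ser.index)

print(null_mask.tolist())  # [False, True, True]
print(none_mask.tolist())  # [False, True, False]
```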

exp1orer answered Sep 08 '25 22:09