Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas string addition across columns

Tags:

python

pandas

This is perfectly legal in Python:

In [1]: 'abc' + 'def'
Out[1]: 'abcdef'

If I have an all text Pandas DataFrame, like the example below:

In [2]: df = pd.DataFrame([list('abcd'), list('efgh'), list('ijkl')],
                          columns=['C1','C2','C3','C4'])
        df.loc[[0,2], ['C2', 'C3']] = np.nan
        df
Out[2]:     C1  C2  C3  C4
        0   a   NaN NaN d
        1   e   f   g   h
        2   i   NaN NaN l

Is it possible to do the same with columns of the above DataFrame? Something like:

In [3]: df.apply(+, axis=1) # Or
        df.sum(axis=1)

Note that both of the statements above don't work. Using .str.cat() in a loop is easy, but I am looking for something better.


Expected output is:

Out[3]:    C
        0  ad
        1  efgh
        2  il
like image 389
Kartik Avatar asked Dec 05 '25 18:12

Kartik


2 Answers

You could do

df.fillna('').sum(axis=1)

Of course, this assumes that your dataframe is made up only of strings and NaNs.

like image 173
Ken Wei Avatar answered Dec 08 '25 15:12

Ken Wei


Option 1
stack

I wanted to add it for demonstration. We don't have to accept the rectangular nature of the dataframe and use stack. When we do, stack drops nan by default. Leaving us with a vector of strings and a pd.MultiIndex. We can groupby the first level of this pd.MultiIndex (which used to be row indices) and perform summation:

df.stack().groupby(level=0).sum()

0      ad
1    efgh
2      il
dtype: object

Option2
Use Masked Arrays np.ma.masked_array
I was motivated by @jezrael to post a faster solution (-:

pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
    ).filled('').sum(1),
    df.index
)

0      ad
1    efgh
2      il
dtype: object

Timing

df = pd.concat([df]*1000).reset_index(drop=True)

%%timeit
pd.Series(
    np.ma.masked_array(
        df.values,
        df.isnull().values,
        fill_value=''
    ).filled('').sum(1),
    df.index
)

1000 loops, best of 3: 860 µs per loop

%timeit (pd.Series(df.fillna('').values.sum(axis=1), index=df.index))

1000 loops, best of 3: 1.33 ms per loop
like image 43
piRSquared Avatar answered Dec 08 '25 15:12

piRSquared



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!