I have a large pandas DataFrame with shape (700000, 5000) containing columns of mixed dtypes (mostly int8, some float64, and a couple of datetime64[ns]). For each row, I want to set certain columns to zero whenever another column in that row is zero.
Iterating over the dataframe and setting the values with iloc is super slow. I've tried both iterrows and itertuples, e.g.:
1. iterrows:

ix_1 = 3
ix_to_change = [20, 24, 51]  # actually it is almost 5000 columns to change

for i, row in df.iterrows():
    if not row[ix_1]:
        df.iloc[i, ix_to_change] = 0
2. itertuples:

ix_1 = 3
ix_to_change = [20, 24, 51]  # actually it is almost 5000 columns to change

for row in df.itertuples():
    if not row[ix_1 + 1]:
        df.iloc[row[0], ix_to_change] = 0
I have also tried using pandas indexing, but it is also very slow (though better than iterrows or itertuples).
3. pandas loc & iloc

df.loc[df.iloc[:, ix_1] == 0, df.columns[ix_to_change]] = 0
I've then tried dropping down to the underlying numpy array which works fine in terms of performance, but I run into problems with the dtypes.
It quickly iterates through the underlying array, but the new dataframe has all 'object' dtypes. If I try to set the dtypes per column (as in this example) it fails on the datetime columns - possibly because they contain NaT items.
4. numpy

X = df.values
for i, x in enumerate(X):
    if not x[ix_1]:
        X[i].put(ix_to_change, 0)

original_dtypes = df.dtypes
df = pd.DataFrame(data=X, index=df.index, columns=df.columns)
for col, col_dtype in original_dtypes.items():
    df[col] = df[col].astype(col_dtype)
Is there a better way for me to make the update in the first place?
Or if not, how should I go about keeping my dtypes the same (the datetime columns are not in the list of columns to change in case that is relevant)?
Or maybe there's a better way for me to update the original dataframe with my updated numpy array where I only update the changed columns (all of which are int8)?
As requested in the comments, here is a minimal example illustrating how int8 dtypes become object dtypes after dropping into numpy. To be clear, this is only an issue for method 4 above (which is the only non-slow method I have so far - if I can fix this dtype issue):
import pandas as pd

df = pd.DataFrame({'int8_col': [10, 11, 12],
                   'float64_col': [1.5, 2.5, 3.5]})
df['int8_col'] = df['int8_col'].astype('int8')
df['datetime64_col'] = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])
>>> df.dtypes
float64_col              float64
int8_col                    int8
datetime64_col    datetime64[ns]
dtype: object
X = df.values
# At this point in real life I modify the int8 column(s) only in X
new_df = pd.DataFrame(data=X, index=df.index, columns=df.columns)
>>> new_df.dtypes
float64_col object
int8_col object
datetime64_col object
dtype: object
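For reference, a check of my own (not part of the minimal example above) suggests the object upcast only happens when .values has to reconcile mixed dtypes; selecting a single-dtype subset of columns first keeps the native dtype:

```python
import pandas as pd

# same frame as in the minimal example above
df = pd.DataFrame({'int8_col': [10, 11, 12],
                   'float64_col': [1.5, 2.5, 3.5]})
df['int8_col'] = df['int8_col'].astype('int8')
df['datetime64_col'] = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])

# mixed dtypes force .values into a single object array...
X_all = df.values                   # X_all.dtype is object

# ...but a single-dtype column subset keeps its native dtype
X_int8 = df[['int8_col']].values    # X_int8.dtype is int8
```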
For Pandas / NumPy efficiency, don't use mixed types (object dtype) within a column. There are methods available to convert series to numeric and then manipulate them efficiently.
You can use pd.DataFrame.select_dtypes to determine numeric columns. Assuming these are the only ones where you wish to update values, you can then feed these to pd.DataFrame.loc.
It quickly iterates through the underlying array, but the new dataframe has all 'object' dtypes.
Given you are left with object dtype series, it seems your definition of ix_to_change includes non-numeric series. In this case, you should convert all numeric columns to numeric dtype. For example, using pd.to_numeric:
cols = df.columns[ix_to_change]  # ix_to_change holds positions, so map to labels first
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Pandas / NumPy will not help with object dtype series in terms of performance, if this is what you are after. These series are represented internally as a sequence of pointers, much like list.
Here's an example to demonstrate what you can do:
import pandas as pd
import numpy as np

df = pd.DataFrame({'key': [0, 2, 0, 4, 0],
                   'A': [0.5, 1.5, 2.5, 3.5, 4.5],
                   'B': [2134, 5634, 134, 63, 1234],
                   'C': ['fsaf', 'sdafas', 'dsaf', 'sdgf', 'fdsg'],
                   'D': [np.nan, pd.to_datetime('today'), np.nan, np.nan, np.nan],
                   'E': [True, False, True, True, False]})

numeric_cols = df.select_dtypes(include=[np.number]).columns
df.loc[df['key'] == 0, numeric_cols] = 0
Result:

     A     B       C           D      E  key
0  0.0     0    fsaf         NaT   True    0
1  1.5  5634  sdafas  2018-09-05  False    2
2  0.0     0    dsaf         NaT   True    0
3  3.5    63    sdgf         NaT   True    4
4  0.0     0    fdsg         NaT  False    0
No conversion to object dtype for the numeric columns, as expected:

print(df.dtypes)

A             float64
B               int64
C              object
D      datetime64[ns]
E                bool
key             int64
dtype: object
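On the questioner's third option (updating via the numpy array while keeping dtypes), one possible sketch, not from the answer above: pull only the int8 columns into a homogeneous block, zero the masked rows there, and assign the block back. The column names, the small frame, and the 'key' condition column are all illustrative stand-ins:

```python
import numpy as np
import pandas as pd

# small stand-in for the (700000, 5000) frame in the question
df = pd.DataFrame({'key': [0, 2, 0, 4],
                   'a': np.array([1, 2, 3, 4], dtype='int8'),
                   'b': np.array([5, 6, 7, 8], dtype='int8'),
                   'when': pd.to_datetime(['2018-01-01'] * 4)})

int8_cols = ['a', 'b']                 # stands in for the ~5000 columns to change
mask = (df['key'] == 0).values         # boolean row mask from the condition column

block = df[int8_cols].values           # homogeneous int8 block, no object upcast
block[mask] = 0                        # fast vectorized update on the numpy side
df[int8_cols] = block                  # write back; columns stay int8
```

Because only single-dtype columns ever pass through .values, nothing is upcast to object, and the datetime columns are never touched.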