I have a large pandas DataFrame with shape (700000, 5000) containing columns of mixed dtypes (mostly int8, some float64, and a couple of datetime64[ns]). For each row, I want to set certain columns to zero whenever another column in that row is zero.
Iterating over the dataframe and setting the values with iloc is super slow. I've tried both iterrows and itertuples, e.g.:
1. iterrows:

ix_1 = 3
ix_to_change = [20, 24, 51]  # actually it is almost 5000 columns to change

for i, row in df.iterrows():
    if not row[ix_1]:
        df.iloc[i, ix_to_change] = 0
2. itertuples:

ix_1 = 3
ix_to_change = [20, 24, 51]  # actually it is almost 5000 columns to change

for row in df.itertuples():
    if not row[ix_1 + 1]:
        df.iloc[row[0], ix_to_change] = 0
I have also tried using pandas indexing, but it is also very slow (though better than iterrows or itertuples).
3. pandas loc & iloc

df.loc[df.iloc[:, ix_1] == 0, df.columns[ix_to_change]] = 0
I've then tried dropping down to the underlying numpy array which works fine in terms of performance, but I run into problems with the dtypes.
It quickly iterates through the underlying array, but the new dataframe has all 'object' dtypes. If I try to set the dtypes per column (as in this example) it fails on the datetime columns - possibly because they contain NaT items.
4. numpy

X = df.values
for i, x in enumerate(X):
    if not x[ix_1]:
        X[i].put(ix_to_change, 0)

original_dtypes = df.dtypes
df = pd.DataFrame(data=X, index=df.index, columns=df.columns)
for col, col_dtype in original_dtypes.items():
    df[col] = df[col].astype(col_dtype)
Is there a better way for me to make the update in the first place?
Or if not, how should I go about keeping my dtypes the same (the datetime columns are not in the list of columns to change in case that is relevant)?
Or maybe there's a better way for me to update the original dataframe with my updated numpy array where I only update the changed columns (all of which are int8)?
As requested in the comments, here is a minimal example illustrating how int8 dtypes become object dtypes after dropping into numpy. To be clear, this is only an issue for method 4 above (which is the only non-slow method I have so far - if I can fix this dtype issue):
import pandas as pd

df = pd.DataFrame({'int8_col': [10, 11, 12],
                   'float64_col': [1.5, 2.5, 3.5]})
df['int8_col'] = df['int8_col'].astype('int8')
df['datetime64_col'] = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])
>>> df.dtypes
float64_col              float64
int8_col                    int8
datetime64_col    datetime64[ns]
dtype: object
X = df.values
# At this point in real life I modify the int8 column(s) only in X
new_df = pd.DataFrame(data=X, index=df.index, columns=df.columns)
>>> new_df.dtypes
float64_col object
int8_col object
datetime64_col object
dtype: object
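For reference, a check of my own (not part of the minimal example above) suggests the object upcast only happens when .values has to reconcile mixed dtypes; selecting a single-dtype subset of columns first keeps the native dtype:

```python
import pandas as pd

# same frame as in the minimal example above
df = pd.DataFrame({'int8_col': [10, 11, 12],
                   'float64_col': [1.5, 2.5, 3.5]})
df['int8_col'] = df['int8_col'].astype('int8')
df['datetime64_col'] = pd.to_datetime(['2018-01-01', '2018-01-02', '2018-01-03'])

# mixed dtypes force .values into a single object array...
X_all = df.values                   # X_all.dtype is object

# ...but a single-dtype column subset keeps its native dtype
X_int8 = df[['int8_col']].values    # X_int8.dtype is int8
```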
For Pandas / NumPy efficiency, don't use mixed types (object dtype) within a column. There are methods available to convert series to numeric and then manipulate them efficiently.
You can use pd.DataFrame.select_dtypes to determine numeric columns. Assuming these are the only ones where you wish to update values, you can then feed these to pd.DataFrame.loc.
It quickly iterates through the underlying array, but the new dataframe has all 'object' dtypes.
Given you are left with object dtype series, it seems your definition of ix_to_change includes non-numeric series. In this case, you should convert all numeric columns to numeric dtype. For example, using pd.to_numeric:
cols = df.columns[ix_to_change]  # ix_to_change holds positions, so map to labels first
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Pandas / NumPy will not help with object dtype series in terms of performance, if this is what you are after. These series are represented internally as a sequence of pointers, much like list.
Here's an example to demonstrate what you can do:
import pandas as pd
import numpy as np

df = pd.DataFrame({'key': [0, 2, 0, 4, 0],
                   'A': [0.5, 1.5, 2.5, 3.5, 4.5],
                   'B': [2134, 5634, 134, 63, 1234],
                   'C': ['fsaf', 'sdafas', 'dsaf', 'sdgf', 'fdsg'],
                   'D': [np.nan, pd.to_datetime('today'), np.nan, np.nan, np.nan],
                   'E': [True, False, True, True, False]})

numeric_cols = df.select_dtypes(include=[np.number]).columns
df.loc[df['key'] == 0, numeric_cols] = 0
Result:

     A     B       C           D      E  key
0  0.0     0    fsaf         NaT   True    0
1  1.5  5634  sdafas  2018-09-05  False    2
2  0.0     0    dsaf         NaT   True    0
3  3.5    63    sdgf         NaT   True    4
4  0.0     0    fdsg         NaT  False    0
No conversion to object dtype for the numeric columns, as expected:

print(df.dtypes)

A             float64
B               int64
C              object
D      datetime64[ns]
E                bool
key             int64
dtype: object
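On the questioner's third option (updating via the numpy array while keeping dtypes), one possible sketch, not from the answer above: pull only the int8 columns into a homogeneous block, zero the masked rows there, and assign the block back. The column names, the small frame, and the 'key' condition column are all illustrative stand-ins:

```python
import numpy as np
import pandas as pd

# small stand-in for the (700000, 5000) frame in the question
df = pd.DataFrame({'key': [0, 2, 0, 4],
                   'a': np.array([1, 2, 3, 4], dtype='int8'),
                   'b': np.array([5, 6, 7, 8], dtype='int8'),
                   'when': pd.to_datetime(['2018-01-01'] * 4)})

int8_cols = ['a', 'b']                 # stands in for the ~5000 columns to change
mask = (df['key'] == 0).values         # boolean row mask from the condition column

block = df[int8_cols].values           # homogeneous int8 block, no object upcast
block[mask] = 0                        # fast vectorized update on the numpy side
df[int8_cols] = block                  # write back; columns stay int8
```

Because only single-dtype columns ever pass through .values, nothing is upcast to object, and the datetime columns are never touched.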