I am taking an ML course on Udemy and am currently reading about feature engineering. There is a need to remove duplicate columns (features) from the dataset, and the author has suggested two versions of the code.
Data Set Download Link
Version 1:
Version 1 transposes the data and then applies the drop_duplicates() method, as follows:
data_unique = data.T.drop_duplicates(keep='first').T
This code took around 9 seconds on my PC to find 52 duplicate features out of 350. The shape of the data is (92500, 350), and my Windows PC has a dual-core i5, 16 GB of RAM and a 500 GB SSD.
Runtime: 9.71 s ± 299 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
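For intuition, here is what Version 1 does on a tiny made-up frame (the column names are hypothetical): it keeps the first of each group of identical columns and drops the later ones.
import pandas as pd

# toy frame: 'b' duplicates 'a', and 'd' duplicates 'c'
data = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3],
                     'c': [4, 5, 6], 'd': [4, 5, 6]})

data_unique = data.T.drop_duplicates(keep='first').T
print(data_unique.columns.tolist())  # ['a', 'c']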
Version 2:
The instructor has suggested one more method, as follows:
# check for duplicated features in the training set
duplicated_feat = []
for i in range(0, len(X_train.columns)):
    if i % 10 == 0:  # this helps me understand how the loop is going
        print(i)

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            duplicated_feat.append(col_2)
Runtime: 2min 16s ± 4.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Eventually, this took more than 2 minutes to find the duplicated features, but the instructor claims that this is the faster method when you have big data. Based on my findings, I am not convinced by that claim.
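For anyone who wants to reproduce the comparison without downloading the dataset, here is a rough benchmark sketch on synthetic data of a similar shape (the values and which columns are duplicated are made up):
import time
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic frame roughly matching mine: 92500 rows, 350 columns,
# where the last 50 columns duplicate the first 50
base = pd.DataFrame(rng.integers(0, 100, size=(92500, 300)))
dups = base.iloc[:, :50].copy()
dups.columns = range(300, 350)
X_train = pd.concat([base, dups], axis=1)

# Version 1: transpose + drop_duplicates
t0 = time.perf_counter()
v1 = X_train.T.drop_duplicates(keep='first').T
print('transpose + drop_duplicates:', round(time.perf_counter() - t0, 2), 's')

# Version 2: pairwise equals() loop
t0 = time.perf_counter()
duplicated_feat = []
for i in range(len(X_train.columns)):
    col_1 = X_train.columns[i]
    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            duplicated_feat.append(col_2)
print('pairwise equals loop:', round(time.perf_counter() - t0, 2), 's')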
The best way to do this is to use NumPy to find the indices of the unique columns (axis=1) and then slice the original DataFrame.
import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')
_, idx = np.unique(df.to_numpy(), axis=1, return_index=True)  # index of the first occurrence of each unique column
df_uniq = df.iloc[:, np.sort(idx)]  # slice, preserving the original column order
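If you also want the names of the duplicated features that get dropped, the same indices can be reused; a small sketch, continuing from the snippet above:
# columns kept (first occurrence of each unique column) vs. columns removed
kept = df.columns[np.sort(idx)]
dropped = df.columns.difference(kept)
print(len(dropped), 'duplicated features:', list(dropped))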
Some timings on my machine:
# First, a sanity check that they are equivalent (luckily all values are non-null)
(df_uniq == df.T.drop_duplicates(keep='first').T).all().all()
True
%%timeit
_, idx = np.unique(df.to_numpy(), axis=1, return_index=True)
df_uniq = df.iloc[:, np.sort(idx)]
#3.11 s ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.T.drop_duplicates(keep='first').T
#25.9 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I won't even bother timing the loop because it's just bad.