
Which version of duplicate feature columns removal in machine learning is faster and why?

Tags: python, pandas

I am taking an ML course on Udemy and am currently reading about feature engineering. There is a need to remove duplicate columns (features) from the dataset, and the author suggests two versions of the code.

Data Set Download Link

Version 1:
Version 1 transposes the DataFrame and then applies the drop_duplicates() method:

data_unique = data.T.drop_duplicates(keep='first').T

This code took around 9 seconds on my PC to find 52 duplicate features out of 350. The shape of the data is (92500, 350), and my Windows PC has a dual-core i5, 16 GB of RAM and a 500 GB SSD.
Runtime: 9.71 s ± 299 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
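To make the transpose idiom concrete, here is a minimal sketch on a hypothetical toy frame (the column names and values are made up for illustration): after transposing, duplicate columns become duplicate rows, which drop_duplicates() removes. Note that the transpose copies the whole dataset, which is part of why this version is not free.

```python
import pandas as pd

# Hypothetical small frame: columns 'a' and 'c' are exact duplicates.
data = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [1, 2, 3],
})

# Transpose so columns become rows, drop duplicate rows, transpose back.
data_unique = data.T.drop_duplicates(keep='first').T
print(list(data_unique.columns))  # ['a', 'b']
```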

Version 2:
The instructor also suggests a second method:

# check for duplicated features in the training set
duplicated_feat = []
for i in range(0, len(X_train.columns)):
    if i % 10 == 0:  # this helps me understand how the loop is going
        print(i)

    col_1 = X_train.columns[i]

    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            duplicated_feat.append(col_2)

Runtime: 2min 16s ± 4.97 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
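For completeness, the loop above only collects the names of duplicate columns; the removal step is not shown. A minimal sketch of the full round trip, on a hypothetical toy frame standing in for X_train (names and values are made up):

```python
import pandas as pd

# Hypothetical training frame; 'x2' duplicates 'x0'.
X_train = pd.DataFrame({
    'x0': [1, 2, 3],
    'x1': [7, 8, 9],
    'x2': [1, 2, 3],
})

# Pairwise comparison of every column against every later column.
duplicated_feat = []
for i in range(len(X_train.columns)):
    col_1 = X_train.columns[i]
    for col_2 in X_train.columns[i + 1:]:
        if X_train[col_1].equals(X_train[col_2]):
            duplicated_feat.append(col_2)

# Drop the duplicates the loop collected.
X_unique = X_train.drop(columns=duplicated_feat)
print(duplicated_feat)         # ['x2']
print(list(X_unique.columns))  # ['x0', 'x1']
```

Because the outer and inner loops together compare every pair of columns, the work grows quadratically with the number of features, which matches the much longer runtime observed.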

This took more than 2 minutes to find the duplicated features, yet the instructor claims it is the faster method for big data. Given my measurements, I am not convinced by that claim.

Asked Dec 08 '25 by Samual

1 Answer

The best way to do this is to use numpy to find the unique column indices (axis=1), then slice the original DataFrame.

import numpy as np
import pandas as pd
df = pd.read_csv('data.csv')

_, idx = np.unique(df.to_numpy(), axis=1, return_index=True)
df_uniq = df.iloc[:, np.sort(idx)]

Some timings on my machine:

# First, a sanity check that they are equivalent (luckily all values are non-null)
(df_uniq == df.T.drop_duplicates(keep='first').T).all().all()
True

%%timeit 
_, idx = np.unique(df.to_numpy(), axis=1, return_index=True)
df_uniq = df.iloc[:, np.sort(idx)]
#3.11 s ± 60.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.T.drop_duplicates(keep='first').T
#25.9 s ± 112 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I won't even bother timing the loop: it makes pairwise column comparisons in pure Python, so its cost grows quadratically with the number of features, and it is by far the slowest of the three.
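A minimal, self-contained check of the numpy approach on a hypothetical toy frame (made-up names and values; this assumes a uniform numeric dtype, since np.unique with axis=1 works on the underlying 2-D array). np.unique returns its output sorted, so sorting the returned indices restores the original column order:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame; 'c' duplicates 'a'.
df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': [1, 2, 3],
})

# Unique columns along axis=1; return_index gives the first occurrence of each.
_, idx = np.unique(df.to_numpy(), axis=1, return_index=True)

# np.unique sorts its output, so sort the indices to preserve column order.
df_uniq = df.iloc[:, np.sort(idx)]
print(list(df_uniq.columns))  # ['a', 'b']
```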

Answered Dec 09 '25 by ALollz