
LabelEncoder order of fit for a Pandas df

I am fitting a scikit-learn LabelEncoder on a column in a pandas df.

How is the order in which the encountered strings are mapped to integers determined? Is it deterministic?

More importantly, can I specify this order?

import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame(data=["first", "second", "third", "fourth"], columns=['x'])
le = preprocessing.LabelEncoder()
le.fit(df['x'])
print(list(le.classes_))
# this prints ['first', 'fourth', 'second', 'third']
encoded = le.transform(["first", "second", "third", "fourth"])
print(encoded)
# this prints [0 2 3 1]

I would expect le.classes_ to be ["first", "second", "third", "fourth"] and then encoded to be [0 1 2 3], since this is the order in which the strings appear in the column. Can this be done?

asked Dec 17 '25 19:12 by tkja


2 Answers

It's done in sorted order; for strings, that means alphabetical order. There's no documentation for this, but looking at the source code for LabelEncoder.transform we can see that the work is mostly delegated to the function numpy.setdiff1d, which has the following documentation:

Find the set difference of two arrays.

Return the sorted, unique values in ar1 that are not in ar2.

(Emphasis mine).

Note that since this is not documented, it is probably implementation-defined and may change between versions. It could be that only the version I looked at uses sorted order, and other versions of scikit-learn may change this behavior (for example, by no longer using numpy.setdiff1d).
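To make the sorted-order behaviour concrete, here is a minimal sketch in plain Python of what "sort the distinct labels, then use each label's position as its code" amounts to. It is only an illustration of the idea described above, not scikit-learn's actual code; the variable names are made up for the example.

```python
from bisect import bisect_left

labels = ["first", "second", "third", "fourth"]

# Sorted, de-duplicated labels play the role of le.classes_ here.
classes = sorted(set(labels))

# Each label's code is its position in the sorted list of classes.
encoded = [bisect_left(classes, label) for label in labels]

print(classes)  # ['first', 'fourth', 'second', 'third']
print(encoded)  # [0, 2, 3, 1]
```

This reproduces the output from the question: "fourth" sorts before "second", so it gets code 1 even though it appears last in the column.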

answered Dec 19 '25 08:12 by Mephy


I was also a bit surprised that I cannot provide an order to LabelEncoder. A one-line solution could look like this:

df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third', 'fourth'].index(x))
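
As a variation on the one-liner above, the order can be turned into a dictionary once instead of calling list.index for every row. The sketch below is plain Python; make_mapping is a hypothetical helper name, and the sample labels are assumed for illustration. It also shows how to derive the order of first appearance, which is what the question asked for.

```python
labels = ["first", "second", "third", "fourth", "second"]

# Order of first appearance: dict.fromkeys keeps insertion order
# and drops repeated labels.
appearance_order = list(dict.fromkeys(labels))


def make_mapping(order):
    """Map each label to its position in the given order (hypothetical helper)."""
    return {label: code for code, label in enumerate(order)}


mapping = make_mapping(appearance_order)

# Look each label up in the prebuilt dictionary.
encoded = [mapping[label] for label in labels]

print(appearance_order)  # ['first', 'second', 'third', 'fourth']
print(encoded)           # [0, 1, 2, 3, 1]
```

With pandas, the same dictionary can be passed to Series.map, for example df['col1'].map(mapping). A label missing from the dictionary would then come out as NaN rather than raising an error, so it is worth checking for that.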

answered Dec 19 '25 08:12 by SaTa


