How can I create a label encoder utilizing only numpy (and not sklearn LabelEncoder)?

Question

I am trying to recreate something similar to sklearn.preprocessing.LabelEncoder.

However, I do not want to use sklearn or pandas. I would like to use only numpy and the Python standard library. Here's what I would like to achieve:

import numpy as np
input = np.array([['hi', 'there'],
                  ['scott', 'james'],
                  ['hi', 'scott'],
                  ['please', 'there']])

# Output would look like
np.array([[0, 0],
          [1, 1],
          [0, 2],
          [2, 0]])

It would also be great to be able to map it back as well, so a result would then look exactly like the input again.


ALollz · Accepted Answer

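Here's a simple comprehension using the return_inverse result from np.unique. Note that np.unique sorts the unique values, so the codes follow each column's sorted order rather than the order of first appearance: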

arr = np.array([['hi', 'there'], ['scott', 'james'],
                ['hi', 'scott'], ['please', 'there']])

np.column_stack([np.unique(arr[:, i], return_inverse=True)[1] for i in range(arr.shape[1])])

array([[0, 2],
       [2, 0],
       [0, 1],
       [1, 2]], dtype=int64)
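
Because the codes above follow sorted order, they differ from the first-appearance numbering shown in the question. If that exact numbering matters, one possible sketch is to re-rank the sorted uniques by where each first occurs, using return_index. The helper name encode_first_seen is hypothetical, not part of NumPy:

```python
import numpy as np

def encode_first_seen(column):
    """Encode a 1-D array with codes assigned in order of first appearance."""
    # uniques is sorted; first_idx[k] is the position where uniques[k] first occurs.
    uniques, first_idx, inverse = np.unique(
        column, return_index=True, return_inverse=True
    )
    # order lists the sorted uniques in first-appearance order.
    order = np.argsort(first_idx)
    # rank[k] is the first-appearance code of uniques[k].
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse], uniques[order]

arr = np.array([['hi', 'there'], ['scott', 'james'],
                ['hi', 'scott'], ['please', 'there']])

codes = np.column_stack(
    [encode_first_seen(arr[:, i])[0] for i in range(arr.shape[1])]
)
print(codes)
```

The second return value holds the labels in code order, so it can be kept around for decoding later.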

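Or applying along the axis. This form relies on NumPy packing the tuple returned by np.unique into a ragged object array, which newer NumPy releases may reject with a ValueError, so the comprehension above is the safer choice: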

np.column_stack(np.apply_along_axis(np.unique, 0, arr, return_inverse=True)[1])
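
A plain loop gives the same encoding without depending on how apply_along_axis packs the tuple returned by np.unique, and it also keeps each column's unique labels so the encoding can be reversed. This is only a sketch; the function name encode_columns is hypothetical:

```python
import numpy as np

def encode_columns(arr):
    """Return (codes, classes): integer codes per column and each column's sorted labels."""
    classes, codes = [], []
    for i in range(arr.shape[1]):
        uniques, inverse = np.unique(arr[:, i], return_inverse=True)
        classes.append(uniques)
        codes.append(inverse)
    return np.column_stack(codes), classes

arr = np.array([['hi', 'there'], ['scott', 'james'],
                ['hi', 'scott'], ['please', 'there']])

encoded, classes = encode_columns(arr)

# Decoding is fancy indexing: look up each code in its column's label array.
decoded = np.column_stack(
    [labels[encoded[:, i]] for i, labels in enumerate(classes)]
)
print(encoded)
print(decoded)
```

Storing classes alongside the encoded array plays roughly the same role as the classes_ attribute on sklearn's LabelEncoder.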

Greg Hilston · Answer

I was talking to @Scott Stoltzman and we spitballed a way to reverse the accepted answer.

One can either carry the original arr along throughout the program or record the mappings for each column. For the latter, here's some simple, non-performant code, where arr2 is the encoded array produced by the accepted answer:

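# arr is the original array; arr2 is the encoded array from the accepted answer.
l = []

# Transposing lets us walk both arrays column by column.
for real_column, encoded_column in zip(arr.T, arr2.T):
    d = {}  # code -> original label for this column
    for real_element, encoded_element in zip(real_column, encoded_column):
        # Cast to plain Python types so the dicts print cleanly.
        d[int(encoded_element)] = str(real_element)
    l.append(d)
print(l)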

Doing this with the above yields:

[{0: 'hi', 2: 'scott', 1: 'please'}, {2: 'there', 0: 'james', 1: 'scott'}]
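
With the per-column dictionaries recorded, mapping the encoded array back is a lookup per element. A minimal sketch, assuming the mappings and encoded values shown above (the variable names mappings, encoded, and decoded are illustrative):

```python
import numpy as np

# Per-column code -> label dictionaries, as printed above.
mappings = [{0: 'hi', 2: 'scott', 1: 'please'},
            {2: 'there', 0: 'james', 1: 'scott'}]

# Encoded array from the accepted answer.
encoded = np.array([[0, 2],
                    [2, 0],
                    [0, 1],
                    [1, 2]])

# Rebuild each column by looking up every code, then stack the columns again.
decoded = np.column_stack([
    [mapping[int(code)] for code in encoded[:, i]]
    for i, mapping in enumerate(mappings)
])
print(decoded)
```

Like the code above, this favors readability over speed; for large arrays, indexing an array of labels by the codes is likely to be faster than dictionary lookups.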

Tags: python, pandas, numpy, scikit-learn

Scott Stoltzman
