I am trying to recreate something similar to sklearn.preprocessing.LabelEncoder. However, I do not want to use sklearn or pandas. I would like to use only numpy and the Python standard library. Here's what I would like to achieve:
import numpy as np
input = np.array([['hi', 'there'],
                  ['scott', 'james'],
                  ['hi', 'scott'],
                  ['please', 'there']])
# Output would look like
np.array([[0, 0],
          [1, 1],
          [0, 2],
          [2, 0]])
It would also be great to be able to map it back, so that the result looks exactly like the input again.
If this were in a spreadsheet, the input would look like this:
hi      there
scott   james
hi      scott
please  there

Here's a simple comprehension using the return_inverse result from np.unique:
arr = np.array([['hi', 'there'], ['scott', 'james'],
                ['hi', 'scott'], ['please', 'there']])
np.column_stack([np.unique(arr[:, i], return_inverse=True)[1] for i in range(arr.shape[1])])
array([[0, 2],
       [2, 0],
       [0, 1],
       [1, 2]], dtype=int64)
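Or apply np.unique along the axis. This relies on NumPy packing the tuple returned by np.unique into a ragged object array, which recent NumPy releases may reject with a ValueError, so the comprehension is the safer option: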
np.column_stack(np.apply_along_axis(np.unique, 0, arr, return_inverse=True)[1])
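The codes above differ from the question's expected output because np.unique sorts the values it finds. If first-appearance order matters, one possible approach is to ask np.unique for return_index as well and renumber the codes by where each value first occurs. This is only a sketch: the helper name encode_first_seen and the variables pairs, encoded and decoded are made up for illustration. Keeping each column's classes also gives a way back, since indexing the classes with the codes recovers the labels.

```python
import numpy as np

def encode_first_seen(column):
    """Return (codes, classes), with codes numbered by first appearance.

    Hypothetical helper, not part of numpy.
    """
    uniques, first_idx, inverse = np.unique(
        column, return_index=True, return_inverse=True)
    # Positions of the sorted unique values, listed in first-seen order.
    order = np.argsort(first_idx)
    # rank[j] is the first-seen code of the j-th sorted unique value.
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))
    return rank[inverse], uniques[order]

arr = np.array([['hi', 'there'], ['scott', 'james'],
                ['hi', 'scott'], ['please', 'there']])

# One (codes, classes) pair per column.
pairs = [encode_first_seen(arr[:, i]) for i in range(arr.shape[1])]
encoded = np.column_stack([codes for codes, _ in pairs])
print(encoded)

# Decoding: index each column's classes with that column's codes.
decoded = np.column_stack([classes[codes] for codes, classes in pairs])
print(np.array_equal(decoded, arr))
```

For the sample array this should reproduce the layout asked for in the question, and the decoded array should compare equal to the input.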
I was talking with @Scott Stoltzmann and we spitballed a way to reverse the accepted answer.
One can either carry the original arr along throughout the program or record the mappings for each column. If you do the latter, here's some simple, non-performant code to do so:
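# arr is the original array; arr2 is the encoded array from the accepted answer
l = []
for real_column, encoded_column in zip(arr.T, arr2.T):
    d = {}
    for real_element, encoded_element in zip(real_column, encoded_column):
        # Store plain Python int/str so the mapping prints cleanly
        d[int(encoded_element)] = str(real_element)
    l.append(d)
print(l)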
Doing this with the above yields:
[{0: 'hi', 2: 'scott', 1: 'please'}, {2: 'there', 0: 'james', 1: 'scott'}]
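To actually map the encoded array back, each column's dictionary can be applied to that column's codes. The following is a rough sketch rather than a tuned implementation. It rebuilds the same per-column dictionaries as the loop above, and the names mappings, decode and restored are hypothetical.

```python
import numpy as np

arr = np.array([['hi', 'there'], ['scott', 'james'],
                ['hi', 'scott'], ['please', 'there']])
# Encoded array, built as in the accepted answer.
arr2 = np.column_stack([np.unique(arr[:, i], return_inverse=True)[1]
                        for i in range(arr.shape[1])])

# One {code: label} dictionary per column, equivalent to the loop above.
mappings = [dict(zip(codes.tolist(), labels.tolist()))
            for labels, codes in zip(arr.T, arr2.T)]

def decode(encoded, mappings):
    """Replace each code with its label, column by column (hypothetical helper)."""
    columns = [[mapping[code] for code in column.tolist()]
               for column, mapping in zip(encoded.T, mappings)]
    return np.column_stack(columns)

restored = decode(arr2, mappings)
print(np.array_equal(restored, arr))
```

Like the loop above, this favours readability over speed. For large arrays, keeping the unique values returned by np.unique and indexing them with the codes would likely be faster.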