I'm working with PyTorch now, and I'm missing a layer I had in TensorFlow: tf.keras.layers.StringLookup, which helped with processing string ids. Is there any workaround to do something similar with PyTorch?
An example of the functionality I'm looking for:
vocab = ["a", "b", "c", "d"]
data = tf.constant([["a", "c", "d"], ["d", "a", "b"]])
layer = tf.keras.layers.StringLookup(vocabulary=vocab)
layer(data)
Outputs:
<tf.Tensor: shape=(2, 3), dtype=int64, numpy=
array([[1, 3, 4],
       [4, 1, 2]])>
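For reference, StringLookup's default behavior reserves index 0 for out-of-vocabulary tokens, which is why known tokens start at 1. That mapping can be sketched in plain Python (a minimal sketch; the names `table` and `string_lookup` are my own):

```python
# Plain-Python sketch of tf.keras.layers.StringLookup's default behavior:
# index 0 is reserved for out-of-vocabulary tokens, known tokens start at 1.
vocab = ["a", "b", "c", "d"]
table = {tok: idx + 1 for idx, tok in enumerate(vocab)}

def string_lookup(batch):
    """Map each string to its index; unknown strings map to 0."""
    return [[table.get(tok, 0) for tok in row] for row in batch]

data = [["a", "c", "d"], ["d", "a", "b"]]
print(string_lookup(data))  # [[1, 3, 4], [4, 1, 2]]
```

Wrapping the result in torch.tensor(...) gives the same int64 tensor that TF returns.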
You can use the torchnlp package (installed with pip install pytorch-nlp):
from torchnlp.encoders import LabelEncoder
data = ["a", "c", "d", "e", "d"]
encoder = LabelEncoder(data, reserved_labels=['unknown'], unknown_index=0)
enl = encoder.batch_encode(data)
print(enl)
tensor([1, 2, 3, 4, 3])
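To see where those indices come from, the mapping LabelEncoder builds can be reproduced in plain Python (a sketch, assuming index 0 is the reserved unknown label and the remaining tokens are numbered by first appearance):

```python
data = ["a", "c", "d", "e", "d"]

# Reserve index 0 for "unknown", then assign indices by first appearance,
# mirroring LabelEncoder(data, reserved_labels=['unknown'], unknown_index=0).
index = {"unknown": 0}
for tok in data:
    index.setdefault(tok, len(index))

encoded = [index.get(tok, 0) for tok in data]
print(encoded)  # [1, 2, 3, 4, 3]
```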
You can use collections.Counter along with torchtext's vocab object to construct a lookup from your vocabulary. You can then easily pass sequences to it and get their encodings as a tensor:
from torchtext.vocab import vocab
from collections import Counter
import torch

tokens = ["a", "b", "c", "d"]
samples = [["a", "c", "d"], ["d", "a", "b"]]

# Build string lookup; Counter preserves insertion order, so "a" gets index 0
lookup = vocab(Counter(tokens))

>>> torch.tensor([lookup(s) for s in samples])
tensor([[0, 2, 3],
        [3, 0, 1]])