I am working on a multi-class classification problem using an LSTM and embeddings obtained from the Universal Sentence Encoder.
Previously I was using GloVe embeddings, which give the required input shape for the LSTM, (batch_size, timesteps, input_dim). Now I am planning to use the Universal Sentence Encoder, but I found that its output is 2D, [batch, feature]. How can I make the required changes?
LSTM + Universal Sentence Encoder
EMBED_SIZE = 512
module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)),
                 signature="default", as_dict=True)["default"]

seq_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
print("seq i", seq_input.shape, seq_input)
embedded_seq = Lambda(UniversalEmbedding,
                      output_shape=(EMBED_SIZE,))(seq_input)
print("EMD SEQ", embedded_seq.shape, type(embedded_seq))

# expected (timesteps, n_features) = (MAX_SEQUENCE_LENGTH, EMBED_SIZE) = (150, 512)
x_1 = LSTM(units=NUM_LSTM_UNITS,
           name='blstm_1',
           dropout=DROP_RATE_LSTM)(embedded_seq)
print(x_1)
This produces the following error:
seq i (?, 150) Tensor("input_8:0", shape=(?, 150), dtype=int32)
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0529 07:24:32.504808 140127577749376 saver.py:1483] Saver not created because there are no variables in the graph to restore
EMD SEQ (?, 512) <class 'tensorflow.python.framework.ops.Tensor'>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-34-ea634319205b> in <module>()
12 x_1 = LSTM(units=NUM_LSTM_UNITS,
13 name='blstm_1',
---> 14 dropout=DROP_RATE_LSTM)(embedded_seq)
15 print(x_1)
16
2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py in assert_input_compatibility(self, inputs)
309 self.name + ': expected ndim=' +
310 str(spec.ndim) + ', found ndim=' +
--> 311 str(K.ndim(x)))
312 if spec.max_ndim is not None:
313 ndim = K.ndim(x)
ValueError: Input 0 is incompatible with layer blstm_1: expected ndim=3, found ndim=2
LSTM + GloVe embeddings
embedding_layer = Embedding(nb_words,
                            EMBED_SIZE,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
seq_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
print("SEQ INP", seq_input, seq_input.shape)
embedded_seq = embedding_layer(seq_input)
print("EMD SEQ", embedded_seq.shape)

# Bi-directional LSTM  # (timesteps, n_features)
x_1 = Bidirectional(LSTM(units=NUM_LSTM_UNITS,
                         name='blstm_1',
                         dropout=DROP_RATE_LSTM,
                         recurrent_dropout=DROP_RATE_LSTM),
                    merge_mode='concat')(embedded_seq)
x_1 = Dropout(DROP_RATE_DENSE)(x_1)
x_1 = Dense(NUM_DENSE_UNITS, activation='relu')(x_1)
x_1 = Dropout(DROP_RATE_DENSE)(x_1)
OUTPUT (This works properly with LSTM)
SEQ INP Tensor("input_2:0", shape=(?, 150), dtype=int32) (?, 150)
EMD SEQ (?, 150, 300)
The Universal Sentence Encoder is different from word2vec or GloVe; it does not produce word-level embeddings:
The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
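To see what that means in practice, here is a minimal sketch (an assumption on my part, reusing the same TF1 hub.Module API as the code further below, with two made-up example sentences) showing that the encoder returns one 512-dimensional vector per sentence, i.e. a 2D array with no time dimension:

import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(["How are you?", "The weather is nice today."]))
print(vectors.shape)  # (2, 512): one sentence-level vector per input, not one per word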
The example above, where they use the Lambda function, is for a feed-forward neural network, where the input to the next layer is 2D, unlike an RNN or CNN (3D); a sketch of that feed-forward case follows.
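This is only a hedged sketch of the feed-forward setup (it reuses the UniversalEmbedding Lambda and constants from the question; the string input and NUM_CLASSES are my own assumptions): the 2D (batch, 512) output goes straight into a Dense layer, which is why no ndim error appears there.

text_input = Input(shape=(1,), dtype='string')                 # one raw sentence per sample
embedded = Lambda(UniversalEmbedding,
                  output_shape=(EMBED_SIZE,))(text_input)      # 2D output: (batch, 512)
dense = Dense(NUM_DENSE_UNITS, activation='relu')(embedded)    # Dense accepts 2D input
preds = Dense(NUM_CLASSES, activation='softmax')(dense)        # NUM_CLASSES is hypothetical
model = Model(inputs=text_input, outputs=preds)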
In short, what you have to do is compute the sentence embeddings beforehand and then feed them to your network instead of using an Embedding layer (a sketch of the feeding step follows after the function below):
import os
from os.path import join
import pickle
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

def process_text(sentences_list):
    path = './processed_data'
    embeddings_file = "embeddings-{}.pickle".format(len(sentences_list))
    if not os.path.isfile(join(path, embeddings_file)):
        module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
        embed = hub.Module(module_url)
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
            # One 512-dimensional sentence embedding per input sentence
            sentences_list = sess.run(embed(sentences_list))
        sentences_list = np.array(sentences_list)
        # Reshape each embedding to (512, 1) so it can be fed to a recurrent layer
        sentences_list = np.array([np.reshape(embedding, (len(embedding), 1)) for embedding in sentences_list])
        pickle.dump(sentences_list, open(join(path, embeddings_file), 'wb'))
    else:
        sentences_list = pickle.load(open(join(path, embeddings_file), 'rb'))
    return sentences_list
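With the embeddings pre-computed, here is a hedged sketch of how they could be fed to an LSTM (my own assumption, not the linked article's exact code: each (512, 1) embedding is treated as 512 timesteps of one feature, and train_sentences, train_labels and NUM_CLASSES are hypothetical names):

from keras.models import Model
from keras.layers import Input, Bidirectional, LSTM, Dense

x_train = process_text(train_sentences)               # shape: (num_samples, 512, 1)

seq_input = Input(shape=(EMBED_SIZE, 1))               # 3D input the LSTM expects
x = Bidirectional(LSTM(units=NUM_LSTM_UNITS,
                       dropout=DROP_RATE_LSTM))(seq_input)
x = Dense(NUM_DENSE_UNITS, activation='relu')(x)
preds = Dense(NUM_CLASSES, activation='softmax')(x)    # NUM_CLASSES target classes
model = Model(inputs=seq_input, outputs=preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, train_labels, epochs=10, batch_size=32)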
I recommend saving the generated embeddings, as I do in the example, because computing them takes some time.
Source: Sentiment Analysis on Twitter Data using Universal Sentence Encoder