 

Generating Input for LSTM from universal sentence encoder output

I am working on a multi-class classification problem using an LSTM and embeddings obtained from the Universal Sentence Encoder.

Previously I was using GloVe embeddings, which gave me the required input shape for the LSTM, (batch_size, timesteps, input_dim). Now I plan to use the Universal Sentence Encoder, but I found that its output is 2D, [batch, feature]. How can I make the required changes?

LSTM + Universal Sentence Encoder

EMBED_SIZE = 512

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
embed = hub.Module(module_url)

def UniversalEmbedding(x):
    return embed(tf.squeeze(tf.cast(x, tf.string)), 
            signature="default", as_dict=True)["default"]


seq_input = Input(shape=(MAX_SEQUENCE_LENGTH,),dtype='int32')
print("seq i",seq_input.shape,seq_input)

embedded_seq = Lambda(UniversalEmbedding,                          
                          output_shape=(EMBED_SIZE,))(seq_input)
print("EMD SEQ",embedding.shape,type(embedded_seq))

# (timesteps, n_features) (,MAX_SEQUENCE_LENGTH, EMBED_SIZE) (,150,512)
x_1 = LSTM(units=NUM_LSTM_UNITS,
                            name='blstm_1',
                        dropout=DROP_RATE_LSTM)(embedded_seq)
print(x_1)

This produces the following error:

seq i (?, 150) Tensor("input_8:0", shape=(?, 150), dtype=int32)
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
I0529 07:24:32.504808 140127577749376 saver.py:1483] Saver not created because there are no variables in the graph to restore
EMD SEQ (?, 512) <class 'tensorflow.python.framework.ops.Tensor'>
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-34-ea634319205b> in <module>()
     12 x_1 = LSTM(units=NUM_LSTM_UNITS,
     13                             name='blstm_1',
---> 14                         dropout=DROP_RATE_LSTM)(embedded_seq)
     15 print(x_1)
     16 

2 frames
/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py in assert_input_compatibility(self, inputs)
    309                                      self.name + ': expected ndim=' +
    310                                      str(spec.ndim) + ', found ndim=' +
--> 311                                      str(K.ndim(x)))
    312             if spec.max_ndim is not None:
    313                 ndim = K.ndim(x)

ValueError: Input 0 is incompatible with layer blstm_1: expected ndim=3, found ndim=2

LSTM + GloVe embeddings

embedding_layer = Embedding(nb_words,
                            EMBED_SIZE, 
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)

seq_input = Input(shape=(MAX_SEQUENCE_LENGTH,),dtype='int32')
print("SEQ INP",seq_input,seq_input.shape)
embedded_seq = embedding_layer(seq_input)
print("EMD SEQ",embedded_seq.shape)

# Bi-directional LSTM  # (timesteps, n_features)
x_1 = Bidirectional(LSTM(units=NUM_LSTM_UNITS,
                         name='blstm_1',
                         dropout=DROP_RATE_LSTM,
                         recurrent_dropout=DROP_RATE_LSTM),
                    merge_mode='concat')(embedded_seq)
x_1 = Dropout(DROP_RATE_DENSE)(x_1)
x_1 = Dense(NUM_DENSE_UNITS,activation='relu')(x_1)
x_1 = Dropout(DROP_RATE_DENSE)(x_1)

OUTPUT (This works properly with LSTM)

SEQ INP Tensor("input_2:0", shape=(?, 150), dtype=int32) (?, 150)
EMD SEQ (?, 150, 300)
asked by joel


1 Answer

The Universal Sentence Encoder is different from word2vec or GloVe; it does not produce word-level embeddings:

The model is trained and optimized for greater-than-word length text, such as sentences, phrases or short paragraphs. It is trained on a variety of data sources and a variety of tasks with the aim of dynamically accommodating a wide variety of natural language understanding tasks. The input is variable length English text and the output is a 512 dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the example notebook made available. The universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.
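To see this concretely, here is a minimal sketch (TF 1.x with tensorflow_hub; the two example sentences are made up) that runs the encoder directly and prints its 2D output shape:

import tensorflow as tf
import tensorflow_hub as hub

# hypothetical example sentences
sentences = ["I like my phone.", "The weather is nice today."]

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/3")
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    embeddings = sess.run(embed(sentences))

print(embeddings.shape)  # (2, 512): one 512-d vector per sentence, no time dimension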

The example above where they use the Lambda function is meant for a feed-forward neural network, whose next layer takes 2D input, unlike an RNN or CNN, which expects 3D input.
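For illustration, a minimal sketch of that feed-forward setup, reusing UniversalEmbedding and EMBED_SIZE from the question (the string input shape and NUM_CLASSES are assumptions, not the question's code):

from keras.layers import Input, Dense, Lambda
from keras.models import Model

# one raw sentence string per sample (assumed input format, not int token ids)
text_input = Input(shape=(1,), dtype='string')
embedded = Lambda(UniversalEmbedding, output_shape=(EMBED_SIZE,))(text_input)  # (batch, 512)

# a Dense layer accepts this 2D tensor directly; an LSTM would not
x = Dense(256, activation='relu')(embedded)
preds = Dense(NUM_CLASSES, activation='softmax')(x)  # NUM_CLASSES: hypothetical number of classes
model = Model(text_input, preds)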

In short, you have to prepare (embed) your text beforehand and then feed the pre-computed embeddings to your network, instead of embedding inside the model:

import os
import pickle
from os.path import join

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub


def process_text(sentences_list):
    path = './processed_data'
    embeddings_file = "embeddings-{}.pickle".format(len(sentences_list))
    if not os.path.isfile(join(path, embeddings_file)):
        module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/3"
        embed = hub.Module(module_url)
        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
            sentences_list = sess.run(embed(sentences_list))
        sentences_list = np.array(sentences_list)
        # reshape each 512-d sentence vector to (512, 1) so a batch becomes 3D
        sentences_list = np.array([np.reshape(embedding, (len(embedding), 1)) for embedding in sentences_list])
        pickle.dump(sentences_list, open(join(path, embeddings_file), 'wb'))
    else:
        sentences_list = pickle.load(open(join(path, embeddings_file), 'rb'))
    return sentences_list
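With each sentence reshaped to (512, 1), a batch of sentences is 3D, so it can go straight into an LSTM. A minimal sketch of how the output of process_text could be wired up, reusing the constants from the question (train_sentences and NUM_CLASSES are placeholders):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

X_train = process_text(train_sentences)  # shape: (num_samples, EMBED_SIZE, 1) = (num_samples, 512, 1)

model = Sequential()
model.add(LSTM(NUM_LSTM_UNITS, dropout=DROP_RATE_LSTM, input_shape=(EMBED_SIZE, 1)))
model.add(Dropout(DROP_RATE_DENSE))
model.add(Dense(NUM_CLASSES, activation='softmax'))  # NUM_CLASSES: placeholder for the number of classes
model.compile(loss='categorical_crossentropy', optimizer='adam')

Treating the 512 embedding dimensions as timesteps of length-1 features is a bit unusual, but it is exactly what the reshape inside process_text sets up.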

I recommend saving the generated embeddings, as done in the example, because retrieving them takes quite some time.

Source: Sentiment Analysis on Twitter Data using Universal Sentence Encoder

answered by Minions