I want to extract potential sentences from news articles which can be part of article summary.
Upon spending some time, I found out that this can be achieved in two ways,
Reference: rare-technologies.com
I followed abigailsee's Get To The Point: Summarization with Pointer-Generator Networks for summarization which was producing good results with the pre-trained model but it was abstractive.
The Problem: Most of the extractive summarizers that I have looked so far(PyTeaser, PyTextRank and Gensim) are not based on Supervised learning but on Naive Bayes classifier, tf–idf, POS-tagging, sentence ranking based on keyword-frequency, position etc., which don't require any training.
Few things that I have tried so far to extract potential summary sentences.
from keras.preprocessing.text import Tokenizer with Vocabulary size of 20000 and pad all sequences to average length of all sentences.model_lstm = Sequential()
model_lstm.add(Embedding(20000, 100, input_length=sentence_avg_length))
model_lstm.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model_lstm.add(Dense(1, activation='sigmoid'))
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
This is giving very low accuracy ~0.2
I think this is because the above model is more suitable for positive/negative sentences rather than summary/non-summary sentences classification.
Any guidance on approach to solve this problem would be appreciated.
I think this is because the above model is more suitable for positive/negative sentences rather than summary/non-summary sentences classification.
That's right. The above model is used for binary classification, not text summarization. If you notice, the output (Dense(1, activation='sigmoid')) only gives you a score between 0-1 while in text summarization we need a model that generates a sequence of tokens.
What should I do?
The dominant idea to tackle this problem is encoder-decoder (also known as seq2seq) models. There is a nice tutorial on Keras repository which used for Machine translation but it is fairly easy to adapt it for text summarization.
The main part of the code is:
from keras.models import Model
from keras.layers import Input, LSTM, Dense
# Define an input sequence and process it.
encoder_inputs = Input(shape=(None, num_encoder_tokens))
encoder = LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
# We discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None, num_decoder_tokens))
# We set up our decoder to return full output sequences,
# and to return internal states as well. We don't use the
# return states in the training model, but we will use them in inference.
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs,
initial_state=encoder_states)
decoder_dense = Dense(num_decoder_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
# Run training
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
batch_size=batch_size,
epochs=epochs,
validation_split=0.2)
Based on the above implementation, it is necessary to pass encoder_input_data, decoder_input_data and decoder_target_data to model.fit() which respectively are input text and summarize version of the text.
Note that, decoder_input_data and decoder_target_data are the same things except that decoder_target_data is one token ahead of decoder_input_data.
This is giving very low accuracy ~0.2
I think this is because the above model is more suitable for positive/negative sentences rather than summary/non-summary sentences classification.
The low accuracy performance caused by various reasons including small training size, overfitting, underfitting and etc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With