I have trained an LDA topic model on a corpus using gensim.
Following the tutorial on the gensim website, I have this (not the whole code):
import re
from gensim import corpora

# punctuation_string, stoplist and lda (the trained model) are defined earlier
question = 'Changelog generation from Github issues?'
temp = question.lower()
for ch in punctuation_string:
    temp = temp.replace(ch, '')
# re.LOCALE is dropped: Python 3 rejects it on str patterns
words = re.findall(r'\w+', temp, flags=re.UNICODE)
important_words = [w for w in words if w not in stoplist]
print(important_words)
dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = dictionary.doc2bow(important_words)
print(dictionary)
print(ques_vec)
print(lda[ques_vec])
This is the output that I get:
['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]
I don't know how the last output is going to help me find the most likely topic for the question.
Please help!
Method 1: Try different values of k and select the one that gives the largest likelihood. Method 3: If HDP-LDA is infeasible on your corpus (because of the corpus size), take a uniform sample of the corpus, run HDP-LDA on that sample, and use the value of k that HDP-LDA gives.
To decide on a suitable number of topics, you can compare the goodness-of-fit of LDA models fit with varying numbers of topics. You can evaluate the goodness-of-fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents.
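As a rough illustration of the perplexity comparison described above, here is a minimal sketch in plain Python. It assumes you already have, for each candidate number of topics k, the total natural-log likelihood that the fitted model assigns to the held-out tokens. The function name perplexity, the dictionary held_out, and all of its numbers are made up for illustration; none of them come from gensim.

```python
import math

def perplexity(total_log_likelihood, n_tokens):
    """Per-token perplexity: exp of the negative average log-likelihood (natural log)."""
    return math.exp(-total_log_likelihood / n_tokens)

# Hypothetical held-out results: k -> (total log-likelihood, number of held-out tokens).
held_out = {
    10: (-7200.0, 1000),
    20: (-6900.0, 1000),
    40: (-7050.0, 1000),
}

scores = {k: perplexity(ll, n) for k, (ll, n) in held_out.items()}
best_k = min(scores, key=scores.get)  # lower perplexity means a better fit
print(best_k)
```

Lower perplexity is better, so under these invented numbers the sketch would pick the middle value of k. With a real model the log-likelihoods must all be computed on the same held-out documents to be comparable.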
LDA is loosely analogous to PCA, applied to text data. It can be viewed as decomposing the corpus document-word matrix (the larger matrix) into two smaller matrices: the document-topic matrix and the topic-word matrix. In that sense LDA, like PCA, is a matrix factorization technique, although it is a probabilistic model rather than a linear projection.
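To make the factorization view concrete, here is a toy sketch in plain Python. The two small matrices and all of their numbers are invented for illustration; a real LDA model infers them from the corpus rather than having them written down by hand.

```python
# Toy example: 2 documents, 2 topics, 3 vocabulary words (all numbers made up).
doc_topic = [            # rows: documents, columns: topics; each row sums to 1
    [0.9, 0.1],
    [0.2, 0.8],
]
topic_word = [           # rows: topics, columns: words; each row sums to 1
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# Each row is the word distribution the model implies for that document.
doc_word = matmul(doc_topic, topic_word)
for row in doc_word:
    print([round(p, 2) for p in row])
```

Because every row of both factors is a probability distribution, every row of the product is one as well, which is what separates this from an unconstrained factorization such as PCA.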
I have written a function in Python that returns the most likely topic for a new query:
import re
from gensim import corpora

# punctuation_string, stoplist and lda (the trained model) are assumed to be defined elsewhere
def getTopicForQuery(question):
    # Lower-case the query and strip punctuation
    temp = question.lower()
    for ch in punctuation_string:
        temp = temp.replace(ch, '')
    # Tokenise and drop stop words (re.LOCALE is dropped: Python 3 rejects it on str patterns)
    words = re.findall(r'\w+', temp, flags=re.UNICODE)
    important_words = [w for w in words if w not in stoplist]
    # Convert the tokens to a bag-of-words vector using the training dictionary
    dictionary = corpora.Dictionary.load('questions.dict')
    ques_vec = dictionary.doc2bow(important_words)
    # Topic distribution of the query: a list of (topic_id, probability) pairs
    topic_vec = lda[ques_vec]
    # Sort by probability, highest first
    topic_vec = sorted(topic_vec, key=lambda pair: pair[1], reverse=True)
    # Top word of the most probable topic
    final = lda.print_topic(topic_vec[0][0], 1)
    question_topic = final.split('*')  # format is like "probability*word"
    return question_topic[1]
Before going through this, do refer to this link!
In the initial part of the code, the query is pre-processed so that stop words and unnecessary punctuation are stripped from it.
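The pre-processing step can be sketched on its own, using only the standard library. This is an illustration under assumptions: STOPLIST is a made-up stop-word set, and string.punctuation stands in for punctuation_string, neither of which is shown in the original code.

```python
import re
import string

# Hypothetical stop-word list; the answer's real `stoplist` is not shown.
STOPLIST = {'from', 'the', 'a', 'of'}

def preprocess(question, stoplist=STOPLIST):
    """Lower-case, strip ASCII punctuation, tokenise, and drop stop words."""
    temp = question.lower()
    temp = temp.translate(str.maketrans('', '', string.punctuation))
    words = re.findall(r'\w+', temp)
    return [w for w in words if w not in stoplist]

print(preprocess('Changelog generation from Github issues?'))
```

Whatever pre-processing you use here should match what was applied to the training corpus; otherwise the query tokens may not line up with the entries in the dictionary.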
Then the dictionary that was built from our own database is loaded.
We then convert the tokens of the new query to a bag of words, and the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec], where lda is the trained model as explained in the link referred to above.
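For intuition, the bag-of-words conversion amounts to counting token ids. The sketch below imitates that idea in plain Python; it is not gensim's implementation, and the token2id mapping and the helper name to_bow are hypothetical.

```python
from collections import Counter

# Hypothetical token -> id mapping; a real gensim Dictionary builds this from the corpus.
token2id = {'changelog': 0, 'github': 1, 'issues': 2}

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the mapping."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

bow = to_bow(['changelog', 'generation', 'github', 'issues', 'github'], token2id)
print(bow)
```

Tokens that are missing from the mapping ('generation' in this toy case) are simply dropped, so a query made up mostly of words the model never saw yields a very short vector.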
The distribution is then sorted by topic probability. The function returns question_topic[1], which is the top word of the topic with the highest probability.
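Applied to the output shown in the question, the ranking step looks like this. The sketch uses only the (topic_id, probability) pairs copied from that output, so it runs without a trained model.

```python
# Topic distribution copied from the output shown in the question.
topic_vec = [(4, 0.20400000000000032), (11, 0.20400000000000032),
             (19, 0.20263215848547525), (29, 0.20536784151452539)]

# Sort by probability, highest first, and take the first pair.
ranked = sorted(topic_vec, key=lambda pair: pair[1], reverse=True)
best_topic_id, best_prob = ranked[0]
print(best_topic_id, best_prob)
```

Here topic 29 comes out on top, but all four probabilities are close to 0.20, so the model's preference for it is weak for this particular query.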