I was reading about the TfidfVectorizer implementation in scikit-learn, and I don't understand the output of the method. For example:
    # tfidf_vectorizer is an already fitted TfidfVectorizer
    new_docs = ['He watches basketball and baseball',
                'Julie likes to play basketball',
                'Jane loves to play baseball']
    new_term_freq_matrix = tfidf_vectorizer.transform(new_docs)
    print(tfidf_vectorizer.vocabulary_)
    print(new_term_freq_matrix.todense())

Output:
    {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}
    [[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
     [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
     [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]

What is this (e.g. u'me': 8)?
    {u'me': 8, u'basketball': 1, u'julie': 4, u'baseball': 0, u'likes': 5, u'loves': 7, u'jane': 3, u'linda': 6, u'more': 9, u'than': 10, u'he': 2}

Is this a matrix or just a vector? I can't understand what this output is telling me:
    [[ 0.57735027  0.57735027  0.57735027  0.          0.          0.          0.          0.          0.          0.          0.        ]
     [ 0.          0.68091856  0.          0.          0.51785612  0.51785612  0.          0.          0.          0.          0.        ]
     [ 0.62276601  0.          0.          0.62276601  0.          0.          0.          0.4736296   0.          0.          0.        ]]

Could anybody explain these outputs in more detail?
Thanks!
Scikit-learn's TfidfTransformer and TfidfVectorizer aim at the same result: a matrix of TF-IDF features for a collection of documents. They differ in their input, which can be confusing: TfidfVectorizer takes the raw documents, while TfidfTransformer takes a matrix of term counts, such as the one produced by CountVectorizer.
TfidfVectorizer takes into account how a word is distributed over the whole collection of documents. This helps in dealing with very frequent words: each word count is weighted by a measure of how often the word appears across the documents, so words that occur in many documents are penalized.
It converts a collection of raw documents to a matrix of TF-IDF features. As tf–idf is very often used for text features, the class TfidfVectorizer combines all the options of CountVectorizer and TfidfTransformer into a single model.
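The equivalence described above can be checked with a short sketch. The corpus here is made up for illustration and is not the one from the question; both routes use scikit-learn's default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical corpus, used only for illustration.
docs = [
    "He likes basketball",
    "Julie likes baseball",
    "Jane loves baseball more than basketball",
]

# Two-step route: raw term counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once.
one_step = TfidfVectorizer().fit_transform(docs)

# With default settings the two matrices should match.
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

If you already have a count matrix, the two-step route avoids tokenizing the text again; otherwise TfidfVectorizer is the shorter option.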
TF-IDF stands for term frequency-inverse document frequency and it is a measure, used in the fields of information retrieval (IR) and machine learning, that can quantify the importance or relevance of string representations (words, phrases, lemmas, etc) in a document amongst a collection of documents (also known as a ...
TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator.
vocabulary_ is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own feature index.
What is this (e.g. u'me': 8)?
It tells you that the token 'me' is represented as feature number 8 in the output matrix.
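As a small sketch of how to read this mapping, the dictionary from the question can be inverted to recover which token each column stands for. It uses plain Python only; the dictionary is copied from the output above, without the Python 2 `u` prefixes.

```python
# Vocabulary copied from the question's output.
vocabulary = {
    'me': 8, 'basketball': 1, 'julie': 4, 'baseball': 0, 'likes': 5,
    'loves': 7, 'jane': 3, 'linda': 6, 'more': 9, 'than': 10, 'he': 2,
}

# Sort the tokens by feature index to recover the column order of the matrix.
columns = sorted(vocabulary, key=vocabulary.get)

print(vocabulary['me'])  # column index of the token 'me'
print(columns[8])        # token stored in column 8
print(columns)
```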
is this a matrix or just a vector?
Each sentence is a vector; the three sentences you've entered form a matrix with 3 row vectors. The numbers (weights) in each vector are the tf-idf scores of the features. For example, 'julie': 4 tells you that every sentence in which 'Julie' appears has a non-zero tf-idf weight at index 4. As you can see in the 2nd vector:
[ 0. 0.68091856 0. 0. 0.51785612 0.51785612 0. 0. 0. 0. 0. ]
The 5th element (index 4) is 0.51785612, the tf-idf score for 'Julie'. For more info about tf-idf scoring, read here: http://en.wikipedia.org/wiki/Tf%E2%80%93idf
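The lookup just described can be sketched in plain Python. The vocabulary and matrix are copied from the question's output; `tfidf_score` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
# Vocabulary and matrix copied from the question's output.
vocabulary = {
    'me': 8, 'basketball': 1, 'julie': 4, 'baseball': 0, 'likes': 5,
    'loves': 7, 'jane': 3, 'linda': 6, 'more': 9, 'than': 10, 'he': 2,
}
matrix = [
    [0.57735027, 0.57735027, 0.57735027, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.68091856, 0.0, 0.0, 0.51785612, 0.51785612, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.62276601, 0.0, 0.0, 0.62276601, 0.0, 0.0, 0.0, 0.4736296, 0.0, 0.0, 0.0],
]

def tfidf_score(doc_index, token):
    """Return the score of `token` in document `doc_index` (0.0 if unknown)."""
    column = vocabulary.get(token)
    if column is None:
        return 0.0  # token is not in the vocabulary, so it has no column
    return matrix[doc_index][column]

print(tfidf_score(1, 'julie'))  # 2nd sentence mentions Julie
print(tfidf_score(0, 'julie'))  # 1st sentence does not
```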
So the tf-idf vectorizer builds its own vocabulary from the entire set of documents it was fitted on, which is what the first line of output shows (sorted here for easier reading):
    {u'baseball': 0, u'basketball': 1, u'he': 2, u'jane': 3, u'julie': 4, u'likes': 5, u'linda': 6, u'loves': 7, u'me': 8, u'more': 9, u'than': 10}

A document is then transformed into its tf-idf vector. Document:
He watches basketball and baseball
and its output,
[ 0.57735027 0.57735027 0.57735027 0. 0. 0. 0. 0. 0. 0. 0. ]
lines up, position by position, with the sorted vocabulary,
[baseball basketball he jane julie likes linda loves me more than]
Our document contains only three words from the vocabulary: baseball, basketball and he ('watches' and 'and' are not in the vocabulary, so they are ignored). The output vector therefore has non-zero tf-idf values only for these three words, each at its position in the sorted vocabulary.
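To make the alignment concrete, here is a minimal sketch that labels the non-zero entries of the first row with their tokens. The data is copied from the question; the lowercase-and-split tokenization is a simplification assumed for this illustration, not what the vectorizer does internally. The last lines check the length of the row, which comes out very close to 1; that is consistent with scikit-learn's default `norm='l2'` setting and explains why three equally weighted words each get about 0.577.

```python
import math

# Vocabulary and first row copied from the question's output.
vocabulary = {
    'me': 8, 'basketball': 1, 'julie': 4, 'baseball': 0, 'likes': 5,
    'loves': 7, 'jane': 3, 'linda': 6, 'more': 9, 'than': 10, 'he': 2,
}
row = [0.57735027, 0.57735027, 0.57735027, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
document = 'He watches basketball and baseball'

# Column order of the matrix: tokens sorted by feature index.
columns = sorted(vocabulary, key=vocabulary.get)

# Keep only the tokens that actually received a weight.
weights = {token: w for token, w in zip(columns, row) if w != 0.0}

# Simplified tokenization (assumption): lowercase and split on whitespace.
unknown = [w for w in document.lower().split() if w not in vocabulary]

# Euclidean length of the row vector.
length = math.sqrt(sum(w * w for w in row))

print(weights)
print(unknown)
print(round(length, 6))
```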
tf-idf is used for tasks such as document classification and ranking in search engines. tf: term frequency (the number of times a vocabulary word occurs in the document). idf: inverse document frequency (how rare the word is across the whole collection; words that appear in fewer documents get a higher weight).
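The two ingredients can be combined by hand in a few lines. This is a sketch under stated assumptions: the corpus is made up and pre-tokenized, and the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization is the one documented for scikit-learn's default settings; other libraries and textbooks use slightly different variants.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized corpus used only for illustration.
docs = [
    ['he', 'likes', 'basketball'],
    ['julie', 'likes', 'baseball'],
    ['jane', 'loves', 'baseball'],
]
n_docs = len(docs)

# df: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (assumed scikit-learn default).
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # tf: raw count of each term in this document.
    tf = Counter(doc)
    raw = {term: count * idf(term) for term, count in tf.items()}
    # Scale the vector to unit Euclidean length (L2 normalization).
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {term: v / norm for term, v in raw.items()}

scores = tfidf(docs[0])
print(scores)
```

In the first document, 'likes' also occurs in another document, so it ends up with a lower score than 'he' and 'basketball', which occur only there.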