I have some questions about the TfidfVectorizer.
It is unclear to me how the words are selected. We can set a minimum support (min_df), but beyond that, what decides which features are selected (e.g. does higher support mean a higher chance)? If we set max_features=10000, do we always get the same features? If we set max_features=12000, will we get the same 10000 features plus an extra 2000?
Also, is there a way to extend the, say, max_features=20000 features? I fit it on some text, but I know of some words that should definitely be included, as well as some emoticons such as ":-)". How can I add these to the TfidfVectorizer object so that I can still use it to fit and predict?
to_include = [":-)", ":-P"]
method = TfidfVectorizer(max_features=20000, ngram_range=(1, 3),
                      # I know stopwords, but how about include words?
                      stop_words=test.stoplist[:100], 
                      # include words ??
                      analyzer='word',
                      min_df=5)
method.fit(traindata)
X = method.transform(traindata)
X
<Nx20002 sparse matrix of type '<class 'numpy.float64'>'
 with 1135520 stored elements in Compressed Sparse Row format>
where N is the sample size
Given how the TF-IDF score is defined, removing stopwords should not make a significant difference. The point of the IDF term is precisely to down-weight words that appear in most documents and therefore carry little semantic value. If you keep the stopwords, IDF will not delete them, but it will give them low weights.
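A minimal sketch of that effect, on a made-up three-document corpus (the corpus and variable names are illustrative, not from the question): a word that occurs in every document keeps a nonzero weight, but a smaller one than a rarer word in the same document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration): "the" occurs in every document.
corpus = ["the cat sat", "the dog barked", "the bird sang"]

vect = TfidfVectorizer()  # no stop_words list, so "the" stays in the vocabulary
X = vect.fit_transform(corpus)

# Weights of two terms in the first document.
row = X[0].toarray().ravel()
w_the = row[vect.vocabulary_["the"]]
w_cat = row[vect.vocabulary_["cat"]]

print(f"the: {w_the:.3f}  cat: {w_cat:.3f}")
```

Whether this is enough to skip an explicit stop list depends on the corpus and the model, so treat it as an illustration rather than a rule.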
TfidfTransformer and TfidfVectorizer aim at the same result: a matrix of TF-IDF features. The difference is that TfidfTransformer works on a matrix of word counts, so you first compute the counts (for example with CountVectorizer), then fit the IDF values and compute the TF-IDF scores as separate steps, whereas TfidfVectorizer goes from raw documents to TF-IDF scores in one step.
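The two routes can be compared directly. The sketch below uses a made-up corpus and default settings for all three classes, and checks that both routes produce the same matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus (an assumption for illustration).
corpus = ["the cat sat", "the dog barked loudly", "a bird sang"]

# Two-step route: raw counts first, then TF-IDF weighting.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: raw documents straight to TF-IDF.
one_step = TfidfVectorizer().fit_transform(corpus)

same = np.allclose(two_step.toarray(), one_step.toarray())
print("identical output:", same)
```

The two-step route is mainly useful when you also need the raw counts for something else.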
How would you get the IDF value of a particular term, for example "not"? Fit a TfidfVectorizer with use_idf=True (the default) and read the value from its idf_ attribute. The vocabulary_ attribute gives you the mapping from each term to its feature index, which is the position of that term in idf_.
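As a sketch of that lookup, assuming a made-up corpus and a hypothetical helper named idf_of (neither is from the original post):

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption for illustration); "not" occurs in 2 of 3 documents.
corpus = ["this is not good", "this is good", "not bad at all"]
vect = TfidfVectorizer(use_idf=True).fit(corpus)


def idf_of(vectorizer, term):
    """Return the fitted IDF weight of `term` (hypothetical helper)."""
    column = vectorizer.vocabulary_[term]  # term -> feature index
    return vectorizer.idf_[column]         # feature index -> IDF weight


idf_not = idf_of(vect, "not")
print("IDF('not') =", idf_not)
```

With the default smooth_idf=True, scikit-learn computes ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. A term that is not in the vocabulary raises a KeyError in this helper.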
TF-IDF is often preferred over raw counts (CountVectorizer) because it accounts not only for how often a word occurs in a document but also for how informative that word is across the corpus. Words with low weights can then be dropped, which reduces the input dimensionality and makes model building simpler.
You are asking several separate questions. Let me answer them separately:
"It is unclear to me how the words are selected."
From the documentation:
max_features : optional, None by default
    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.
All the features (in your case unigrams, bigrams and trigrams) that pass the min_df filter are ranked by their total frequency across the entire corpus, and the top max_features (e.g. 10000) are kept. The less frequent ones are discarded.
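To make the ranking concrete, here is a small sketch that reproduces the selection by hand. The corpus is made up and chosen so that no two terms have the same total count; how ties at the cutoff are resolved is not covered here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (an assumption): totals are apple=5, banana=3, cherry=2, date=1.
corpus = [
    "apple apple apple banana",
    "apple banana banana cherry",
    "apple cherry date",
]

# Count every term over the whole corpus.
counter = CountVectorizer()
X = counter.fit_transform(corpus)
totals = np.asarray(X.sum(axis=0)).ravel()
terms = sorted(counter.vocabulary_, key=counter.vocabulary_.get)

# Manual top-2 by total term frequency.
manual_top2 = {terms[i] for i in np.argsort(-totals)[:2]}

# What max_features=2 actually keeps.
kept = set(TfidfVectorizer(max_features=2).fit(corpus).vocabulary_)

print(manual_top2, kept)
```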
"If we say max_features = 10000, do we always get the same? If we say max_features = 12000, will we get the same 10000 features, but an extra added 2000?"
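Yes. The process is deterministic: for a given corpus and a given max_features, you will always get the same features. And because the features are ranked by frequency, the 12000 features will be the same 10000 plus the next 2000 in that ranking.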
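A quick way to check this on your own data is to fit twice with different limits and compare the vocabularies. The sketch below does so on a made-up corpus with distinct term frequencies, using much smaller limits than in the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (an assumption): totals are apple=5, banana=3, cherry=2, date=1.
corpus = [
    "apple apple apple banana",
    "apple banana banana cherry",
    "apple cherry date",
]


def kept_features(limit):
    """Return the set of terms kept for a given max_features (illustrative helper)."""
    return set(TfidfVectorizer(max_features=limit).fit(corpus).vocabulary_)


small = kept_features(2)
large = kept_features(3)

print("repeatable:", small == kept_features(2))
print("nested:", small <= large, "extra:", large - small)
```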
"I fit it on some text, but I know of some words that should be included for sure, [...] How to add these to the TfidfVectorizer object?"
You use the vocabulary parameter to specify what features should be used. For example, if you want only emoticons to be extracted, you can do the following:
from sklearn.feature_extraction.text import TfidfVectorizer

emoticons = {":)": 0, ":P": 1, ":(": 2}
vect = TfidfVectorizer(vocabulary=emoticons)
matrix = vect.fit_transform(traindata)
This will return a <Nx3 sparse matrix of type '<class 'numpy.float64'>' with M stored elements in Compressed Sparse Row format>. Notice there are only 3 columns, one for each feature.
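One caveat worth checking: the default token_pattern only matches runs of two or more word characters, and lowercase=True lowercases the text before matching, so emoticons such as ":)" or ":P" may never be found even when they are in the vocabulary. A minimal sketch, assuming whitespace-separated tokens are acceptable for your data (the toy traindata and the token_pattern r"\S+" are assumptions, not part of the original answer):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

traindata = ["great day :)", "oh no :(", "silly :P :P"]  # toy stand-in
emoticons = [":)", ":P", ":("]

# Default tokenization: punctuation-only tokens are dropped.
default = TfidfVectorizer(vocabulary=emoticons, lowercase=False)

# Whitespace tokenization (an assumption): every non-space run is a token.
custom = TfidfVectorizer(
    vocabulary=emoticons, lowercase=False, token_pattern=r"\S+"
)

n_default = default.fit_transform(traindata).nnz
n_custom = custom.fit_transform(traindata).nnz

print("stored elements:", n_default, "vs", n_custom)
```

A whitespace pattern also keeps punctuation attached to words ("day," stays one token), so a custom tokenizer may be a better fit for real text.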
If you want the vocabulary to include the emoticons as well as the N most common features, you could calculate the most frequent features first, then merge them with the emoticons and re-vectorize like so:
# calculate the most frequent features first
# (no vocabulary here: max_features is ignored when a vocabulary is given)
vect = TfidfVectorizer(max_features=10)
vect.fit(traindata)
top_features = dict(vect.vocabulary_)
n = len(top_features)
# insert the emoticons into the vocabulary of common features
emoticons = {":)": 0, ":P": 1, ":(": 2}
for feature, index in emoticons.items():
    top_features[feature] = n + index
# re-vectorize using both sets of features
# at this point len(top_features) == 13
# note: the default token_pattern only matches runs of 2+ word characters,
# so pass a token_pattern or tokenizer that keeps emoticons as tokens
vect = TfidfVectorizer(vocabulary=top_features)
matrix = vect.fit_transform(traindata)
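A variant of the same idea that avoids index bookkeeping: the vocabulary parameter also accepts a plain iterable of terms, so the two sets can simply be merged and the indices left to scikit-learn. This is a sketch under assumptions: the toy traindata, the shared settings dict, and the whitespace token_pattern are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real training data.
traindata = ["good movie :)", "bad movie :(", "good good fun :P", "movie night"]
emoticons = [":)", ":P", ":("]

# Settings shared by both passes (assumption: whitespace tokens, case kept).
settings = dict(token_pattern=r"\S+", lowercase=False)

# Pass 1: learn the most frequent features.
top = TfidfVectorizer(max_features=2, **settings).fit(traindata)

# Pass 2: merge with the must-have terms; the set union removes duplicates.
vocab = sorted(set(top.vocabulary_) | set(emoticons))
vect = TfidfVectorizer(vocabulary=vocab, **settings)
matrix = vect.fit_transform(traindata)

print(matrix.shape, vect.vocabulary_)
```

Because the union removes duplicates, an emoticon that already made it into the top features does not produce a repeated or missing index.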