I think machine learning is interesting, and I am studying the scikit-learn documentation for fun. Below I have done some data cleaning, and I want to use grid search to find the best values for the parameters.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
cats = ['sci.space','rec.autos','rec.motorcycles']
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'), categories=cats)
newsgroups_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'), categories=cats)
vectorizer = TfidfVectorizer(stop_words="english")
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)
clf = SVC(C=0.4, gamma=1, kernel='linear')  # note: gamma is ignored by the linear kernel
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)
print(accuracy_score(newsgroups_test.target, pred))
The accuracy is: 0.849
I have heard that grid search can be used to find the optimal parameter values, but I can't understand how to perform it. Can you please elaborate? This is what I tried, but it is not correct. I would like to learn the correct way, along with some explanation. Thanks.
Cs = np.array([0.001, 0.01, 0.1, 1, 10])
gammas = np.array([0.001, 0.01, 0.1, 1])
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=dict(Cs=alphas,gamma=gammas))
grid.fit(newsgroups_train.data, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
parameters = {'C': [1, 10],
'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)
It returns:
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False),
fit_params=None, iid='warn', n_jobs=None,
param_grid={'C': [1, 10], 'gamma': [0.001, 0.01, 1]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=0)
0.8532212885154061
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
I need clarification on these points:
1) What is actually displayed in the results?
2) Does it try the whole range of C from 1 to 10, or only the values 1 and 10?
3) Can you suggest anything to improve the accuracy further?
4) I noticed that Tfidf made the accuracy worse, even though it cleaned the data of words that don't have any value.
You want to pass a dictionary of parameters where each key is the name of a parameter exactly as defined in the model's documentation (1), for example 'C' and 'gamma' for SVC. Each value should be a list of the values you would like to try.
The grid search will then fit and score the model for every possible combination of those parameters. There are some good examples in the documentation (2).
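To see which combinations get tried, here is a minimal sketch using scikit-learn's `ParameterGrid` helper, which expands a parameter dictionary the same way the grid search does. The `parameters` dictionary is the one from the question; the variable name `combos` is just for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the question: 2 values of C times 3 values of gamma.
parameters = {"C": [1, 10], "gamma": [0.001, 0.01, 1]}

# Expand the dictionary into the list of candidate settings.
combos = list(ParameterGrid(parameters))
for combo in combos:
    print(combo)

print(len(combos))  # 2 * 3 = 6 candidates
```

Only the listed values are tried, so C is either 1 or 10 and never anything in between. To search more values, add them to the list.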
For your script, you also want to make sure that you are feeding the grid search the vectorized training data: in this case 'vectors', not the raw text in 'newsgroups_train.data'.
See below:
parameters = {'C': [1, 10],
'gamma': [0.001, 0.01, 1]}
model = SVC()
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(vectors, newsgroups_train.target)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_)
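As a variation on the code above, here is a rough sketch that wraps the vectorizer and the classifier in a `Pipeline`, so raw text can be passed to `fit` and the TF-IDF step is refit inside each cross-validation fold. The toy corpus, the labels, the step names `tfidf` and `svc`, and `cv=3` are assumptions made to keep the example small and self-contained; with the newsgroups data you would pass `newsgroups_train.data` and `newsgroups_train.target` instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 6 space documents (label 0), 6 car documents (label 1).
docs = [
    "the rocket reached orbit", "nasa launched a new satellite",
    "the shuttle docked with the station", "astronauts trained for the moon mission",
    "the telescope observed a distant galaxy", "the probe landed on mars",
    "the car needs new brake pads", "my engine leaks oil",
    "the sedan has a manual gearbox", "he changed the tires on his truck",
    "the dealer sold me a used car", "the engine overheated on the highway",
]
labels = [0] * 6 + [1] * 6

# Vectorizer and classifier chained into one estimator.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svc", SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {"svc__C": [1, 10], "svc__gamma": [0.001, 0.01, 1]}

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(docs, labels)

# Best combination and its mean cross-validated accuracy.
print(grid.best_params_)
print(grid.best_score_)

# Mean cross-validated accuracy for every combination that was tried.
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
```

Here `best_score_` is the mean cross-validated accuracy on the training folds, not the accuracy on the held-out test set, so it is worth scoring the refit model on the test data separately.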
Please accept the answer if it works. Good luck!