How to use reuters-21578 dataset with svm.net for text classification?

I've just started work on a text classification application. I've read a lot of papers on the topic, but I still don't know how to start; I feel I don't have the whole picture. I have the training dataset and have read its description, and I've found a good implementation of the SVM algorithm (SVM.Net), but I don't know how to use the dataset with it. I know that I should extract features from the dataset's texts and use them as input to the SVM. Could anybody point me to a detailed tutorial on how to extract features from text, use them as input to the SVM algorithm, and then use the trained model to classify a new text? A full example of using SVM for text classification would be great.

Any help would be appreciated. Thanks in advance.

Mousa asked Jan 26 '26 06:01


1 Answer

Creating features for text classification can be as complex as you want it to be.

A simple approach is to map each distinct term to a feature index. You then represent each document as a vector of term frequencies, one entry per term. (You can also remove stop words, weight terms, and so on.) For text classification, you also attach a label to each vector.

For example, if the document was the sentence:

John loves Mary

with a label "spam".

Then you might have the following mapping:

John: 1
loves: 2
Mary: 3

Your vector then becomes:

1 1 2 1 3 1

(Each pair is a feature index followed by its weight; I have assumed that each feature has a weight of one.)
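As a rough illustration of the mapping above, here is a minimal Python sketch (not SVM.NET code). The function names `build_vocabulary` and `vectorize` are made up for this example, and splitting on whitespace is an assumed, very naive tokenizer; a real pipeline would normalise case, strip punctuation and so on.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign each distinct term a 1-based feature index, in order of first appearance."""
    vocabulary = {}
    for text in documents:
        for term in text.split():  # assumption: whitespace tokenization
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary) + 1
    return vocabulary


def vectorize(text, vocabulary):
    """Return sorted (feature index, term frequency) pairs for the known terms in text."""
    counts = Counter(term for term in text.split() if term in vocabulary)
    return sorted((vocabulary[term], count) for term, count in counts.items())


documents = ["John loves Mary"]
labels = ["spam"]  # one label per document

vocabulary = build_vocabulary(documents)
vectors = [vectorize(text, vocabulary) for text in documents]

print(vocabulary)  # {'John': 1, 'loves': 2, 'Mary': 3}
print(vectors[0])  # [(1, 1), (2, 1), (3, 1)]
```

Terms that were not seen when the vocabulary was built are simply dropped here, which is one common choice when vectorizing a new text at classification time.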

I don't know about SVM.NET, but most supervised machine learning methods will accept vector-based input.
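One widely used vector-based input is the sparse text format read by LIBSVM, where each line is a numeric label followed by `index:value` pairs in ascending index order. Whether SVM.NET reads exactly this format is an assumption you should check against its documentation. The sketch below is Python, and the helper `to_libsvm_line` and the label numbering (including the "ham" class) are hypothetical.

```python
def to_libsvm_line(label_id, pairs):
    """Format one document as 'label index:value index:value ...' with ascending indices."""
    features = " ".join(f"{index}:{weight}" for index, weight in sorted(pairs))
    return f"{label_id} {features}"


# Hypothetical numeric ids for the class labels; "ham" is invented for illustration.
label_ids = {"spam": 1, "ham": -1}

# The "John loves Mary" document as (feature index, weight) pairs.
line = to_libsvm_line(label_ids["spam"], [(1, 1), (2, 1), (3, 1)])
print(line)  # 1 1:1 2:1 3:1
```

Writing one such line per training document gives you a file you can try feeding to a LIBSVM-style trainer.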

Miles Osborne answered Jan 29 '26 12:01