How to use the Reuters-21578 dataset with SVM.NET to classify text?

Question

I have just started working on a text classification application. I have read many articles on the topic, but I still don't know how to start; I feel I don't have the whole picture. I have a training dataset (Reuters-21578), have read its description, and have found a good implementation of the SVM algorithm (SVM.NET), but I don't know how to use the dataset with this implementation. I know that I have to extract features from the dataset's texts and use those features as input to the SVM. Could anyone point me to a detailed guide on how to extract text features, use them as input to an SVM, and then use the trained SVM to classify new text? A complete example of using an SVM to classify text would be great.

Any help would be greatly appreciated. Thanks in advance.

+3

machine-learning nlp svm document-classification

Mousa May 23 '11 at 12:39

1 answer

Miles Osborne · Accepted Answer · 2011-05-23T13:24:58+0000

Creating features for text classification can be as complex as you want.

A simple approach is to map each distinct term to a feature index. You then represent each document as a vector of term frequencies. (You can also remove stop words, weight the terms, and so on.) For text classification, you also assign a label to each vector.
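
As a minimal sketch of the counting step, the Python snippet below lowercases a text, splits it into word tokens, drops stop words, and counts the remaining terms. The function name `term_frequencies`, the tiny `STOP_WORDS` set, and the tokenisation rule are assumptions made for illustration; they are not part of SVM.NET or of the Reuters-21578 distribution.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}


def term_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, extract word tokens, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


counts = term_frequencies("The price of oil and the price of gold")
```

Here `counts` maps each remaining term to how often it occurs in the text. Term weighting (for example tf-idf) could be applied to these raw counts afterwards.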

For example, suppose the document is the sentence:

John loves Mary

labeled spam.

Then you may have the following mapping:

John: 1
loves: 2
Mary: 3

Then your vector will look like this:

1 1 2 1 3 1

(Each pair of numbers is a feature index followed by its weight; I have assumed that each feature has a weight of one.)

With SVM.NET, I would expect the vectors to have a similar format.
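
The mapping and vector from the example can be sketched in Python as follows. This is an illustration under stated assumptions: the helper names `build_vocabulary` and `to_sparse_line` are hypothetical, the numeric label 1 for "spam" is made up, and the output uses the LIBSVM text convention `<label> <index>:<value> ...` with ascending indices. Whether SVM.NET reads exactly this format should be checked against its own documentation.

```python
def build_vocabulary(documents):
    """Assign each distinct term a 1-based feature index in first-seen order."""
    vocabulary = {}
    for text in documents:
        for term in text.split():
            if term not in vocabulary:
                vocabulary[term] = len(vocabulary) + 1
    return vocabulary


def to_sparse_line(text, label, vocabulary):
    """Encode one document as '<label> <index>:<count> ...' with ascending indices."""
    counts = {}
    for term in text.split():
        index = vocabulary.get(term)
        if index is not None:  # terms missing from the vocabulary are skipped
            counts[index] = counts.get(index, 0) + 1
    pairs = " ".join(f"{index}:{count}" for index, count in sorted(counts.items()))
    return f"{label} {pairs}".rstrip()


# Hypothetical label encoding: 1 stands for "spam".
vocabulary = build_vocabulary(["John loves Mary"])
line = to_sparse_line("John loves Mary", 1, vocabulary)
```

The vocabulary is built from the training documents only and then reused unchanged for new text, so that a given index always refers to the same term. This sketch splits on whitespace and keeps the original casing to mirror the example; a real pipeline would normally normalise the tokens first.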
