IDF (Inverse Document Frequency) calculation for document categorization

I have a doubt about how to calculate IDF (Inverse Document Frequency) for document classification. I have more than one category, each with several training documents. I calculate the IDF for each term in a document using the following formula:

IDF(t,D) = log(Total number of documents in corpus / Number of documents matching term)

My questions:

  • What does “Total number of documents in corpus” mean? Is it the document count from the current category or from all available categories?
  • What does “Number of documents matching term” mean? Is it the count of matching documents within the current category or across all available categories?
2 answers

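"Total Number documents in Corpus" is simply the number of documents in your corpus. So if you have 20 documents, this value is 20.

"Number of Document matching term" is the count of documents in which the term t occurs. So if you have 20 documents in total and the term t occurs in 15 of them, the value for Number of Documents matching term is 15.

For this example, the value is therefore IDF(t,D) = log(20/15) = 0.1249 (using a base-10 logarithm).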
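As a minimal sketch of the arithmetic above, the snippet below counts document frequency over the whole corpus and reproduces the 20/15 example. The function name `idf`, the toy corpus, and the choice of a base-10 logarithm (inferred from the value 0.1249) are assumptions made for illustration.

```python
import math

def idf(term, documents):
    """IDF of `term` over a corpus given as a list of token lists."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain the term at least once.
    n_matching = sum(1 for doc in documents if term in doc)
    if n_matching == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    # Base-10 logarithm, assumed because it matches the value 0.1249 above.
    return math.log10(n_docs / n_matching)

# Hypothetical corpus: 20 documents, 15 of which contain the term "price".
corpus = [["price", "offer"]] * 15 + [["weather", "rain"]] * 5

print(round(idf("price", corpus), 4))  # 0.1249
```

With a different logarithm base the absolute values change, but the ordering of terms by IDF stays the same.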

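Now, if I understand correctly, you have several categories, each with its own documents, and you want to assign new documents to one or more of these categories. One way to do this is to create one document per category. Each category-document should hold all the texts labelled with that category. You can then perform tf*idf on these documents.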

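A simple way to categorize a new document is then to sum the weights of its terms, using the term weights calculated for each category. The category with the highest total is ranked 1st.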
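One possible reading of the category-document approach is sketched below: all training texts of a category are merged into a single document, tf*idf weights are computed over those merged documents, and a new document is scored by summing the weights of its terms per category. The training data, the function names, and the use of raw counts as tf are hypothetical choices for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training data: category -> list of tokenized documents.
training = {
    "sports": [["match", "goal", "team"], ["team", "coach", "goal"]],
    "finance": [["stock", "market", "price"], ["price", "bank", "market"]],
    "weather": [["rain", "wind", "storm"], ["storm", "team", "delay"]],
}

# Merge each category's documents into one "category document".
category_docs = {
    cat: [tok for doc in docs for tok in doc] for cat, docs in training.items()
}

def idf(term):
    """IDF over the category documents; 0.0 for terms that occur nowhere."""
    n_matching = sum(1 for toks in category_docs.values() if term in toks)
    return math.log10(len(category_docs) / n_matching) if n_matching else 0.0

def tfidf_weights(tokens):
    """tf*idf weight per term, with raw counts as term frequency."""
    return {term: count * idf(term) for term, count in Counter(tokens).items()}

weights = {cat: tfidf_weights(toks) for cat, toks in category_docs.items()}

def rank(query_tokens):
    """Categories ordered by the summed weights of the query terms."""
    scores = {
        cat: sum(w.get(term, 0.0) for term in query_tokens)
        for cat, w in weights.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

print(rank(["goal", "coach"]))  # "sports" is ranked first
```

Note that a term occurring in every category document gets an IDF of 0 and therefore contributes nothing to any score.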

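Another option is to build a query vector from the idf of each term in the query. Terms that do not occur in the query get the value 0. The query-vector can then be compared with each category-vector using, for example, cosine similarity.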
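The vector comparison can be sketched as follows, using sparse vectors stored as dictionaries so that absent terms implicitly have weight 0. The idf values and category vectors are hypothetical, precomputed numbers chosen only to illustrate the comparison.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical precomputed values for illustration only.
idf = {"goal": 0.477, "coach": 0.477, "price": 0.477, "team": 0.176}
category_vectors = {
    "sports": {"goal": 0.954, "coach": 0.477, "team": 0.352},
    "finance": {"price": 0.954},
}

# Query vector: idf of each query term; terms outside the query are simply absent (weight 0).
query = ["goal", "team"]
query_vector = {term: idf.get(term, 0.0) for term in query}

best = max(
    category_vectors,
    key=lambda cat: cosine_similarity(query_vector, category_vectors[cat]),
)
print(best)  # sports
```

Because cosine similarity normalizes by vector length, a category with many training texts is not favoured merely for having larger weights.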

Smoothing is also a useful technique for handling query words that do not occur in your corpus.
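The answer does not say which smoothing scheme is meant. As one common possibility, the sketch below adds 1 to both the document count and the document frequency, so that a term with zero matching documents no longer causes a division by zero. The function name `smoothed_idf` and the add-one variant are assumptions.

```python
import math

def smoothed_idf(n_matching, n_docs):
    """Add-one smoothed IDF: log10((N + 1) / (df + 1)).

    This is one possible variant, assumed for illustration; it stays
    finite even when no document contains the term (df = 0).
    """
    return math.log10((n_docs + 1) / (n_matching + 1))

print(smoothed_idf(15, 20))  # term seen in 15 of 20 documents
print(smoothed_idf(0, 20))   # unseen term: finite instead of a division by zero
```

An unseen term ends up with the largest IDF of all under this variant, which may or may not be desirable for a given classifier.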

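I suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.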
