In my project, we are calculating the text similarity of a set of documents, and I have run into two problems.
I do not want to recompute the term frequency (TF) of documents I have already processed. For example, suppose I have 10 documents and have calculated the TF and the inverse document frequency (IDF) for all 10. Then 2 more documents arrive. I do not want to recalculate TF for the 10 existing documents; I only want to calculate TF for the 2 new ones, and then use the TF of all 12 documents to calculate IDF over the whole set. How can I calculate the IDF of all documents without recalculating the TF of the existing ones?
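To make the goal concrete: IDF depends only on the total number of documents and, for each term, the number of documents that contain it (the document frequency, DF). If DF is kept alongside the stored TF, adding documents only needs TF for the new ones. The following is a minimal sketch under that assumption; the class name `IncrementalTfIdf`, the whitespace tokenizer, and the unsmoothed `log(N / DF)` formula are illustrative choices, not what Mahout does internally.

```python
import math
from collections import Counter


class IncrementalTfIdf:
    """Keeps per-document TF and per-term DF so IDF can be refreshed cheaply."""

    def __init__(self):
        self.tf = {}         # doc_id -> Counter of raw term counts
        self.df = Counter()  # term -> number of documents containing the term

    def add_document(self, doc_id, text):
        """Compute TF for one new document; existing documents are not touched."""
        if doc_id in self.tf:
            raise ValueError(f"document {doc_id!r} was already added")
        counts = Counter(text.lower().split())  # naive tokenizer (assumption)
        self.tf[doc_id] = counts
        self.df.update(counts.keys())           # each distinct term adds 1 to DF

    def idf(self, term):
        """Unsmoothed IDF over all documents seen so far."""
        doc_freq = self.df.get(term, 0)
        if doc_freq == 0:
            return 0.0  # unseen term: no meaningful IDF
        return math.log(len(self.tf) / doc_freq)

    def tfidf(self, doc_id):
        """Combine the stored TF with the current IDF values."""
        return {t: c * self.idf(t) for t, c in self.tf[doc_id].items()}
```

With this layout, adding 2 documents to an existing 10 costs two tokenization passes plus DF updates; the TF-IDF weights of the older documents change only because the IDF values they are multiplied by change.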
The number of documents can keep growing, which means an in-memory approach (InMemoryBayesDatastore) can become cumbersome. I want to store the TF of all documents in an HBase table. When new documents arrive, I would calculate their TF, save it in the same table, and then read the TF of all documents from that table for the IDF calculation. How can I use HBase to supply the data for Mahout's text similarity computation instead of reading it from a sequence file?
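One possible table layout for this, sketched with a plain dictionary standing in for the HBase table: the term is the row key and each document contributes one column holding its TF, so the DF of a term is simply the number of columns in its row. The layout, the function names `put_document` and `idf_from_table`, and the whitespace tokenizer are all assumptions for illustration; no real HBase or Mahout API is used here.

```python
import math
from collections import Counter, defaultdict

# Stand-in for an HBase table (assumption): row key = term,
# one column per document ID, cell value = raw term count (TF).
tf_table = defaultdict(dict)


def put_document(table, doc_id, text):
    """Write the TF of one new document; rows of other documents are untouched."""
    counts = Counter(text.lower().split())  # naive tokenizer (assumption)
    for term, count in counts.items():
        table[term][doc_id] = count


def idf_from_table(table, n_docs):
    """Scan the table once: DF of a term is the number of columns in its row."""
    return {term: math.log(n_docs / len(cols)) for term, cols in table.items()}


put_document(tf_table, "doc1", "hbase stores term counts")
put_document(tf_table, "doc2", "mahout reads term vectors")
idf = idf_from_table(tf_table, n_docs=2)
```

In a real deployment the dictionary writes would become puts against the table and the IDF pass would become a scan, for example driven by a MapReduce job; the total document count would have to be tracked separately.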
, MR HDFS Hbase. , , , , TF Term rowkey, , ( ). 1 MR , .
MR, (.. ). , " ". - , , , , .