Using HBase to supply data for calculating text similarity with Mahout

Question

In my project, we are trying to calculate the text similarity of a set of documents, and I have run into two problems:

1. I do not want to recompute the term frequency (TF) of documents I have already processed. For example, suppose I have 10 documents and have calculated the term frequency and inverse document frequency (IDF) for all of them. Then 2 more documents arrive. I do not want to recalculate TF for the 10 existing documents; I want to calculate TF only for the 2 new ones, then use the TF of all 12 documents to calculate IDF over the whole set. How can I calculate the IDF of all documents without recalculating the TF of the existing ones?
2. The number of documents can keep growing, so an in-memory approach (InMemoryBayesDatastore) can become cumbersome. I want to store the TF of all documents in an HBase table. When new documents arrive, I would calculate their TF, save it to the same table, and then read the TF of all documents from that table to calculate IDF. How can I use HBase, rather than a sequence file, to supply the data for Mahout's text similarity calculation?

Jhs May 18 '12 at 10:36

1 answer

Tucker · Answer 1 · 2012-07-04T04:51:11+0000

, MR HDFS Hbase. , , , , TF Term rowkey, , ( ). 1 MR , .

, .

MR, (.. ). , " ". - , , , , .
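
To make the incremental scheme from the question concrete, here is a minimal sketch in Python. It is not Mahout or HBase code: a plain dictionary stands in for the HBase table, and the layout (row key = term, one column per document holding the raw count) as well as the names `tf_table`, `add_document`, `idf`, and `tfidf_vector` are assumptions made for illustration. The point it shows is that IDF can be derived entirely from the stored TF rows, so adding documents never requires recomputing TF for older ones.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stand-in for an HBase table:
# row key = term, column qualifier = document id, cell value = raw term count.
tf_table = defaultdict(dict)
doc_ids = set()


def add_document(doc_id, text):
    """Compute TF for one new document and store it.

    Rows written for earlier documents are left untouched.
    """
    counts = Counter(text.lower().split())
    for term, count in counts.items():
        tf_table[term][doc_id] = count
    doc_ids.add(doc_id)


def idf():
    """Derive IDF for every term from the stored TF rows only.

    Document frequency is the number of columns in a term's row.
    """
    n = len(doc_ids)
    return {term: math.log(n / len(cols)) for term, cols in tf_table.items()}


def tfidf_vector(doc_id):
    """Build a sparse TF-IDF vector for one document from stored counts."""
    weights = idf()
    return {
        term: cols[doc_id] * weights[term]
        for term, cols in tf_table.items()
        if doc_id in cols
    }


# Initial batch, then a later arrival; only the new document is tokenized.
add_document("d1", "hbase stores term counts")
add_document("d2", "mahout reads term vectors")
add_document("d3", "hbase feeds mahout")
print(tfidf_vector("d3"))
```

In a real deployment the dictionary reads and writes would become puts and scans against the HBase table, and the resulting vectors would still have to be converted into whatever input format the Mahout job expects; neither step is shown here.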