Using HBase as the data source for calculating text similarity with Mahout

In my project, we are trying to calculate the text similarity of a set of documents, and I have run into two problems.

  • I do not want to recompute the term frequency (TF) of documents I have already processed. For example, say I have 10 documents and have calculated the term frequency and inverse document frequency (IDF) for all 10. Then I get 2 more documents. I do not want to recalculate TF for the 10 existing documents; I only want to calculate TF for the 2 new ones, and then use the TF of all 12 documents to calculate IDF over the whole set. How can I calculate the IDF of all documents without recalculating the TF of the existing ones?

  • The number of documents can grow, which means an in-memory approach (InMemoryBayesDatastore) can become cumbersome. I want to store the TF of all documents in an HBase table. When new documents arrive, I would calculate their TF, save it to the same table, and then read the TF of all documents from that table for the IDF calculation. How can I use HBase to feed the text similarity data to Mahout instead of reading it from a sequence file?
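
To make the first point concrete, here is a minimal sketch of the incremental idea, not Mahout's actual pipeline. The class `IncrementalTfIdf` and its methods are hypothetical names, plain in-memory maps stand in for the HBase table, and IDF is taken as the unsmoothed `ln(N / df)`, which may differ from the weighting Mahout applies. TF is computed once per document when it arrives; only the per-term document counts and the total document count change afterwards, so IDF over all 12 documents needs no second pass over the first 10.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: in-memory maps stand in for a persisted HBase table.
class IncrementalTfIdf {
    // docId -> (term -> raw count); plays the role of the stored TF table.
    private final Map<String, Map<String, Integer>> tfStore = new HashMap<>();
    // term -> number of documents containing the term.
    private final Map<String, Integer> docFreq = new HashMap<>();

    // Computes TF for one new document only; stored documents are not re-read.
    void addDocument(String docId, String text) {
        if (tfStore.containsKey(docId)) {
            return; // TF for this document is already stored
        }
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        tfStore.put(docId, tf);
        // Each distinct term in the new document raises its document frequency by one.
        for (String term : tf.keySet()) {
            docFreq.merge(term, 1, Integer::sum);
        }
    }

    int documentCount() {
        return tfStore.size();
    }

    // Unsmoothed IDF over every document seen so far (an assumption of this sketch).
    double idf(String term) {
        int df = docFreq.getOrDefault(term, 0);
        return df == 0 ? 0.0 : Math.log((double) documentCount() / df);
    }

    // TF-IDF weight built from the stored TF and the current IDF.
    double tfIdf(String docId, String term) {
        Map<String, Integer> tf = tfStore.get(docId);
        if (tf == null) {
            return 0.0;
        }
        return tf.getOrDefault(term, 0) * idf(term);
    }

    public static void main(String[] args) {
        IncrementalTfIdf index = new IncrementalTfIdf();
        index.addDocument("d1", "hbase stores rows");
        index.addDocument("d2", "mahout reads rows");
        index.addDocument("d3", "hbase hbase table"); // arrives later
        System.out.println(index.tfIdf("d3", "hbase"));
    }
}
```

In a real setup the two maps would presumably be replaced by reads and writes against the HBase table, for example one row per term with one column per document, but that layout is an assumption and not something the question specifies.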

1 answer

, MR HDFS Hbase. , , , , TF Term rowkey, , ( ). 1 MR , .

, .

MR, (.. ). , " ". - , , , , .
