High-Volume Writes to HBase

I have a Hadoop MapReduce job whose output is a row ID together with a Put/Delete operation for that row. Due to the nature of the problem, the output is quite large. We have tried several ways of getting this data back into HBase, and they have all failed...

TableReducer

This is way too slow, as it seems to do a full round trip for each row. Because of the way the keys are sorted for our reducer step, a given row ID is unlikely to be on the same node as the reducer.

completebulkload

It seems to take a very long time (it never completes), and there is no real indication of why. Both IO and CPU show very low usage.

Am I missing something obvious?

2 answers

CompleteBulkLoad was the correct answer. Per @DonaldMiner I dug deeper and found that the CompleteBulkLoad process runs as the "hbase" user, which resulted in a permission-denied error when it tried to move/rename/delete the source files. Apparently, the implementation retries for a long time before reporting an error; up to 30 minutes in our case.

Granting the hbase user permissions on those files fixes the problem.
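For anyone hitting the same thing, here is a rough sketch of the two steps as a small Python helper that only builds the commands. It assumes the fix is done by changing ownership of the HFile output directory to the hbase user (a chmod or ACL change might suit your cluster better), and that your HBase version still ships the tool as org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles. The directory /tmp/hfiles-out and the table my_table are made-up placeholders.

```python
import shlex


def bulk_load_commands(hfile_dir, table, hbase_user="hbase"):
    """Build argv lists for a permission fix followed by completebulkload.

    Assumptions (not from the original post): ownership is changed with
    `hdfs dfs -chown`, and the bulk load tool is invoked by class name.
    """
    return [
        # Give the HBase service user ownership of the generated HFiles so
        # it can move/rename/delete them during the load.
        ["hdfs", "dfs", "-chown", "-R", hbase_user, hfile_dir],
        # Run the bulk load tool against the same directory and target table.
        ["hbase", "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles",
         hfile_dir, table],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them; paths are hypothetical.
    for argv in bulk_load_commands("/tmp/hfiles-out", "my_table"):
        print(shlex.join(argv))
```

The chown generally has to be run by the file owner's superuser (typically the HDFS superuser), so review the printed commands and run them under the appropriate account rather than executing them blindly.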


, , , - hbase. , HBase, 15 000 1 . node
