I have a Hadoop MapReduce job whose output is a row ID paired with a Put / Delete operation for that row ID. Due to the nature of the problem, the output is quite large. We have tried several ways to write this data back to HBase, and they have all failed ...
TableReducer
This is way too slow, since it seems to make a full round trip for every row. Because of how the keys are sorted for our reducer step, a given row ID is unlikely to be served by the same node that runs the reducer.
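To illustrate the locality point, here is a toy simulation in plain Java (not HBase code and not our actual job). It assumes a hypothetical cluster where node i hosts both reducer i and region i, that regions split the row-key space evenly, and that the reduce key is unrelated to the row key. The class name, key space size, and node counts are all made up for the sketch.

```java
import java.util.Random;

// Toy model, not HBase code: node i is assumed to host both reducer i and region i.
class ReducerLocalitySketch {

    // Region lookup under the assumption of evenly split, contiguous row-key ranges.
    static int regionFor(int rowKey, int keySpace, int nodes) {
        return (int) ((long) rowKey * nodes / keySpace);
    }

    // Fraction of rows whose reducer happens to sit on the node serving their region.
    static double localFraction(int nodes, int rows, long seed) {
        Random random = new Random(seed);
        int keySpace = 1_000_000; // hypothetical row-key space
        int local = 0;
        for (int i = 0; i < rows; i++) {
            int rowKey = random.nextInt(keySpace);
            // Assumption: the reduce key is unrelated to the row key, so the
            // reducer is effectively a uniform random choice among the nodes.
            int reducer = random.nextInt(nodes);
            if (reducer == regionFor(rowKey, keySpace, nodes)) {
                local++;
            }
        }
        return (double) local / rows;
    }

    public static void main(String[] args) {
        for (int nodes : new int[] {5, 10, 50}) {
            System.out.printf("nodes=%d local fraction=%.3f%n",
                    nodes, localFraction(nodes, 200_000, 42L));
        }
    }
}
```

Under these assumptions the local fraction comes out at roughly 1/nodes, so on anything but a tiny cluster nearly every write would have to cross the network.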
completebulkload
This seems to take a very long time (it never completes), and there is no real indication of why. Both IO and CPU show very low usage.
Am I missing something obvious?