Adding to what Chris added above:
The number of maps is usually driven by the number of DFS blocks in the input files, although this leads people to adjust the DFS block size just to change the number of maps.
The right level of parallelism for maps seems to be around 10-100 maps per node, although this can go up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it's best if each map runs for at least a minute.
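If you want to increase the number of maps, you can do it with JobConf's conf.setNumMapTasks(int num). This can raise the number of map tasks, but it will not set the number below what Hadoop determines by splitting the input data.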
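In fact, controlling the number of maps is subtle. The mapred.map.tasks parameter is only a hint to the InputFormat for the number of maps. By default, the InputFormat splits the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits.
A lower bound on the split size can be set via mapred.min.split.size.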
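So, if you expect 10TB of input data and have 128MB DFS blocks, you will end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately, the InputFormat determines the number of maps.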
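To make the 82k figure concrete, here is a rough sketch of the arithmetic in plain Java, with no Hadoop dependency. It assumes the split size is max(minimum split size, min(total bytes / map hint, block size)), which is my reading of how the hint, the block size and mapred.min.split.size interact. The class and method names (SplitEstimate, splitSize, estimateMaps) are made up for illustration, and the input is treated as one large file, so real jobs will differ somewhat because Hadoop splits each file separately.

```java
// Simplified model of how the number of maps falls out of the split size.
// Not Hadoop code: SplitEstimate and its methods are hypothetical names.
final class SplitEstimate {
    private SplitEstimate() {}

    // Assumed rule: the map hint gives a goal size, the DFS block size caps it,
    // and the minimum split size acts as a lower bound.
    static long splitSize(long totalBytes, int mapTasksHint, long minSplitBytes, long blockBytes) {
        long goalSize = totalBytes / Math.max(1, mapTasksHint);
        long lowerBound = Math.max(1L, minSplitBytes); // never allow a zero-sized split
        return Math.max(lowerBound, Math.min(goalSize, blockBytes));
    }

    // Number of maps = input size divided by split size, rounded up.
    static long estimateMaps(long totalBytes, int mapTasksHint, long minSplitBytes, long blockBytes) {
        long size = splitSize(totalBytes, mapTasksHint, minSplitBytes, blockBytes);
        return (totalBytes + size - 1) / size;
    }

    public static void main(String[] args) {
        long MB = 1024L * 1024L;
        long TB = MB * MB;
        long input = 10 * TB;
        long block = 128 * MB;

        // Small hint: the block size wins, giving 81920 maps (the "82k" above).
        System.out.println(estimateMaps(input, 1, 1, block));
        // Hint larger than 82k: splits shrink below a block, so more maps.
        System.out.println(estimateMaps(input, 200_000, 1, block));
        // Minimum split size of 256MB: splits grow past a block, giving 40960 maps.
        System.out.println(estimateMaps(input, 1, 256 * MB, block));
    }
}
```

Note how a hint below 82k changes nothing in this model, which is why setNumMapTasks can only push the number up, while raising mapred.min.split.size is the way to get fewer maps.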
More details: http://wiki.apache.org/hadoop/HowManyMapsAndReduces