Edit: see my answer. The problem was in our code. MapReduce works fine; it may have an issue with its status messages, but at least the input readers work correctly.
I ran this experiment several times, and I am now sure that mapreduce (or DatastoreInputReader) behaves strangely. I suspect it has something to do with key ranges and how they are split, but that is only a guess.
In any case, here is the setup:
- We have an NDB model called "AdGroup". When creating new entities for this model, we use the same id that AdWords returned (an integer), but we pass it as a string:
AdGroup(id=str(adgroupId))
- We have 1,163,871 of these entities in our datastore (that is what the Datastore Admin page tells us. I know this number is not completely accurate, but we don't create or delete ad groups very often, so we can say with confidence that the number is 1.1 million or more).
- The mapreduce is started (from another pipeline) as follows:
yield mapreduce_pipeline.MapreducePipeline(
    job_name='AdGroup-process',
    mapper_spec='process.adgroup_mapper',
    reducer_spec='process.adgroup_reducer',
    input_reader_spec='mapreduce.input_readers.DatastoreInputReader',
    mapper_params={
        'entity_kind': 'model.AdGroup',
        'shard_count': 120,
        'processing_rate': 500,
        'batch_size': 20,
    },
)
So, I ran this mapreduce several times today, without changing anything in the code or in the datastore. Each time I started it, the mapper-calls counter ended up with a different value, somewhere between 450,000 and 550,000.
Correct me if I am wrong, but since I am using the most basic DatastoreInputReader, mapper-calls should be equal to the number of entities. That should be 1.1 million or more.
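To make that expectation concrete, here is a minimal pure-Python sketch. It is not the mapreduce library's actual splitting code: the helpers `split_into_ranges` and `count_in_range` and the sample ids are made up for illustration. It only shows the property I am relying on: if the shards' key ranges are contiguous and do not overlap, the per-shard counts must add up to the total number of keys, even when integer ids are stored as strings and therefore sort lexicographically.

```python
import bisect


def split_into_ranges(sorted_keys, shard_count):
    """Pick split points so the [start, end) ranges cover every key exactly once."""
    step = max(1, len(sorted_keys) // shard_count)
    split_points = sorted_keys[step::step][:shard_count - 1]
    bounds = [None] + split_points + [None]  # None means "open-ended"
    return list(zip(bounds[:-1], bounds[1:]))


def count_in_range(sorted_keys, start, end):
    """Count keys k with start <= k < end (None means unbounded)."""
    lo = 0 if start is None else bisect.bisect_left(sorted_keys, start)
    hi = len(sorted_keys) if end is None else bisect.bisect_left(sorted_keys, end)
    return hi - lo


# Hypothetical data: integer ids stored as strings, as in AdGroup(id=str(adgroupId)).
# String keys sort lexicographically: "1", "10", "100", ...
keys = sorted(str(n) for n in range(1, 10001))

ranges = split_into_ranges(keys, shard_count=120)
per_shard = [count_in_range(keys, start, end) for start, end in ranges]

# With a correct partition, this plays the role of the mapper-calls counter.
total_mapper_calls = sum(per_shard)
```

In this toy model `total_mapper_calls` always equals `len(keys)`, whatever the shard count, so a total that is roughly half the entity count would mean either that some ranges are being skipped or that the counter itself is wrong.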
: , , , , " 4 , !".
, - blobstore ( ), BlobstoreLineInputReader. , blob , DatastoreInputReader. , - ?
I also tried DatastoreKeyInputReader, with the same result: mapper-calls was between 450,000 and 550,000.
, , . , ? int ids str ids? , , mapreduce, , ?
PS: , .