python - App Engine MapReduce performance optimization
Performance is a difficult subject, but let me try anyway. I'm using App Engine MapReduce for some straightforward analysis, and I feel I'm not getting the kind of performance I would expect.
- I have an App Engine module dedicated to running a single MapReduce pipeline.
- The module instances use basic scaling, instance class B4_1G, and a maximum of 16 instances.
- The pipeline uses a queue that allows 100 concurrent requests.
- The pipeline uses 64 shards.
As an example of the kind of performance I'm seeing, here is one of the map functions:
    def map(data):
        """data is a handle to a cloudstorage file"""
        line = data.readline()
        while line:
            # create_values yields 1-5 output strings per input line
            for val in create_values(line):
                yield val
            line = data.readline()
        data.close()
The create_values function yields 1-5 strings, each forming a single line in the output. For the particular run I'm doing, each input file from Cloud Storage is ~3 MB or ~19k lines, and each output is ~14 MB. I have 64 of these files (one for each shard), and the processing time is 4-10 minutes for each shard.
Now, I know there is a lot of stuff going on behind the scenes, but to me, 4 minutes of processing time is a long time for non-complex processing of a 3 MB file with an output of 14 MB, and 10 minutes is just weird.
The same pattern of slow performance appears across the steps of the pipeline.
Any tips or ideas for fine-tuning the performance of MapReduce would be appreciated.
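To put rough numbers on why this feels slow, here is a small sketch that turns the figures quoted above into per-shard throughput. The sizes are the approximate ones from the question, and treating 1 MB as 1024 * 1024 bytes is an assumption; the function name `shard_throughput` is made up for illustration.

```python
# Approximate per-shard figures from the question (assumed: 1 MB = 1024 * 1024 bytes).
INPUT_BYTES = 3 * 1024 * 1024
OUTPUT_BYTES = 14 * 1024 * 1024
INPUT_LINES = 19000


def shard_throughput(seconds):
    """Return rough throughput numbers for one shard that took `seconds`."""
    seconds = float(seconds)
    return {
        "lines_per_s": INPUT_LINES / seconds,
        "input_kb_per_s": INPUT_BYTES / 1024.0 / seconds,
        "output_kb_per_s": OUTPUT_BYTES / 1024.0 / seconds,
    }


fastest = shard_throughput(4 * 60)   # the 4-minute shards
slowest = shard_throughput(10 * 60)  # the 10-minute shards

for label, stats in (("fastest", fastest), ("slowest", slowest)):
    print("%s: %.1f lines/s, %.1f KB/s in, %.1f KB/s out" % (
        label,
        stats["lines_per_s"],
        stats["input_kb_per_s"],
        stats["output_kb_per_s"],
    ))
```

Under these assumptions a shard handles roughly 32 to 79 input lines per second and writes roughly 24 to 60 KB per second, which is the gap I'm trying to explain.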
Edit:
I noticed in Appstats that the urlfetches to Cloud Storage are slow during execution. Here is a screenshot from the shuffle phase: [screenshot not included]
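Since the Appstats trace points at slow Cloud Storage fetches, one way to check where a shard's time goes is to time the reads separately from the value generation. The following is only a sketch: `timed_map` and the `timings` dict are hypothetical names, the `create_values` here is a stand-in for the real one, and an in-memory `io.StringIO` replaces the Cloud Storage file handle so the snippet runs on its own.

```python
import io
import time


def create_values(line):
    # Hypothetical stand-in for the real create_values: yields 1-5 strings per line.
    stripped = line.rstrip("\n")
    for i in range(1 + len(stripped) % 5):
        yield "%s-%d\n" % (stripped, i)


def timed_map(data, timings):
    """Same loop as the map function, but accumulates read and compute time."""
    while True:
        start = time.time()
        line = data.readline()
        timings["read"] += time.time() - start
        if not line:
            break
        start = time.time()
        # Materialize the values so the compute time is measured here,
        # not while the caller consumes the generator.
        values = list(create_values(line))
        timings["compute"] += time.time() - start
        timings["lines"] += 1
        for val in values:
            yield val
    data.close()


timings = {"read": 0.0, "compute": 0.0, "lines": 0}
fake_file = io.StringIO(u"alpha\nbeta\ngamma\n")  # stands in for the cloudstorage handle
output = list(timed_map(fake_file, timings))
print("lines=%d read=%.6fs compute=%.6fs" % (
    timings["lines"], timings["read"], timings["compute"]))
```

In a real run the totals could be logged at the end of each shard. If the read time dominates, that would suggest the bottleneck is in fetching from Cloud Storage and not in the map logic itself.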