python - App Engine MapReduce performance optimization


Performance is a difficult subject, but let me try anyway. I'm using App Engine MapReduce for some straightforward analysis, and I feel I'm not getting the kind of performance I would expect.

  • I have an App Engine module dedicated to running a single MapReduce pipeline.
  • The module instances use basic scaling, with instance class B4_1G and 16 maximum instances.
  • The pipeline uses a queue that allows 100 concurrent requests.
  • The pipeline uses 64 shards.

As an example of the kind of performance I'm seeing, here is one of the map functions:

    def map(data):
        """data is a handle to a cloudstorage file"""
        # Read the input one line at a time and emit every derived value.
        line = data.readline()
        while line:
            for val in create_values(line):
                yield val
            line = data.readline()
        data.close()

The create_values function yields 1-5 strings, each forming a single line in the output. For the particular run I'm doing, each input file from cloudstorage is ~3MB or ~19k lines, and each output is ~14MB. I have 64 of these files (one for each shard), and the processing time is 4-10 minutes for each shard.
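To put those figures in perspective, here is a back-of-the-envelope throughput calculation. It is only a sketch: it treats ~3MB as 3 * 1024 * 1024 bytes, uses the rounded sizes and times quoted above, and the helper name `throughput` is made up for illustration.

```python
# Rough per-shard throughput from the approximate figures quoted above.
INPUT_BYTES = 3 * 1024 * 1024    # ~3MB input file (assumed binary megabytes)
OUTPUT_BYTES = 14 * 1024 * 1024  # ~14MB output
INPUT_LINES = 19000              # ~19k lines


def throughput(seconds):
    """Return (input KB/s, output KB/s, lines/s) for one shard."""
    return (
        INPUT_BYTES / 1024.0 / seconds,
        OUTPUT_BYTES / 1024.0 / seconds,
        INPUT_LINES / float(seconds),
    )


for minutes in (4, 10):
    in_kbs, out_kbs, lines_s = throughput(minutes * 60)
    print("%2d min: %.1f KB/s in, %.1f KB/s out, %.0f lines/s"
          % (minutes, in_kbs, out_kbs, lines_s))
```

In other words, even the fastest shards get through only about 80 input lines per second, and the slowest about 30.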

Now, I know there is a lot of stuff going on behind the scenes, but to me, 4 minutes of processing time is a long time for non-complex processing of a 3MB file and an output of 14MB, and 10 minutes is just weird.
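One way to check whether the map logic itself could account for that time is to run it locally against an in-memory file and time it. The following is a minimal sketch under assumptions: `create_values` here is a hypothetical stand-in (the real function isn't shown), the input is synthetic, and `io.StringIO` substitutes for the cloudstorage file handle, so it measures only the pure-Python cost and none of the App Engine or Cloud Storage overhead.

```python
import io
import time


def create_values(line):
    """Hypothetical stand-in: yields 1-5 strings per input line."""
    fields = line.split(",")
    # The synthetic lines below have two fields, so this yields 3 strings each.
    for i in range(1 + len(fields) % 5):
        yield "%d:%s\n" % (i, line.strip())


def map_lines(data):
    """Same shape as the map function above, for any file-like object."""
    line = data.readline()
    while line:
        for val in create_values(line):
            yield val
        line = data.readline()
    data.close()


# Synthetic input of about 19k lines, roughly 3MB in total.
sample = u"".join(u"%06d,%s\n" % (n, u"x" * 150) for n in range(19000))

start = time.time()
out_lines = sum(1 for _ in map_lines(io.StringIO(sample)))
elapsed = time.time() - start
print("%d output lines in %.3f s" % (out_lines, elapsed))
```

If a run like this finishes in well under a second, the minutes-long shard times are more likely to come from I/O or framework overhead than from the map logic.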

The same pattern of slow performance appears in the other steps of the pipeline.

Any tips or ideas for fine-tuning the performance of MapReduce would be appreciated.

Edit:

I noticed in Appstats that the urlfetches to cloudstorage are slow during execution. Here is a screenshot from the shuffle phase:

[Image: Appstats screenshot of the shuffle phase]

