python - App Engine MapReduce performance optimization

May 15, 2010

Performance is a difficult subject, but let me try anyway. I'm using App Engine MapReduce for some straightforward analysis, and I feel I'm not getting the kind of performance I'd expect.

I have an App Engine module dedicated to running a single MapReduce pipeline.
The module instances use basic scaling, instance class B4_1G, and 16 maximum instances.
The pipeline uses a queue that allows 100 concurrent requests.
The pipeline uses 64 shards.
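
To put those settings side by side, here is a back-of-the-envelope sketch. It assumes each shard occupies one concurrent request and that shards spread evenly over instances, neither of which is confirmed for this setup; the variable names are only for illustration.

```python
# Settings taken from the description above.
max_instances = 16
shards = 64
queue_max_concurrent = 100

# Ceiling division: shards each instance would hold if all shards ran at once
# and were spread evenly (an assumption, not a guarantee).
shards_per_instance = -(-shards // max_instances)

# Under the one-request-per-shard assumption, the queue only throttles the job
# if it allows fewer concurrent requests than there are shards.
queue_is_bottleneck = queue_max_concurrent < shards

print("shards per instance: %d" % shards_per_instance)
print("queue limits concurrency: %s" % queue_is_bottleneck)
```

Under those assumptions the queue limit is not what holds the job back, while each instance would be juggling several shards at the same time.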

As an example of the kind of performance I'm seeing, here is one of the map functions:

    def map(data):
        """data is a handle to a cloudstorage file"""
        line = data.readline()
        while line:
            # create_values yields 1-5 output strings per input line
            for val in create_values(line):
                yield val
            line = data.readline()
        data.close()

The create_values function yields 1-5 strings, each forming a single line in the output. For the particular run I'm doing, each input file from cloudstorage is ~3MB or ~19k lines, and each output is ~14MB. I have 64 of these files (one for each shard), and the processing time is 4-10 minutes for each shard.
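
One way to check whether the time goes into the map logic itself or into storage I/O is to time the same loop locally against an in-memory file. The sketch below is only that: `create_values` here is a hypothetical stand-in (the real function is not shown), `map_lines` is an illustrative name, and the sample data is made up to match the ~19k line count.

```python
import io
import time


def create_values(line):
    # Hypothetical stand-in for the real create_values, which is not shown.
    # It only mimics "yields 1-5 strings per input line".
    for field in line.split(",")[:5]:
        yield field.strip() + "\n"


def map_lines(data):
    # Same shape as the map function, reading from any file-like object.
    for line in data:
        for val in create_values(line):
            yield val


# Made-up sample: 19000 short lines held in memory, so no storage I/O is involved.
sample = io.StringIO(u"a,b,c\n" * 19000)

start = time.time()
out_lines = sum(1 for _ in map_lines(sample))
elapsed = time.time() - start

print("%d output lines in %.3f s" % (out_lines, elapsed))
```

If the real function finishes a file of this size in a small fraction of the observed shard time when run this way, the map logic is unlikely to be where the minutes are going.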

Now, I know there is a lot of stuff going on behind the scenes, but to me, 4 minutes of processing time is a long time for non-complex processing of a 3MB file and an output of 14MB, and 10 minutes is just weird.
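
To make that concrete, here is a small sketch that converts the approximate figures above into per-shard throughput. The sizes are the rounded numbers from this post, not measurements, and `per_second` is just a helper name for this example.

```python
# Approximate per-shard figures quoted above.
input_bytes = 3 * 1024 * 1024
output_bytes = 14 * 1024 * 1024
input_lines = 19000


def per_second(amount, minutes):
    # Average rate over a shard run lasting the given number of minutes.
    return amount / (minutes * 60.0)


for minutes in (4, 10):
    print("%2d min: %.1f KB/s read, %.1f KB/s written, %.0f lines/s" % (
        minutes,
        per_second(input_bytes, minutes) / 1024.0,
        per_second(output_bytes, minutes) / 1024.0,
        per_second(input_lines, minutes),
    ))
```

Even at the fast end this works out to roughly 13 KB/s of input and about 80 input lines per second per shard.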

The same pattern of slow performance appears across the steps of the pipeline.

Any tips or ideas for fine-tuning the performance of MapReduce would be appreciated.

Edit:

I noticed in Appstats that the urlfetches to cloudstorage are slow during execution. Here is a screenshot from the shuffle phase:

[Screenshot: Appstats view of the shuffle phase (image not available)]
