amazon web services - Reading many small files from S3 very slow


Loading many small files (>200,000 files, ~4 KB each) from an S3 bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only 1 mapper is used to read the data, though I cannot figure out where the bottleneck is.

Pig code sample:

data = load 's3://data-bucket/' using PigStorage(',') as (line:chararray);

Hive code sample:

create external table data (value string) location 's3://data-bucket/';

Are there any known settings to speed up the process or to increase the number of mappers used to fetch the data?

I tried the following without noticeable effect:

  • increasing the number of task nodes
  • setting hive.optimize.s3.query=true
  • manually setting the number of mappers
  • increasing the instance type from medium to xlarge

I know s3distcp would speed up the process, but it only gave better performance after a lot of tweaking (including setting the number of worker threads), and I would prefer to change parameters directly in my Pig/Hive scripts.

You can either:

  1. use s3distcp to merge the files before the job starts: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/

  2. have a Pig script do the merging for you, once.
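For option 1, an s3distcp invocation could look like the sketch below. The jar path, destination path, `--groupBy` regex, and 128 MB target size are illustrative assumptions, not values from the question; check them against your EMR version before running.

```sh
# Merge the many small S3 files into ~128 MB files on HDFS before the job runs.
# Jar location and paths are assumptions for an older EMR AMI; adjust as needed.
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --src s3://data-bucket/ \
    --dest hdfs:///data/merged/ \
    --groupBy '.*(data).*' \
    --targetSize 128
```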

If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:

-- By default, the number of mappers equals the number of blocks; setting this
-- to true gives 1 mapper per file. Leave it false so small files get combined.
set pig.noSplitCombination false;
-- Pick this size so that sum(input sizes) / size = wanted number of mappers.
set pig.maxCombinedSplitSize 250000000;
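Putting those parameters together with the load statement from the question, a complete Pig script might look like this sketch (the bucket path comes from the question; the 250 MB split size is the answer's example value and should be tuned to your total input size):

```pig
-- Allow Pig to combine many small input files into fewer splits.
set pig.noSplitCombination false;
-- ~250 MB per combined split; tune so sum(input sizes) / this = desired #mappers.
set pig.maxCombinedSplitSize 250000000;

data = load 's3://data-bucket/' using PigStorage(',') as (line:chararray);
```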

Please provide metrics for those cases.

