amazon web services - Reading many small files from S3 very slow


Loading many small files (>200,000 files, ~4 KB each) from an S3 bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only 1 mapper is used to read the data, though I cannot figure out where the bottleneck is.

Pig code sample:

data = load 's3://data-bucket/' using PigStorage(',') as (line:chararray);

Hive code sample:

create external table data (value string) location 's3://data-bucket/';

Are there any known settings to speed up the process or to increase the number of mappers used to fetch the data?

I tried the following without noticeable effect:

  • increasing the number of task nodes
  • setting hive.optimize.s3.query=true
  • manually setting the number of mappers
  • increasing the instance type from medium to xlarge

I know s3distcp would speed up the process, but it only gave better performance after a lot of tweaking (including setting the number of worker threads), and I would prefer to change parameters directly in my Pig/Hive scripts.

You can either:

  1. use s3distcp to merge the files before the job starts: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/

  2. have a Pig script do the merging for you, once.
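For option 1, an s3distcp invocation could look like the sketch below. The jar path, destination path, `--groupBy` regex, and 128 MB target size are illustrative assumptions, not values from the question; check them against your EMR version before running.

```sh
# Merge the many small S3 files into ~128 MB files on HDFS before the job runs.
# Jar location and paths are assumptions for an older EMR AMI; adjust as needed.
hadoop jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
    --src s3://data-bucket/ \
    --dest hdfs:///data/merged/ \
    --groupBy '.*(data).*' \
    --targetSize 128
```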

If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:

-- By default, the number of mappers equals the number of blocks; setting this
-- to true gives 1 mapper per file. Leave it false so small files get combined.
set pig.noSplitCombination false;
-- Pick this size so that sum(input sizes) / size = wanted number of mappers.
set pig.maxCombinedSplitSize 250000000;
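Putting those parameters together with the load statement from the question, a complete Pig script might look like this sketch (the bucket path comes from the question; the 250 MB split size is the answer's example value and should be tuned to your total input size):

```pig
-- Allow Pig to combine many small input files into fewer splits.
set pig.noSplitCombination false;
-- ~250 MB per combined split; tune so sum(input sizes) / this = desired #mappers.
set pig.maxCombinedSplitSize 250000000;

data = load 's3://data-bucket/' using PigStorage(',') as (line:chararray);
```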

Please provide metrics for those cases.

