Reading many small files from S3 very slow
Loading many small files (>200,000 files of about 4 KB each) from an S3 bucket into HDFS via Hive or Pig on AWS EMR is extremely slow. It seems that only 1 mapper is used to read the data, though I cannot figure out where the bottleneck is.
Pig code sample:
data = LOAD 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);
Hive code sample:
CREATE EXTERNAL TABLE data (value STRING) LOCATION 's3://data-bucket/';
Are there any known settings that speed up the process or increase the number of mappers used to fetch the data?
I tried the following without any noticeable effect:
- increasing the number of task nodes
- setting hive.optimize.s3.query=true
- manually setting the number of mappers
- increasing the instance type from medium to xlarge
I know that s3distcp would speed up the process, but I could only get better performance after a lot of tweaking, including setting the number of worker threads, and I would prefer changing parameters directly in my Pig/Hive scripts.
You can either:
- use S3DistCp/DistCp to merge the files before the job starts: http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
- have a Pig script do it for you, once (see the sketch after this list).
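As a minimal sketch of the one-off merge approach, assuming a hypothetical output location s3://data-bucket-merged/ (any not-yet-existing path works): the script reads the small files with split combination enabled and stores them back, so the number of output files equals the number of mappers, which the two settings below control.

-- one-off merge job: read all small files as combined splits,
-- then write them back as far fewer, larger files
set pig.noSplitCombination false;        -- allow Pig to combine small input files
set pig.maxCombinedSplitSize 250000000;  -- combine up to ~250 MB of input per mapper

data = LOAD 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);

-- this is a map-only job, so you get one output file per mapper;
-- the output directory must not exist before the job runs
STORE data INTO 's3://data-bucket-merged/' USING PigStorage(',');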
If you want to do it through Pig, you need to know how many mappers are spawned. You can play with the following parameters:
-- if set to true, Pig spawns one mapper per file; leave it false so small files get combined
set pig.noSplitCombination false;
-- pick the size so that sum(input sizes) / pig.maxCombinedSplitSize = the wanted number of mappers
set pig.maxCombinedSplitSize 250000000;
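For instance, a rough calculation based on the numbers in the question: 200,000 files of ~4 KB each add up to roughly 800 MB of input, so with pig.maxCombinedSplitSize at 250000000 (~250 MB) you would get only 3-4 mappers; lowering it to 100000000 (~100 MB) should yield roughly 8:

-- sketch: ~800 MB of input / ~100 MB per combined split ≈ 8 mappers
set pig.noSplitCombination false;
set pig.maxCombinedSplitSize 100000000;

data = LOAD 's3://data-bucket/' USING PigStorage(',') AS (line:chararray);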
Please provide metrics for those cases.