amazon s3 - Predetermining the number of partitions of an RDD
1) How can I pre-determine the number of partitions an RDD will be created with?
2) What factors does the partitioning of the data depend on? The size of the data, the way it is stored (compressed, sequence file, etc.)?
For simplicity, assume I have a 6 GB file in HDFS stored as a plain text file.
My cluster is an EC2 cluster with the following configuration:
1 master node - m3.xlarge (4 cores, 15 GB RAM)
4 core nodes - m3.xlarge (4 cores, 15 GB RAM each)
Update: What happens if the same data is stored in S3, HBase, or another NoSQL store?
The partitioning depends on the file type. In your case, since it is an HDFS file, the default number of partitions is the number of input splits, which depends on your Hadoop setup. If you want a way of understanding how this works, look at the Spark source.
From HadoopRDD.getPartitions:

    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
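To see this in practice, here is a minimal sketch (assuming a running Spark installation; the HDFS path and app name are hypothetical) that loads a text file and prints how many partitions Spark produced. Note that the minPartitions argument of textFile is only a lower-bound hint; the actual count still comes from the input splits the InputFormat returns.

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("partition-count")
        val sc = new SparkContext(conf)

        // Hypothetical 6 GB plain-text file in HDFS; with a 128 MB block size
        // this would yield roughly 6 GB / 128 MB = 48 input splits, so ~48 partitions.
        val rdd = sc.textFile("hdfs:///data/sixgb.txt")
        println(s"default partitions: ${rdd.partitions.length}")

        // Asking for more partitions than there are splits is honored as a hint,
        // but you cannot get fewer partitions than the number of splits.
        val rdd2 = sc.textFile("hdfs:///data/sixgb.txt", minPartitions = 100)
        println(s"with minPartitions=100: ${rdd2.partitions.length}")

        sc.stop()
      }
    }

For S3 the same InputFormat-based split logic applies when you read through sc.textFile, so the partition count is again driven by the split size rather than by HDFS blocks; sources like HBase or other NoSQL stores use their own RDD implementations, where partitions typically follow the store's own sharding (for example, one partition per HBase region).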