amazon s3 - Predetermining the number of partitions of an RDD


1) How can I predetermine the number of RDD partitions that will be created?
2) What factors does the partitioning of the data depend on? Is it the size of the data, or the way it is stored (compressed, sequence file, etc.)?

For simplicity, assume I have a 6 GB file in HDFS, stored as a plain text file.

My cluster is an EC2 cluster with the config below:

1 master node - m3.xlarge (4 cores, 15 GB RAM)

4 core nodes - m3.xlarge (4 cores, 15 GB RAM each)

Update: what happens if the same data is stored in S3, HBase, or another NoSQL store?

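The partitions are dependent on the file type. In your case, since it is an HDFS file, the default number of partitions is the number of input splits, and that will depend on your Hadoop setup. If all you want is a way of understanding how it works, the Spark source is the place to look.

From HadoopRDD.getPartitions: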

    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
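
To put a number on the 6 GB example, here is a rough sketch of the split arithmetic in plain Scala (no Spark needed). It follows the split-size rule of Hadoop's old-API FileInputFormat as I understand it. The 128 MB block size, the 1-byte minimum split size, and the names SplitEstimate and estimateSplits are assumptions made for illustration and are not taken from the question or from Spark itself. It only covers a single non-empty, splittable file.

```scala
object SplitEstimate {
  // FileInputFormat lets the last split run up to 10% over the split size.
  private val SplitSlop = 1.1

  /** Estimate the number of input splits for one splittable file.
    *
    * fileSize and blockSize are in bytes. minPartitions is the hint that
    * Spark passes to getSplits. minSplitSize stands in for the configured
    * minimum split size (assumed to be 1 byte here).
    */
  def estimateSplits(fileSize: Long,
                     blockSize: Long,
                     minPartitions: Int,
                     minSplitSize: Long = 1L): Int = {
    // Target size if the file were cut into exactly minPartitions pieces.
    val goalSize = fileSize / math.max(minPartitions, 1)
    // Never larger than a block, never smaller than the minimum split size.
    val splitSize = math.max(minSplitSize, math.min(goalSize, blockSize))

    var remaining = fileSize
    var splits = 0
    while (remaining.toDouble / splitSize > SplitSlop) {
      remaining -= splitSize
      splits += 1
    }
    // Whatever is left over becomes the final split.
    if (remaining != 0) splits += 1
    splits
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val fileSize = 6L * 1024L * mb // the 6 GB file from the question
    val blockSize = 128L * mb      // assumed HDFS block size
    println(estimateSplits(fileSize, blockSize, minPartitions = 2))
  }
}
```

Under these assumptions the 6 GB file comes out as 48 splits, so 48 partitions. With a 64 MB block size it would be 96. In Spark itself the hint is the second argument of sc.textFile(path, minPartitions), and you can check what you actually got with rdd.partitions.length.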
