amazon s3 - Predetermining the number of partitions of an RDD
1) How can I pre-determine the number of partitions an RDD will be created with?
2) What factors does the partitioning of the data depend on? The size of the data, the way it is stored (compressed, sequence file, etc.)?
For simplicity, assume I have a 6 GB file in HDFS stored as a plain text file.
My cluster is an EC2 cluster with the following configuration:
1 master node - m3.xlarge (4 cores, 15 GB RAM)
4 core nodes - m3.xlarge (4 cores, 15 GB RAM each)
Update: What happens if the same data is stored in S3, HBase, or another NoSQL store?
The partitioning depends on the file type. In your case, since it is an HDFS file, the default number of partitions is the number of input splits, which depends on your Hadoop setup. If you want a way of understanding how this works, look at the Spark source.
From HadoopRDD.getPartitions:

    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
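To see this in practice, here is a minimal sketch (assuming a running Spark installation; the HDFS path and app name are hypothetical) that loads a text file and prints how many partitions Spark produced. Note that the minPartitions argument of textFile is only a lower-bound hint; the actual count still comes from the input splits the InputFormat returns.

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("partition-count")
        val sc = new SparkContext(conf)

        // Hypothetical 6 GB plain-text file in HDFS; with a 128 MB block size
        // this would yield roughly 6 GB / 128 MB = 48 input splits, so ~48 partitions.
        val rdd = sc.textFile("hdfs:///data/sixgb.txt")
        println(s"default partitions: ${rdd.partitions.length}")

        // Asking for more partitions than there are splits is honored as a hint,
        // but you cannot get fewer partitions than the number of splits.
        val rdd2 = sc.textFile("hdfs:///data/sixgb.txt", minPartitions = 100)
        println(s"with minPartitions=100: ${rdd2.partitions.length}")

        sc.stop()
      }
    }

For S3 the same InputFormat-based split logic applies when you read through sc.textFile, so the partition count is again driven by the split size rather than by HDFS blocks; sources like HBase or other NoSQL stores use their own RDD implementations, where partitions typically follow the store's own sharding (for example, one partition per HBase region).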