Amazon S3 - Predetermining the number of partitions of an RDD
1) How can I pre-determine the number of RDD partitions that will be created?
2) What factors does the partitioning of the data depend on? Is it only the size of the data and the way it is stored (compressed, sequence file, etc.)?

For simplicity, assume I have a 6 GB file in HDFS stored as a plain text file. My cluster is an EC2 cluster with the config below:

1 master node - m3.xlarge (4 cores, 15 GB RAM)
4 core nodes - m3.xlarge (4 cores, 15 GB RAM each)

Update: What happens if the same data is stored in S3, HBase or a NoSQL store?

The partitions are dependent on the file type. In your case, since it is an HDFS file, the default number of partitions is the number of input splits, and that will depend on your Hadoop setup. If you want to understand how this works, look at HadoopRDD.getPartitions:

    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
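To make the "number of input splits" concrete, here is a rough sketch of the arithmetic that Hadoop's old-API FileInputFormat.getSplits applies to a single splittable plain text file. It is only an estimate under stated assumptions: the object and method names (SplitEstimate, estimateSplits) are made up for illustration, the HDFS block size is not given in the question so 128 MB and 64 MB are used as guesses, and minPartitions = 2 assumes sc.textFile is called without an explicit partition count. A non-splittable file (gzip, for example) would not follow this arithmetic.

```scala
// Hypothetical helper, not part of Spark or Hadoop: it only mirrors the
// split-size arithmetic of the old-API FileInputFormat for ONE splittable file.
object SplitEstimate {
  // Hadoop lets the last split be up to 10% larger than the split size.
  val SplitSlop = 1.1

  // splitSize = max(minSize, min(goalSize, blockSize))
  def splitSize(totalSize: Long, minPartitions: Int, blockSize: Long, minSize: Long = 1L): Long = {
    val goalSize = totalSize / math.max(minPartitions, 1)
    math.max(minSize, math.min(goalSize, blockSize))
  }

  // Estimated split count for a non-empty file of totalSize bytes.
  def estimateSplits(totalSize: Long, minPartitions: Int, blockSize: Long, minSize: Long = 1L): Int = {
    val size = splitSize(totalSize, minPartitions, blockSize, minSize)
    var remaining = totalSize
    var count = 0
    while (remaining.toDouble / size > SplitSlop) {
      count += 1
      remaining -= size
    }
    if (remaining != 0) count += 1 // the tail becomes the last split
    count
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val fileSize = 6L * 1024L * mb // the 6 GB file from the question

    // Assumed block sizes; check dfs.blocksize on the actual cluster.
    println(estimateSplits(fileSize, minPartitions = 2, blockSize = 128L * mb)) // prints 48
    println(estimateSplits(fileSize, minPartitions = 2, blockSize = 64L * mb))  // prints 96
  }
}
```

On a real cluster, the actual value can be read back with sc.textFile(path).partitions.length, and a larger count can be requested through the second argument of textFile (minPartitions). The number of cores in the cluster does not enter this calculation for an HDFS file; it only affects how many of the resulting partitions are processed at once.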