Spark Streaming creates one task per input file


I am processing a sequence of input files with Spark Streaming.

Spark Streaming creates one task per input file, with a corresponding number of partitions and output part files.

JavaPairInputDStream<Text, CustomDataType> myRDD =
    jssc.fileStream(path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
        new Function<Path, Boolean>() {
          @Override
          public Boolean call(Path v1) throws Exception {
            return Boolean.TRUE;
          }
        }, false);

For example, if there are 100 input files in a batch interval, then there are 100 part files in the output directory.

What does each part file represent? (An output task?)

How can I reduce the number of output files (to 2 or 4, ...)?

Does it depend on the number of partitions?

Each part file represents an RDD partition. If you want to reduce the number of partitions, you can call repartition or coalesce with the number of partitions you wish to have.
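As a rough illustration of the partition-to-part-file relationship, here is a plain-Java sketch (not Spark API; the `coalescePartitions` helper is hypothetical) of what coalescing does: 100 input files give 100 partitions, and since each partition is written as one part file, merging the partitions down to 4 leaves only 4 output files. In actual Spark code you would call `repartition(4)` on the DStream, or `rdd.coalesce(4)` inside `foreachRDD`, before writing the output.

```java
import java.util.ArrayList;
import java.util.List;

public class CoalesceSketch {
    // Hypothetical helper: merge the given partitions into at most `target`
    // buckets, mimicking the effect of Spark's coalesce(target) on an RDD.
    static List<List<String>> coalescePartitions(List<String> partitions, int target) {
        List<List<String>> merged = new ArrayList<>();
        for (int i = 0; i < target; i++) {
            merged.add(new ArrayList<>());
        }
        // Round-robin the original partitions into the target buckets.
        for (int i = 0; i < partitions.size(); i++) {
            merged.get(i % target).add(partitions.get(i));
        }
        return merged;
    }

    public static void main(String[] args) {
        // 100 input files in the interval -> 100 partitions -> 100 part files.
        List<String> inputFiles = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            inputFiles.add("input-" + i);
        }

        // After coalescing to 4 partitions, one part file per partition = 4 files.
        List<List<String>> coalesced = coalescePartitions(inputFiles, 4);
        System.out.println("part files: " + coalesced.size());
    }
}
```

The round-robin merge here is only for illustration; Spark's own coalesce groups partitions by locality, but the count of output part files is determined the same way: one per partition.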

https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations

