Spark Streaming creates one task per input file


I am processing a sequence of input files with Spark Streaming.

Spark Streaming creates one task per input file, with a corresponding number of partitions and output part files.

JavaPairInputDStream<Text, CustomDataType> myRDD =
    jssc.fileStream(path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
        new Function<Path, Boolean>() {
          @Override
          public Boolean call(Path v1) throws Exception {
            return Boolean.TRUE;
          }
        }, false);

For example, if there are 100 input files in a batch interval, then there are 100 part files in the output directory.

What does each part file represent? (An output task?)

How can I reduce the number of output files (to 2 or 4, ...)?

Does it depend on the number of partitions?

Each part file represents an RDD partition. If you want to reduce the number of partitions, you can call repartition or coalesce with the number of partitions you wish to have.
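As a rough illustration of the partition-to-part-file relationship, here is a plain-Java sketch (not Spark API; the `coalescePartitions` helper is hypothetical) of what coalescing does: 100 input files give 100 partitions, and since each partition is written as one part file, merging the partitions down to 4 leaves only 4 output files. In actual Spark code you would call `repartition(4)` on the DStream, or `rdd.coalesce(4)` inside `foreachRDD`, before writing the output.

```java
import java.util.ArrayList;
import java.util.List;

public class CoalesceSketch {
    // Hypothetical helper: merge the given partitions into at most `target`
    // buckets, mimicking the effect of Spark's coalesce(target) on an RDD.
    static List<List<String>> coalescePartitions(List<String> partitions, int target) {
        List<List<String>> merged = new ArrayList<>();
        for (int i = 0; i < target; i++) {
            merged.add(new ArrayList<>());
        }
        // Round-robin the original partitions into the target buckets.
        for (int i = 0; i < partitions.size(); i++) {
            merged.get(i % target).add(partitions.get(i));
        }
        return merged;
    }

    public static void main(String[] args) {
        // 100 input files in the interval -> 100 partitions -> 100 part files.
        List<String> inputFiles = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            inputFiles.add("input-" + i);
        }

        // After coalescing to 4 partitions, one part file per partition = 4 files.
        List<List<String>> coalesced = coalescePartitions(inputFiles, 4);
        System.out.println("part files: " + coalesced.size());
    }
}
```

The round-robin merge here is only for illustration; Spark's own coalesce groups partitions by locality, but the count of output part files is determined the same way: one per partition.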

https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations

