Spark Streaming creates one task per input file
I am processing a sequence of input files with Spark Streaming.
Spark Streaming creates one task per input file, with a corresponding number of partitions and output part files.
JavaPairInputDStream<Text, CustomDataType> myRDD = jssc.fileStream(
    path, Text.class, CustomDataType.class, SequenceFileInputFormat.class,
    new Function<Path, Boolean>() {
      @Override
      public Boolean call(Path v1) throws Exception {
        return Boolean.TRUE;
      }
    }, false);
For example, if there are 100 input files in an interval, then there are 100 part files in the output directory.
What does each part file represent? (the output of one task?)
How can I reduce the number of output files (to 2 or 4, ...)?
Does it depend on the number of partitions?
Each part file represents an RDD partition. If you want to reduce the number of partitions, you can call repartition or coalesce with the number of partitions you wish to have.
https://spark.apache.org/docs/1.3.1/programming-guide.html#transformations
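As a minimal sketch of the suggestion above: you can apply coalesce to each batch RDD via transformToPair before writing, so each interval produces at most the chosen number of part files instead of one per input file. This assumes myRDD is the DStream from the fileStream snippet in the question, and CustomDataType is the user's value type; both are placeholders here.

```java
// Hypothetical sketch: coalesce each batch of myRDD (the DStream from the
// question) down to 4 partitions, so each interval writes at most 4 part files.
JavaPairDStream<Text, CustomDataType> coalesced =
    myRDD.transformToPair(
        new Function<JavaPairRDD<Text, CustomDataType>, JavaPairRDD<Text, CustomDataType>>() {
          @Override
          public JavaPairRDD<Text, CustomDataType> call(
              JavaPairRDD<Text, CustomDataType> rdd) throws Exception {
            // coalesce(4) merges existing partitions without a shuffle;
            // use rdd.repartition(4) instead if you want a shuffle to
            // rebalance data evenly across the output files.
            return rdd.coalesce(4);
          }
        });
```

coalesce is cheaper because it avoids a shuffle, but the resulting partitions (and therefore part files) may be uneven in size; repartition shuffles the data but gives balanced output.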