java - Apache spark map-reduce explanation -
i'm wondering how works little snippet:
if have text:
ut quis pretium tellus. fusce quis suscipit ipsum. morbi viverra elit ut malesuada pellentesque. fusce eu ex quis urna lobortis finibus. integer aliquam faucibus neque id cursus. nulla non massa odio. fusce pretium felis felis, @ malesuada felis blandit nec. praesent ligula enim, gravida sit amet scelerisque eget, porta non mi. aenean vitae maximus tortor, ac facilisis orci.
and snippet code count occurences of each words on text above:
// load input data. javardd<string> input = sc.textfile(inputfile); // split words. javardd<string> words = input.flatmap(new flatmapfunction<string, string>() { public iterable<string> call(string x) { return arrays.aslist(x.split(" ")); } }); // transform word , count. javapairrdd<string, integer> counts = words.maptopair(new pairfunction<string, string, integer>() { public tuple2<string, integer> call(string x) { return new tuple2(x, 1); } }).reducebykey(new function2<integer, integer, integer>() { public integer call(integer x, integer y) { return x + y; } });
it's simple understand line
javardd<string> words = input.flatmap(new flatmapfunction<string, string>() { public iterable<string> call(string x) { return arrays.aslist(x.split(" ")); } });
creates dataset containing whole words splitted space
and line gives @ each tuple value of one, example:
javapairrdd<string, integer> counts = words.maptopair(new pairfunction<string, string, integer>() { public tuple2<string, integer> call(string x) { return new tuple2(x, 1);
ut,1
quis,1 //go on
i'm confused on how reducebykey
works, , how can count occurences of each words?
thanks in advance.
reducebykey
groups tuples key (first argument in each tuple) , makes reduce each of group.
like this:
(ut, 1), (quis, 1), ..., (quis, 1), ..., (quis, 1), ... maptopair
\ / | reducebykey + (quis, 1+1) | \ / \ / + (quis, 2+1)
Comments
Post a Comment