sorting - Python Spark: sort elements based on value
I am new to Python Spark and need some help. Thanks in advance for that!
So here we go. I have this piece of script:
from datetime import datetime
from pyspark import SparkContext

def getnormalizeddate(dateofcl):
    # the result is in [0,1]
    dot = datetime.now()
    od = datetime.strptime("jan 01 2010", "%b %d %Y")
    return float((dateofcl - od).days) / float((dot - od).days)

def addition(a, b):
    # convert any value that is still a date before adding
    a1 = a
    b1 = b
    if not type(a) is float:
        a1 = getnormalizeddate(a)
    if not type(b) is float:
        b1 = getnormalizeddate(b)
    return float(a1 + b1)

def debugfunction(x):
    print "x[0]: " + str(type(x[0]))
    print "x[1]: " + str(type(x[1])) + " --> " + str(x[1])
    return x[1]

if __name__ == '__main__':
    sc = SparkContext("local", "file scores")
    textfile = sc.textFile("/data/spark/file.csv")
    #print "number of lines: " + str(textfile.count())
    test1 = textfile.map(lambda line: line.split(";"))
    # result of this:
    # [u'01', u'01', u'add', u'filename', u'path', u'1', u'info', u'info2', u'info3', u'sep 24 2014']
    test2 = test1.map(lambda line: (line[3], datetime.strptime(line[len(line)-1], "%b %d %Y")))
    test6 = test2.reduceByKey(addition)
    #print test6
    test6.persist()
    result = sorted(test6.collect(), key=debugfunction)
This ends with an error:
Traceback (most recent call last):
  File "/data/spark/script.py", line 40, in <module>
    result=sorted(test6.collect(), key=lambda x:x[1])
TypeError: can't compare datetime.datetime to float
For info, test6.collect() gives this content:
[(u'file1', 0.95606060606060606), (u'file2', 0.91515151515151516), (u'file3', 0.8797979797979798), (u'file4', 0.0), (u'file5', 0.94696969696969702), (u'file6', 0.95606060606060606), (u'file7', 0.98131313131313136), (u'file8', 0.86161616161616161)]
I want to sort this based on the float value (not the key). How should I proceed, please?
Thank you, guys.
For those who might be interested, I found the problem. I was reducing by key, and only after that performing the addition of the items contained in the list of values. Some of the files are unique, so they are not affected by the reduction and still have a date instead of a float.
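The explanation above can be reproduced without a cluster. The following is a minimal local sketch, not Spark itself: `reduce_by_key` is a hypothetical stand-in for `reduceByKey`, and `NOW` is a fixed date chosen only to keep the result reproducible. It shows that the combining function is never called for a key that has a single value, so that value keeps its original type.

```python
from collections import defaultdict
from datetime import datetime
from functools import reduce

ORIGIN = datetime(2010, 1, 1)
NOW = datetime(2014, 10, 1)  # assumption: fixed "today" so the sketch is reproducible


def normalize(d):
    """Map a date to [0, 1] relative to ORIGIN and NOW."""
    return (d - ORIGIN).days / (NOW - ORIGIN).days


def addition(a, b):
    """Same idea as the question's reducer: convert dates lazily, then add."""
    a1 = a if isinstance(a, float) else normalize(a)
    b1 = b if isinstance(b, float) else normalize(b)
    return a1 + b1


def reduce_by_key(pairs, func):
    """Hypothetical local stand-in for reduceByKey (not the Spark API).

    func is only invoked when a key has two or more values.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


pairs = [
    ("file1", datetime(2014, 9, 24)),
    ("file1", datetime(2014, 9, 20)),
    ("file2", datetime(2014, 9, 24)),  # unique key: addition() never sees it
]
reduced = reduce_by_key(pairs, addition)

print(type(reduced["file1"]).__name__)  # float
print(type(reduced["file2"]).__name__)  # datetime
```

Sorting a list that mixes these two types is what raises the TypeError shown in the question.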
What I do now is:
test2 = test1.map(lambda line: (line[3], line[len(line)-1])).mapValues(lambda d: getnormalizeddate(datetime.strptime(d, "%b %d %Y")))
That makes pairs of (file, float).
Only then do I reduce by key.
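The ordering described here (normalize every value first, then combine per key) can be sketched locally as follows. This is an illustration under assumptions, not the original script: the rows, the `normalize` helper and the fixed `now` date are made up for the example, and a plain dictionary takes the place of the reduction by key.

```python
from collections import defaultdict
from datetime import datetime

ORIGIN = datetime(2010, 1, 1)


def normalize(date_text, now):
    """Parse a date such as 'Sep 24 2014' and scale it to [0, 1]."""
    d = datetime.strptime(date_text, "%b %d %Y")
    return (d - ORIGIN).days / (now - ORIGIN).days


# Hypothetical (file, date) rows; file2 appears only once.
rows = [
    ("file1", "Sep 24 2014"),
    ("file1", "Jan 01 2010"),
    ("file2", "Sep 24 2014"),
]
now = datetime(2014, 10, 1)  # assumption: fixed reference date

# Every value is already a float before any combining happens,
# so unique keys end up as floats too.
scores = defaultdict(float)
for name, date_text in rows:
    scores[name] += normalize(date_text, now)

print(dict(scores))
```

Because the conversion happens before the combining step, the reducer only ever has to add floats.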
Finally, the step
result=sorted(test6.collect(), key=lambda x:x[1])
gives me the sorting I was looking for.
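As a small illustration of that last step, here is the sort applied to the collected list quoted in the question. `operator.itemgetter(1)` is used in place of the lambda purely as a matter of taste; both pick the second element of each pair.

```python
from operator import itemgetter

# The (file, score) pairs quoted in the question.
scores = [
    ("file1", 0.95606060606060606),
    ("file2", 0.91515151515151516),
    ("file3", 0.8797979797979798),
    ("file4", 0.0),
    ("file5", 0.94696969696969702),
    ("file6", 0.95606060606060606),
    ("file7", 0.98131313131313136),
    ("file8", 0.86161616161616161),
]

# Ascending by value (the second element of each pair), not by key.
ranked = sorted(scores, key=itemgetter(1))

# Same ordering criterion, highest score first.
highest_first = sorted(scores, key=itemgetter(1), reverse=True)

print(ranked[0])         # ('file4', 0.0)
print(highest_first[0])  # ('file7', 0.9813131313131314)
```

On the RDD itself, `sortBy(lambda kv: kv[1])` should produce the same ordering without collecting first, assuming a PySpark version that provides it.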
I hope this helps!!