performance - Optimizing simple CPU-bound loops using Cython and replacing a list
I am trying to evaluate some approaches, and I'm hitting a stumbling block with performance.
Why is my Cython code so slow? My expectation is that the code would run quite a bit faster (maybe nanoseconds for a 2D loop with only 256 ** 2 entries) as opposed to milliseconds.
Here are my test results:

    $ python setup.py build_ext --inplace; python test.py
    running build_ext
                 counter: 0.00236220359802 sec
               pycounter: 0.00323309898376 sec
              percentage: 73.1 %

My initial code looks like this:

    #!/usr/bin/env python
    # encoding: utf-8
    # filename: loop_testing.py

    def generate_coords(dim, length):
        """Generates a list of coordinates from the dimensions and size
        provided.

        Parameters:
            dim -- dimension
            length -- size of each dimension

        Returns:
            A list of coordinates based on dim and length
        """
        values = []
        if dim == 2:
            for x in xrange(length):
                for y in xrange(length):
                    values.append((x, y))
        if dim == 3:
            for x in xrange(length):
                for y in xrange(length):
                    for z in xrange(length):
                        values.append((x, y, z))
        return values

This works like I need, but it is slow. For a given dim, length = (2, 256), I see a timing on IPython of approximately 2.3 ms.
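For comparison, the nested loops above can also be written with `itertools.product` from the standard library. This is only a sketch: the name `generate_coords_product` is made up here, and nothing is claimed about its speed relative to the Cython version. It simply builds the same list for any `dim` without one branch per dimension.

```python
from itertools import product


def generate_coords_product(dim, length):
    """Return all coordinate tuples for `dim` axes of size `length`.

    Hypothetical helper (not from the original post); it produces the same
    tuples, in the same order, as the nested loops for dim == 2 or 3.
    """
    # product(range(n), repeat=dim) varies the last axis fastest,
    # which matches the innermost loop of the nested version.
    return list(product(range(length), repeat=dim))


coords = generate_coords_product(2, 3)
print(coords[:4])
```

Timing it with the same `timeit` harness as the other candidates would show whether it is worth keeping as a pure-Python baseline.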
In an attempt to speed this up, I developed a Cython equivalent (I think it's equivalent).

    #!/usr/bin/env python
    # encoding: utf-8
    # filename: loop_testing.pyx
    # cython: boundscheck=False
    # cython: wraparound=False

    cimport cython
    from cython.parallel cimport prange

    import numpy as np
    cimport numpy as np

    ctypedef int DTYPE

    # 2D point updater
    cpdef inline void _counter_2d(DTYPE[:, :] narr, int val) nogil:
        cdef:
            DTYPE count = 0
            DTYPE index = 0
            DTYPE x, y

        for x in range(val):
            for y in range(val):
                narr[index][0] = x
                narr[index][1] = y
                index += 1

    cpdef DTYPE[:, :] counter(dim=2, val=256):
        narr = np.zeros((val**dim, dim), dtype=np.dtype('i4'))
        _counter_2d(narr, val)
        return narr

    def pycounter(dim=2, val=256):
        vals = []
        for x in xrange(val):
            for y in xrange(val):
                vals.append((x, y))
        return vals

And the invocation of the timing:

    #!/usr/bin/env python
    # filename: test.py
    """
    Usage:
        test.py [options]
        test.py [options] <val>
        test.py [options] <dim> <val>

    Options:
        -h --help       This message
        -n              Number of loops [default: 10]
    """

    if __name__ == "__main__":
        from docopt import docopt
        from timeit import Timer

        args = docopt(__doc__)
        dim = args.get("<dim>") or 2
        val = args.get("<val>") or 256
        n = args.get("-n") or 10
        dim = int(dim)
        val = int(val)
        n = int(n)

        tests = ['counter', 'pycounter']
        timing = {}
        for test in tests:
            code = "{}(dim=dim, val=val)".format(test)
            variables = "dim, val = ({}, {})".format(dim, val)
            setup = "from loop_testing import {}; {}".format(test, variables)
            t = Timer(code, setup=setup)
            timing[test] = t.timeit(n) / n

        for test, val in timing.iteritems():
            print "{:>20}: {} sec".format(test, val)
        print "{:>20}: {:>.3} %".format("percentage", timing['counter'] / timing['pycounter'] * 100)

And for reference, the setup.py to build the Cython code:

    from distutils.core import setup
    from Cython.Build import cythonize
    import numpy

    include_path = [numpy.get_include()]

    setup(
        name="looping",
        ext_modules=cythonize('loop_testing.pyx'),  # accepts a glob pattern
        include_dirs=include_path,
    )

EDIT: Link to a working version: https://github.com/brianbruggeman/cython_experimentation
This Cython code was slow because of the narr[index][0] = x assignment, which relies heavily on the Python C-API. Using narr[index, 0] = x instead is translated to pure C, and solves this issue.
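At the Python level the two spellings are different operations: the chained form first fetches a row object and then assigns into it, while the tuple form is a single call carrying both indices. The toy class below is a hypothetical illustration in plain Python (it is neither NumPy nor a Cython memoryview) that records which calls each spelling triggers. It says nothing about timings; it only makes the extra intermediate step visible.

```python
class LoggingGrid(object):
    """Toy 2D container that records each indexing call it receives.

    Hypothetical illustration only; not NumPy and not a Cython memoryview.
    """

    def __init__(self, rows, cols):
        self.data = [[0] * cols for _ in range(rows)]
        self.calls = []

    def __getitem__(self, key):
        self.calls.append(("getitem", key))
        if isinstance(key, tuple):
            row, col = key
            return self.data[row][col]
        # Chained form: hand back the whole row as an intermediate object.
        return self.data[key]

    def __setitem__(self, key, value):
        self.calls.append(("setitem", key))
        row, col = key
        self.data[row][col] = value


grid = LoggingGrid(4, 2)

grid[1][0] = 7   # chained: fetches row 1 first, then assigns into that row
chained_calls = list(grid.calls)

grid.calls = []
grid[1, 1] = 9   # tuple: one call carrying both indices
tuple_calls = list(grid.calls)

print(chained_calls)
print(tuple_calls)
```

In the Cython function from the question, the corresponding change is to write `narr[index, 0] = x` and `narr[index, 1] = y` inside the loop.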
As pointed out by @perimosocordiae, using cythonize with annotations is definitely the way to go to debug such issues.
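One way to turn the annotations on is to pass `annotate=True` to `cythonize`. The fragment below is a sketch that reuses the file and package names from the question's setup.py. It is a build-configuration fragment and needs Cython and NumPy installed to run.

```python
# setup.py fragment: same build as in the question, plus annotation output
from distutils.core import setup
from Cython.Build import cythonize
import numpy

setup(
    name="looping",
    # annotate=True asks Cython to write an HTML report next to the .pyx file
    ext_modules=cythonize('loop_testing.pyx', annotate=True),
    include_dirs=[numpy.get_include()],
)
```

Running `cython -a loop_testing.pyx` from the command line produces the same kind of report. In the generated HTML, lines highlighted in yellow involve Python interaction, so a yellow line inside a hot loop is the first place to look.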
In some cases it can also be worth explicitly specifying compilation flags in the setup.py for gcc,

    setup(
        [...]
        extra_compile_args=['-O2', '-march=native'],
        extra_link_args=['-O2', '-march=native'])

This should not be necessary, assuming reasonable default compilation flags. However, for instance, on my Linux system the default appears to be no optimization at all, and adding the above flags results in a significant performance improvement.
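In distutils, `extra_compile_args` and `extra_link_args` are parameters of `Extension`, so one way to apply the flags is to build the `Extension` explicitly and hand it to `cythonize`. The fragment below is a sketch under that assumption. The module and file names are taken from the question, and the flags assume a GCC-compatible compiler.

```python
# setup.py fragment: optimization flags attached to the Extension object
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy

# Flags taken from the answer above; they assume a GCC-compatible compiler.
flags = ['-O2', '-march=native']

ext = Extension(
    "loop_testing",
    sources=["loop_testing.pyx"],
    include_dirs=[numpy.get_include()],
    extra_compile_args=flags,
    extra_link_args=flags,
)

setup(name="looping", ext_modules=cythonize([ext]))
```

Note that `-march=native` ties the compiled module to the CPU features of the build machine, so it is best left out of anything meant to be distributed.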