python - Cythonizing string array comparison function to be applied to pandas DataFrame -
I am just getting started with Cython and would appreciate pointers on how to approach the process. I have identified a speed bottleneck in my code and would like to optimize the performance of a specific operation.
I have a pandas DataFrame trades that looks like this:
                           codes    price  size
time
2015-02-24 15:30:01-05:00  r6,is  11.6100   100
2015-02-24 15:30:01-05:00  r6,is  11.6100   100
2015-02-24 15:30:01-05:00  r6,is  11.6100   100
2015-02-24 15:30:01-05:00         11.6100   375
2015-02-24 15:30:01-05:00  r6,is  11.6100   100
...                          ...      ...   ...
2015-02-24 15:59:55-05:00  r6,is  11.5850   100
2015-02-24 15:59:55-05:00  r6,is  11.5800   200
2015-02-24 15:59:55-05:00      t  11.5850   100
2015-02-24 15:59:56-05:00  r6,is  11.5800   175
2015-02-24 15:59:56-05:00  r6,is  11.5800   225

[5187 rows x 3 columns]
I also have a numpy array called codes:

array(['4', 'ap', 'cm', 'bp', 'fa', 'fi', 'nc', 'nd', 'ni', 'no', 'pt',
       'pv', 'px', 'sd', 'wo'], dtype='|S2')
I need to filter trades such that if any of the values in codes is included in trades['codes'], that row is excluded. I am currently doing this:
ix = trades.codes.str.split(',').apply(lambda cs: not any(c in excludes for c in cs))
trades = trades[ix]
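For context, here is a minimal self-contained version of that approach on toy data (the rows below and the excludes name are stand-ins I made up to mirror the setup above):

import pandas as pd

# Toy stand-in for the trades frame above; missing codes are empty strings here.
trades = pd.DataFrame({
    'codes': ['r6,is', 'cm,is', '', 'r6,is', 't'],
    'price': [11.61, 11.61, 11.61, 11.585, 11.585],
    'size':  [100, 100, 375, 200, 100],
})

# The exclusion list (the numpy array of codes from the question).
excludes = ['4', 'ap', 'cm', 'bp', 'fa', 'fi', 'nc', 'nd',
            'ni', 'no', 'pt', 'pv', 'px', 'sd', 'wo']

# Row-wise filter: keep a row only if none of its comma-separated codes
# is in the exclusion list.  This apply over every row is the slow part.
ix = trades.codes.str.split(',').apply(lambda cs: not any(c in excludes for c in cs))
trades = trades[ix]
print(trades)   # the 'cm,is' row is dropped, everything else is kept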
However, this is too slow and I need to make it faster. I want to use Cython (as described here), or maybe Numba, but Cython seems like the better option.
I need a function like this:

def isincodes(codes_array1, codes_array2):
    for x in codes_array1:
        for y in codes_array2:
            if x == y:
                return True
    return False
What types do I need to use when Cythonizing this?
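As an aside (not part of the original question or the answer below): before reaching for Cython, the nested-loop membership test can be expressed with numpy set operations, which already run in C. A minimal sketch, assuming both inputs are plain arrays or lists of comparable string types:

import numpy as np

# Sketch only: np.in1d tests each element of the first array for membership
# in the second, so .any() gives the same True/False as the nested loops.
def isincodes(codes_array1, codes_array2):
    return bool(np.in1d(codes_array1, codes_array2).any())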
This is vectorizable.
First construct the frame; I took 100000 * the example, so 1M rows.
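The answer does not show the construction code itself; a plausible reconstruction (my assumption, chosen to match the info() output below) is to tile the 10-row example 100,000 times:

import numpy as np
import pandas as pd

# Ten example rows; one row has a missing code, so the tiled frame ends up
# with 900000 non-null codes out of 1000000, as in the info() below.
base = pd.DataFrame({
    'date': pd.to_datetime(['2015-02-24 20:30:01'] * 5
                           + ['2015-02-24 20:59:55'] * 3
                           + ['2015-02-24 20:59:56'] * 2),
    'code': ['r6,is', 'r6,is', 'r6,is', np.nan, 'r6,is',
             'r6,is', 'r6,is', 't', 'r6,is', 'r6,is'],
    'price': [11.61, 11.61, 11.61, 11.61, 11.61,
              11.585, 11.58, 11.585, 11.58, 11.58],
    'volume': [100, 100, 100, 375, 100, 100, 200, 100, 175, 225],
})

# Repeat without resetting the index, which matches "1000000 entries, 0 to 9".
df2 = pd.concat([base] * 100000)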
In [76]: df2.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 9
Data columns (total 4 columns):
date      1000000 non-null datetime64[ns]
code      900000 non-null object
price     1000000 non-null float64
volume    1000000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 38.1+ MB

In [77]: df2.head()
Out[77]:
                 date   code  price  volume
0 2015-02-24 20:30:01  r6,is  11.61     100
1 2015-02-24 20:30:01  r6,is  11.61     100
2 2015-02-24 20:30:01  r6,is  11.61     100
3 2015-02-24 20:30:01    NaN  11.61     375
4 2015-02-24 20:30:01  r6,is  11.61     100
Ideally this code would be: df2.code.str.split(',', expand=True), but there is a perf issue ATM that is going to be fixed in 0.16.2, see here. So here is the code for splitting in a performant way.
In [78]: result = DataFrame([ [s] if not isinstance(s, list) else s for s in df2.code.str.split(',') ], columns=['a','b'])

In [79]: %timeit DataFrame([ [s] if not isinstance(s, list) else s for s in df2.code.str.split(',') ], columns=['a','b'])
1 loops, best of 3: 941 ms per loop

In [80]: result.head()
Out[80]:
     a     b
0   r6    is
1   r6    is
2   r6    is
3  NaN  None
4   r6    is
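For completeness, once the split performance issue mentioned above is fixed, the same two-column frame should come from the one-liner quoted earlier; the column renaming is my addition so it stays a drop-in replacement for result:

# Equivalent result via expand=True; the split returns columns named 0 and 1,
# so rename them to 'a' and 'b' to match the code below.
result = df2.code.str.split(',', expand=True)
result.columns = ['a', 'b']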
I added 't' to the end of isin:
In [81]: isin
Out[81]: ['4', 'ap', 'cm', 'bp', 'fa', 'fi', 'nc', 'nd', 'ni', 'no', 'pt', 'pv', 'px', 'sd', 'wo', 't']
The results:
In [82]: df2[(result.a.isin(isin) | result.b.isin(isin))].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 100000 entries, 7 to 7
Data columns (total 4 columns):
date      100000 non-null datetime64[ns]
code      100000 non-null object
price     100000 non-null float64
volume    100000 non-null int64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 3.8+ MB

In [83]: df2[(result.a.isin(isin) | result.b.isin(isin))].head()
Out[83]:
                 date code   price  volume
7 2015-02-24 20:59:55    t  11.585     100
7 2015-02-24 20:59:55    t  11.585     100
7 2015-02-24 20:59:55    t  11.585     100
7 2015-02-24 20:59:55    t  11.585     100
7 2015-02-24 20:59:55    t  11.585     100
The actual selection operation is faster than the splitting here.
In [84]: %timeit df2[(result.a.isin(isin) | result.b.isin(isin))]
10 loops, best of 3: 106 ms per loop
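Note that the question actually asked for exclusion rather than selection; with the same mask that is just a negation (my addition, not part of the timed runs above):

# Boolean mask of rows whose split codes contain something in isin ...
mask = result.a.isin(isin) | result.b.isin(isin)

# ... and the frame the question wanted: everything else.
filtered = df2[~mask]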