python - Cythonizing string array comparison function to be applied to pandas DataFrame


I am getting started with Cython and would appreciate some pointers on how to approach this process. I have identified a speed bottleneck in my code and would like to optimize the performance of a specific operation.

I have a pandas DataFrame, trades, that looks like this:

                                  codes    price  size
    time
    2015-02-24 15:30:01-05:00     r6,is  11.6100   100
    2015-02-24 15:30:01-05:00     r6,is  11.6100   100
    2015-02-24 15:30:01-05:00     r6,is  11.6100   100
    2015-02-24 15:30:01-05:00            11.6100   375
    2015-02-24 15:30:01-05:00     r6,is  11.6100   100
    ...                             ...      ...   ...
    2015-02-24 15:59:55-05:00     r6,is  11.5850   100
    2015-02-24 15:59:55-05:00     r6,is  11.5800   200
    2015-02-24 15:59:55-05:00         t  11.5850   100
    2015-02-24 15:59:56-05:00     r6,is  11.5800   175
    2015-02-24 15:59:56-05:00     r6,is  11.5800   225

    [5187 rows x 3 columns]

I have a numpy array called codes:

    array(['4', 'ap', 'cm', 'bp', 'fa', 'fi', 'nc', 'nd', 'ni', 'no', 'pt',
           'pv', 'px', 'sd', 'wo'],
          dtype='|S2')

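I need to filter trades so that a row is excluded if any of the values in codes appears in that row's trades['codes'] entry (the array is named excludes in the snippet below). Currently I am doing this: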

    ix = trades.codes.str.split(',').apply(lambda cs: not any(c in excludes for c in cs))
    trades = trades[ix]

However, this is slow and I need to make it faster. I want to use Cython (as described here) or maybe Numba, but it seems Cython is the better option.

I need a function like this:

    def isincodes(codes_array1, codes_array2):
        # Return True as soon as any element is shared by both arrays.
        for x in codes_array1:
            for y in codes_array2:
                if x == y: return True
        return False
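Before reaching for Cython, the nested loop can be sketched in plain Python with a set, which replaces the inner scan with a hash lookup. The name is_in_codes and the sample values are illustrative assumptions, not part of the original code:

```python
def is_in_codes(row_codes, excluded):
    """Return True if any code in row_codes also appears in excluded.

    excluded is assumed to be a set or frozenset; row_codes may be any iterable.
    """
    return not excluded.isdisjoint(row_codes)


# Hypothetical sample data mirroring the shapes in the question.
excluded = frozenset(["4", "ap", "cm", "t"])
rows = ["r6,is", "t", "r6,4", ""]

# Keep only the rows whose codes do not intersect the excluded set.
kept = [r for r in rows if not is_in_codes(r.split(","), excluded)]
```

Building the set once, outside the per-row loop, is the point of the sketch; rebuilding it for every row would give back most of the saving.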

What types do I need to use when cythonizing this function?

This is vectorizable.

Construct a frame. I took 100000 copies of your example, which gives 1M rows.

    In [76]: df2.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1000000 entries, 0 to 9
    Data columns (total 4 columns):
    date      1000000 non-null datetime64[ns]
    code      900000 non-null object
    price     1000000 non-null float64
    volume    1000000 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
    memory usage: 38.1+ MB

    In [77]: df2.head()
    Out[77]:
                     date   code  price  volume
    0 2015-02-24 20:30:01  r6,is  11.61     100
    1 2015-02-24 20:30:01  r6,is  11.61     100
    2 2015-02-24 20:30:01  r6,is  11.61     100
    3 2015-02-24 20:30:01    NaN  11.61     375
    4 2015-02-24 20:30:01  r6,is  11.61     100

This code would normally be df2.code.str.split(',', expand=True), but there is a performance issue at the moment, which is going to be fixed in 0.16.2 (see here). So the code below does the splitting in a performant way.

    In [78]: result = DataFrame([ [ s ] if not isinstance(s, list) else s for s in df2.code.str.split(',') ],columns=['a','b'])

    In [79]: %timeit DataFrame([ [ s ] if not isinstance(s, list) else s for s in df2.code.str.split(',') ],columns=['a','b'])
    1 loops, best of 3: 941 ms per loop

    In [80]: result.head()
    Out[80]:
         a     b
    0   r6    is
    1   r6    is
    2   r6    is
    3  NaN  None
    4   r6    is
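On a pandas version where the expand=True performance issue is resolved (an assumption about the installed version), the same two-column frame can be sketched without the list comprehension. The small Series here is made-up sample data, and the column names a and b simply follow the answer:

```python
import numpy as np
import pandas as pd

# Made-up sample mirroring the 'code' column: two codes, a missing value, one code.
code = pd.Series(["r6,is", "r6,is", np.nan, "t"], name="code")

# expand=True returns a DataFrame with one column per split position;
# rows with fewer codes (or a missing value) are padded with nulls.
result = code.str.split(",", expand=True)
result.columns = ["a", "b"]
```

Note that the number of columns follows the longest row, so assigning exactly two names only works when no row carries more than two codes.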

I added 't' to the end of isin:

    In [81]: isin
    Out[81]:
    ['4',
     'ap',
     'cm',
     'bp',
     'fa',
     'fi',
     'nc',
     'nd',
     'ni',
     'no',
     'pt',
     'pv',
     'px',
     'sd',
     'wo',
     't']

Results:

    In [82]: df2[(result.a.isin(isin) | result.b.isin(isin))].info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 100000 entries, 7 to 7
    Data columns (total 4 columns):
    date      100000 non-null datetime64[ns]
    code      100000 non-null object
    price     100000 non-null float64
    volume    100000 non-null int64
    dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
    memory usage: 3.8+ MB

    In [83]: df2[(result.a.isin(isin) | result.b.isin(isin))].head()
    Out[83]:
                     date code   price  volume
    7 2015-02-24 20:59:55    t  11.585     100
    7 2015-02-24 20:59:55    t  11.585     100
    7 2015-02-24 20:59:55    t  11.585     100
    7 2015-02-24 20:59:55    t  11.585     100
    7 2015-02-24 20:59:55    t  11.585     100

The actual membership test is much faster than the splitting here.

    In [84]: %timeit df2[(result.a.isin(isin) | result.b.isin(isin))]
    10 loops, best of 3: 106 ms per loop
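The session above selects the rows that match one of the listed codes, whereas the question asks to drop them. A minimal sketch of the exclusion follows, assuming a small made-up frame and an illustrative excludes list (neither is the original data), and a pandas version where str.split with expand=True is fast enough:

```python
import numpy as np
import pandas as pd

# Made-up trades frame; only the 'codes' column matters for the filter.
trades = pd.DataFrame({
    "codes": ["r6,is", "r6,is", np.nan, "t", "r6,4"],
    "price": [11.61, 11.61, 11.61, 11.585, 11.58],
    "size": [100, 100, 375, 100, 175],
})
excludes = ["4", "ap", "cm", "t"]  # illustrative subset of the codes to exclude

# One column per code position; missing positions become nulls.
parts = trades["codes"].str.split(",", expand=True)

# A row is a hit if any of its code positions is in the exclusion list.
hit = parts.isin(excludes).any(axis=1)

# Negate the mask to keep only rows with no excluded code.
kept = trades[~hit]
```

Using any(axis=1) over all split columns avoids naming each column by hand, so the same lines work if some rows carry more than two codes.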
