performance - Fastest way to find compute function on DataFrame slices by column value (Python pandas) -

May 15, 2011

i trying create column on data frame contains minimum of column (the value column), column b (the id column) has particular value. code slow. i'm looking faster way this. here little function:

def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):     rid in set(df[id_col].values):         df.loc[df[id_col] == rid, offset_col] =  f(df[df[id_col] == rid][val_col])     return df

and example usage:

import pandas pd import numpy np # create data frame df = pd.dataframe({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2],                     "val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]})  print df.head(10) # output    id_col  val_col 0       0      0.1 1       0      0.2 2       0      0.3 3       1      0.6 4       1      0.4 5       1      0.5 6       2      0.2 7       2      0.1 8       2      0.0  df = apply_by_id_value(df) print df.head(10)  # output     id_col  val_col  offset 0       0      0.1     0.1 1       0      0.2     0.1 2       0      0.3     0.1 3       1      0.6     0.4 4       1      0.4     0.4 5       1      0.5     0.4 6       2      0.2     0.0 7       2      0.1     0.0 8       2      0.0     0.0

some more context: in real data, "id_col" column has 30000 or more unique values. means data frame has sliced 30000 times. imagine bottleneck.

perform groupby on 'id_col' , transform passing function 'min', return series aligned original df can add new column:

in [13]:  df = pd.dataframe({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2],                     "val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]}) df['offset'] = df.groupby('id_col').transform('min') df out[13]:    id_col  val_col  offset 0       0      0.1     0.1 1       0      0.2     0.1 2       0      0.3     0.1 3       1      0.6     0.4 4       1      0.4     0.4 5       1      0.5     0.4 6       2      0.2     0.0 7       2      0.1     0.0 8       2      0.0     0.0

timings

in [15]:  def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min):     rid in set(df[id_col].values):         df.loc[df[id_col] == rid, offset_col] =  f(df[df[id_col] == rid][val_col])     return df %timeit apply_by_id_value(df) %timeit df.groupby('id_col').transform('min') 100 loops, best of 3: 8.12 ms per loop 100 loops, best of 3: 5.99 ms per loop

so groupby , transform faster on dataset, expect faster on real dataset scale better.

for 800,000 row df following timings:

1 loops, best of 3: 611 ms per loop 1 loops, best of 3: 438 ms per loop

Search This Blog

Script

performance - Fastest way to find compute function on DataFrame slices by column value (Python pandas) -

Comments

Post a Comment

Popular posts from this blog

android - Sent Blob results empty -

javascript - Bootstrap Popover: iOS Safari strange behaviour -

ruby - How to configure keymap of Rubymine for rails console -