performance - Fastest way to find compute function on DataFrame slices by column value (Python pandas) -
i trying create column on data frame contains minimum of column (the value column), column b (the id column) has particular value. code slow. i'm looking faster way this. here little function:
def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min): rid in set(df[id_col].values): df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col]) return df and example usage:
import pandas pd import numpy np # create data frame df = pd.dataframe({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2], "val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]}) print df.head(10) # output id_col val_col 0 0 0.1 1 0 0.2 2 0 0.3 3 1 0.6 4 1 0.4 5 1 0.5 6 2 0.2 7 2 0.1 8 2 0.0 df = apply_by_id_value(df) print df.head(10) # output id_col val_col offset 0 0 0.1 0.1 1 0 0.2 0.1 2 0 0.3 0.1 3 1 0.6 0.4 4 1 0.4 0.4 5 1 0.5 0.4 6 2 0.2 0.0 7 2 0.1 0.0 8 2 0.0 0.0 some more context: in real data, "id_col" column has 30000 or more unique values. means data frame has sliced 30000 times. imagine bottleneck.
perform groupby on 'id_col' , transform passing function 'min', return series aligned original df can add new column:
in [13]: df = pd.dataframe({"id_col":[0, 0, 0, 1, 1, 1, 2, 2, 2], "val_col":[0.1, 0.2, 0.3, 0.6, 0.4, 0.5, 0.2, 0.1, 0.0]}) df['offset'] = df.groupby('id_col').transform('min') df out[13]: id_col val_col offset 0 0 0.1 0.1 1 0 0.2 0.1 2 0 0.3 0.1 3 1 0.6 0.4 4 1 0.4 0.4 5 1 0.5 0.4 6 2 0.2 0.0 7 2 0.1 0.0 8 2 0.0 0.0 timings
in [15]: def apply_by_id_value(df, id_col="id_col", val_col="val_col", offset_col="offset", f=min): rid in set(df[id_col].values): df.loc[df[id_col] == rid, offset_col] = f(df[df[id_col] == rid][val_col]) return df %timeit apply_by_id_value(df) %timeit df.groupby('id_col').transform('min') 100 loops, best of 3: 8.12 ms per loop 100 loops, best of 3: 5.99 ms per loop so groupby , transform faster on dataset, expect faster on real dataset scale better.
for 800,000 row df following timings:
1 loops, best of 3: 611 ms per loop 1 loops, best of 3: 438 ms per loop
Comments
Post a Comment