performance - R: Efficiently identifying the highest N values of variable Z by group X


I have a data.table that looks like this:

    library(data.table)

    id <- c(rep("abc", 4), rep("def", 4), rep("ghi", 5))
    x  <- c(rep(c(1, 2, 3, 4), 3), 5)
    set.seed(1234)
    z  <- runif(13, min = 0, max = 1)
    a  <- data.table(id, x, z)
    a

         id x           z
     1: abc 1 0.113703411
     2: abc 2 0.622299405
     3: abc 3 0.609274733
     4: abc 4 0.623379442
     5: def 1 0.860915384
     6: def 2 0.640310605
     7: def 3 0.009495756
     8: def 4 0.232550506
     9: ghi 1 0.666083758
    10: ghi 2 0.514251141
    11: ghi 3 0.693591292
    12: ghi 4 0.544974836
    13: ghi 5 0.282733584

I'd like to produce a data frame that has the n highest values of z within each x subgroup. Let's say n is 2. I'd like to end up with a dataset that looks like this:

       x  id         z
    1: 1 def 0.8609154
    2: 1 ghi 0.6660838
    3: 2 def 0.6403106
    4: 2 abc 0.6222994
    5: 3 ghi 0.6935913
    6: 3 abc 0.6092747
    7: 4 abc 0.6233794
    8: 4 ghi 0.5449748
    9: 5 ghi 0.2827336

I've been using the following lines to achieve it, but I've found this particularly slow when the data.table is large (i.e. 1,500,000 rows or more).

    top_n <- 2
    # Sort by x, then by descending z
    a <- a[order(a$x, -a$z), ]
    # Keep the first top_n rows of each x group
    a_2 <- a[, head(.SD, top_n), by = x]
    a_2

       x  id         z
    1: 1 def 0.8609154
    2: 1 ghi 0.6660838
    3: 2 def 0.6403106
    4: 2 abc 0.6222994
    5: 3 ghi 0.6935913
    6: 3 abc 0.6092747
    7: 4 abc 0.6233794
    8: 4 ghi 0.5449748
    9: 5 ghi 0.2827336

Any help appreciated!

Thanks!

This should be faster than using .SD:

    n <- 2
    # Collect the row indices of the first n rows per x group, in descending z order
    indx <- a[order(-z), .I[seq_len(n)], by = x]$V1
    a[indx]
    #      id  x         z
    #  1: def  1 0.8609154
    #  2: ghi  1 0.6660838
    #  3: ghi  3 0.6935913
    #  4: abc  3 0.6092747
    #  5: def  2 0.6403106
    #  6: abc  2 0.6222994
    #  7: abc  4 0.6233794
    #  8: ghi  4 0.5449748
    #  9: ghi  5 0.2827336
    # 10:  NA NA        NA
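To check the speed claim on your own data, a rough timing sketch might look like the following. The table `b`, its size, and the number of groups are assumptions made up for illustration, not the asker's data. Timings depend heavily on the data.table version, the hardware, and the number of groups, so no results are shown here.

```r
library(data.table)

set.seed(1)
n_rows   <- 1e6   # assumed table size, for illustration only
n_groups <- 1e4   # assumed number of distinct x values
n        <- 2

# Hypothetical test table with the same columns as in the question
b <- data.table(
  id = sample(c("abc", "def", "ghi"), n_rows, replace = TRUE),
  x  = sample.int(n_groups, n_rows, replace = TRUE),
  z  = runif(n_rows)
)

# Approach from the question: sort, then head(.SD, n) per group
t_sd <- system.time(
  res_sd <- b[order(x, -z)][, head(.SD, n), by = x]
)

# Approach from the answer: collect row indices with .I, then subset once
t_i <- system.time({
  indx  <- b[order(-z), .I[seq_len(n)], by = x]$V1
  res_i <- b[indx]
})

# Elapsed seconds for each approach
print(c(SD = t_sd[["elapsed"]], I = t_i[["elapsed"]]))
```

Running it a few times, or with a dedicated benchmarking package, gives a more reliable comparison than a single `system.time()` call.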

If you need an ordered result, this should be fast too:

    # Sort in place by x, then by descending z
    setorder(a, x, -z)
    indx <- a[, .I[seq_len(n)], by = x]$V1
    a[indx]
    #      id  x         z
    #  1: def  1 0.8609154
    #  2: ghi  1 0.6660838
    #  3: def  2 0.6403106
    #  4: abc  2 0.6222994
    #  5: ghi  3 0.6935913
    #  6: abc  3 0.6092747
    #  7: abc  4 0.6233794
    #  8: ghi  4 0.5449748
    #  9: ghi  5 0.2827336
    # 10:  NA NA        NA
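Both outputs above end with an all-NA row. Group x = 5 has only one row, so `seq_len(n)` asks for a second index that does not exist. A minimal sketch of one way around this, assuming the same example table `a`, is to take `head(.I, n)` instead, which returns at most n indices per group:

```r
library(data.table)

# Rebuild the example table from the question
id <- c(rep("abc", 4), rep("def", 4), rep("ghi", 5))
x  <- c(rep(c(1, 2, 3, 4), 3), 5)
set.seed(1234)
z  <- runif(13, min = 0, max = 1)
a  <- data.table(id, x, z)

n <- 2

# Sort in place by x, then by descending z
setorder(a, x, -z)

# head(.I, n) yields at most n row indices per group, so a group with
# fewer than n rows (x = 5 here) contributes no NA index
indx <- a[, head(.I, n), by = x]$V1
res  <- a[indx]
res
```

With this table, `res` has 9 rows: two for each of x = 1 to 4 and one for x = 5. Dropping the NA indices afterwards, for example with `a[na.omit(indx)]`, would be another option if you prefer to keep `seq_len(n)`.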
