performance - R: efficiently identifying highest N values of variable Z by group X
I have a data.table that looks like this:
    library(data.table)

    id <- c(rep("abc", 4), rep("def", 4), rep("ghi", 5))
    x <- c(rep(c(1, 2, 3, 4), 3), 5)
    set.seed(1234)
    z <- runif(13, min = 0, max = 1)
    a <- data.table(id, x, z)
    a

         id x           z
     1: abc 1 0.113703411
     2: abc 2 0.622299405
     3: abc 3 0.609274733
     4: abc 4 0.623379442
     5: def 1 0.860915384
     6: def 2 0.640310605
     7: def 3 0.009495756
     8: def 4 0.232550506
     9: ghi 1 0.666083758
    10: ghi 2 0.514251141
    11: ghi 3 0.693591292
    12: ghi 4 0.544974836
    13: ghi 5 0.282733584
I'd like to produce a data frame that has the N highest values of z within each x subgroup. Let's say N is 2. I'd like to end up with a dataset that looks like this:
       x  id         z
    1: 1 def 0.8609154
    2: 1 ghi 0.6660838
    3: 2 def 0.6403106
    4: 2 abc 0.6222994
    5: 3 ghi 0.6935913
    6: 3 abc 0.6092747
    7: 4 abc 0.6233794
    8: 4 ghi 0.5449748
    9: 5 ghi 0.2827336
I've been using the following lines to achieve it, but I've found them particularly slow when the data.table is large (i.e. 1,500,000 rows or more).
    top_n <- 2
    # sort by x, then by z in descending order
    a <- a[order(a$x, -a$z), ]
    # keep the first top_n rows of each x group
    a_2 <- a[, head(.SD, top_n), by = x]
    a_2

       x  id         z
    1: 1 def 0.8609154
    2: 1 ghi 0.6660838
    3: 2 def 0.6403106
    4: 2 abc 0.6222994
    5: 3 ghi 0.6935913
    6: 3 abc 0.6092747
    7: 4 abc 0.6233794
    8: 4 ghi 0.5449748
    9: 5 ghi 0.2827336
Any help appreciated!
Thanks!
This should be faster than using .SD:
    n <- 2
    # collect the row numbers of the first n rows per x group, in descending z order
    indx <- a[order(-z), .I[seq_len(n)], by = x]$V1
    a[indx]
    #      id  x         z
    #  1: def  1 0.8609154
    #  2: ghi  1 0.6660838
    #  3: ghi  3 0.6935913
    #  4: abc  3 0.6092747
    #  5: def  2 0.6403106
    #  6: abc  2 0.6222994
    #  7: abc  4 0.6233794
    #  8: ghi  4 0.5449748
    #  9: ghi  5 0.2827336
    # 10:  NA NA        NA
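The trailing NA row above appears because the group x == 5 has only one row, so `.I[seq_len(n)]` asks for a second row number that does not exist. A minimal sketch of one way to avoid that padding, assuming you would rather get fewer than n rows for small groups, is to take `head(.I, n)` instead. The name `res` is only illustrative.

```r
library(data.table)

# Same example data as in the question
id <- c(rep("abc", 4), rep("def", 4), rep("ghi", 5))
x <- c(rep(c(1, 2, 3, 4), 3), 5)
set.seed(1234)
z <- runif(13, min = 0, max = 1)
a <- data.table(id, x, z)

n <- 2
# head(.I, n) returns at most n row numbers per group, so a group with
# fewer than n rows (here x == 5) is not padded with NA
indx <- a[order(-z), head(.I, n), by = x]$V1
res <- a[indx]
res
```

An alternative with the same effect is to keep the original expression and drop the missing indices afterwards, for example with `a[na.omit(indx)]`.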
If you need an ordered result, this should also be fast:
    # sort a by reference: x ascending, z descending
    setorder(a, x, -z)
    indx <- a[, .I[seq_len(n)], by = x]$V1
    a[indx]
    #      id  x         z
    #  1: def  1 0.8609154
    #  2: ghi  1 0.6660838
    #  3: def  2 0.6403106
    #  4: abc  2 0.6222994
    #  5: ghi  3 0.6935913
    #  6: abc  3 0.6092747
    #  7: abc  4 0.6233794
    #  8: ghi  4 0.5449748
    #  9: ghi  5 0.2827336
    # 10:  NA NA        NA
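To check the speed claim on your own machine, something like the following rough sketch may help. It builds a hypothetical synthetic table (the sizes, the seed, and the names `big`, `res_sd`, and `res_i` are arbitrary assumptions, not from the question), runs both approaches on the same pre-sorted data, and confirms that they pick the same rows. It uses `head(.I, n)` rather than `.I[seq_len(n)]` so that small groups are not padded with NA and the two results are directly comparable. No timings are claimed here, since they depend on the data and the data.table version.

```r
library(data.table)

# Hypothetical synthetic data: 200,000 rows spread over up to 5,000 groups
set.seed(42)
big <- data.table(x = sample(5000L, 2e5, replace = TRUE), z = runif(2e5))
n <- 2

# Sort once by reference so each group is in descending z order
setorder(big, x, -z)

# Approach from the question: head(.SD, n) per group
t_sd <- system.time(res_sd <- big[, head(.SD, n), by = x])

# Approach from the answer: gather row numbers with .I, then subset once
t_i <- system.time(res_i <- big[big[, head(.I, n), by = x]$V1])

# Elapsed seconds for each approach
print(c(SD = t_sd[["elapsed"]], I = t_i[["elapsed"]]))

# Both approaches should select exactly the same rows
stopifnot(identical(res_sd$x, res_i$x), identical(res_sd$z, res_i$z))
```

For a more careful comparison, repeat each call several times and increase the row count toward the 1,500,000 rows mentioned in the question.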