data.table - How to solve gaps and island problems in R and performance vs SQL? -
i wondering whether island , gaps problems can solved in r efficiently, similar sql. have following data available, if examine 1 id:
id startdate starttime enddate endtime 1 19-05-2014 19:00 19-05-2014 20:00 1 19-05-2014 19:30 19-05-2014 23:30 1 19-05-2014 16:00 19-05-2014 18:00 1 20-05-2014 20:00 20-05-2014 20:30 notice first 2 rows overlap, do, merge overlapping rows, resulting:
id startdate starttime enddate endtime 1 19-05-2014 19:00 19-05-2014 23:30 1 19-05-2014 16:00 19-05-2014 18:00 1 20-05-2014 20:00 20-05-2014 20:30 is there way in r?
i aware done in sql, since data in r, prefer in r. second, have questions regarding performance of finding gaps , islands, know sql fast in doing that, wonder whether r faster due data being in memory.
i use data.table this, don't know how.
update - response arun
i have created following test case contains every possible interval orientation.
dat <- structure( list(id = c(1l, 1l, 1l, 1l, 1l, 1l), stime = structure(c(as.posixct("2014-01-15 08:00:00"), as.posixct("2014-01-15 10:00:00"), as.posixct("2014-01-15 08:30:00"), as.posixct("2014-01-15 09:00:00"), as.posixct("2014-01-15 11:30:00"), as.posixct("2014-01-15 12:00:00")), class = c("posixct", "posixt"), tzone = ""), etime = structure(c(as.posixct("2014-01-15 09:30:00"), as.posixct("2014-01-15 11:00:00"), as.posixct("2014-01-15 10:00:00"), as.posixct("2014-01-15 09:30:00"), as.posixct("2014-01-15 12:30:00"), as.posixct("2014-01-15 13:00:00")), class = c("posixct", "posixt"), tzone = "") ), .names = c("id", "stime", "etime"), sorted = c("id", "stime", "etime"), class = c("data.table", "data.frame"), row.names = c(na,-6l) ) i expect interval 8:30 - 10:00 "glued" onto 10:00 - 11:00, not case. result was:
idx id stime etime 1: 4 1 2014-01-15 08:00:00 2014-01-15 10:00:00 2: 3 1 2014-01-15 10:00:00 2014-01-15 11:00:00 3: 6 1 2014-01-15 11:30:00 2014-01-15 13:00:00 the following data set provides more thorough testing:
# numbers represent seconds 1970-01-01 01:00:01 dat <- structure( list(id = c(1l, 1l, 1l, 1l, 1l, 1l, 2l, 2l, 2l, 2l, 2l, 2l, 2l), stime = structure(c(as.posixct("2014-01-15 08:00:00"), as.posixct("2014-01-15 10:00:00"), as.posixct("2014-01-15 08:30:00"), as.posixct("2014-01-15 09:00:00"), as.posixct("2014-01-15 11:30:00"), as.posixct("2014-01-15 12:00:00"), as.posixct("2014-01-15 07:30:00"), as.posixct("2014-01-15 08:00:00"), as.posixct("2014-01-15 08:30:00"), as.posixct("2014-01-15 09:00:00"), as.posixct("2014-01-15 09:00:00"), as.posixct("2014-01-15 09:30:00"), as.posixct("2014-01-15 10:00:00") ), class = c("posixct", "posixt"), tzone = ""), etime = structure(c(as.posixct("2014-01-15 09:30:00"), as.posixct("2014-01-15 11:00:00"), as.posixct("2014-01-15 10:00:00"), as.posixct("2014-01-15 09:30:00"), as.posixct("2014-01-15 12:30:00"), as.posixct("2014-01-15 13:00:00"), as.posixct("2014-01-15 08:30:00"), as.posixct("2014-01-15 09:00:00"), as.posixct("2014-01-15 09:30:00"), as.posixct("2014-01-15 10:00:00"), as.posixct("2014-01-15 10:00:00"), as.posixct("2014-01-15 10:30:00"), as.posixct("2014-01-15 11:00:00") ), class = c("posixct", "posixt"), tzone = "") ), .names = c("id", "stime", "etime"), sorted = c("id", "stime", "etime"), class = c("data.table", "data.frame"), row.names = c(na,-6l) ) so our result is:
idx id stime etime 1: 4 1 2014-01-15 08:00:00 2014-01-15 10:00:00 2: 3 1 2014-01-15 10:00:00 2014-01-15 11:00:00 3: 6 1 2014-01-15 11:30:00 2014-01-15 13:00:00 4: 12 2 2014-01-15 07:30:00 2014-01-15 09:30:00 5: 13 2 2014-01-15 09:00:00 2014-01-15 11:00:00 now respondent id=2, see intervals overlapping, not reported 1 interval. correct solution be:
idx id stime etime 1: ? 1 2014-01-15 08:00:00 2014-01-15 11:00:00 3: ? 1 2014-01-15 11:30:00 2014-01-15 13:00:00 4: ?? 2 2014-01-15 07:30:00 2014-01-15 11:00:00 update - benchmarks , testing , large datasets
i have following dataset 1000 users, each having 500 durations, giving 0.5 million rows. can download dataset @ google drive, including solution in google drive.
sql server 2014 on laptop of 8gb ram, 64-bit, i5-4210u cpu @ 1.70ghz - 2.39ghz takes 5 seconds using solution provided itzik ben-gan in sql. 5 seconds excluding process of creating function. in addition, no indices created table whatsoever.
ps: use library(lubridate);
here's simple idea. order start time, find cumulative max of end time. once you've done that, overlap groups next start time still less or equal current cumulative max end time (all done id):
setorder(dat, id, stime) # ordering id unnecessary, it's prettier dat[, etime.max := as.posixct(cummax(as.numeric(etime)), origin = '1970-01-01'), = id] # find grouping of intervals (1:.n hack avoid warnings when .n=1) dat[, grp := cumsum(c(false, stime[2:.n] > etime.max[1:(.n-1)]))[1:.n], = id] dat[, .(stime = min(stime), etime = max(etime)), = .(id, grp)][, grp := null][] # id stime etime #1: 1 2014-01-15 08:00:00 2014-01-15 11:00:00 #2: 1 2014-01-15 11:30:00 2014-01-15 13:00:00 #3: 2 2014-01-15 07:30:00 2014-01-15 11:00:00 since doesn't need find possible overlaps, it's fast. on simulated data set matches op's description it's instantaneous me (< 0.2s).
Comments
Post a Comment