data.table - How to solve gaps and island problems in R and performance vs SQL? -

i wondering whether island , gaps problems can solved in r efficiently, similar sql. have following data available, if examine 1 id:

id startdate  starttime enddate      endtime  1  19-05-2014 19:00     19-05-2014   20:00 1  19-05-2014 19:30     19-05-2014   23:30 1  19-05-2014 16:00     19-05-2014   18:00 1  20-05-2014 20:00     20-05-2014   20:30

notice first 2 rows overlap, do, merge overlapping rows, resulting:

id startdate  starttime enddate      endtime  1  19-05-2014 19:00     19-05-2014   23:30 1  19-05-2014 16:00     19-05-2014   18:00 1  20-05-2014 20:00     20-05-2014   20:30

is there way in r?

i aware done in sql, since data in r, prefer in r. second, have questions regarding performance of finding gaps , islands, know sql fast in doing that, wonder whether r faster due data being in memory.

i use data.table this, don't know how.

update - response arun

i have created following test case contains every possible interval orientation.

dat <- structure(   list(id = c(1l, 1l, 1l, 1l, 1l, 1l),         stime = structure(c(as.posixct("2014-01-15 08:00:00"),                            as.posixct("2014-01-15 10:00:00"),                            as.posixct("2014-01-15 08:30:00"),                            as.posixct("2014-01-15 09:00:00"),                            as.posixct("2014-01-15 11:30:00"),                            as.posixct("2014-01-15 12:00:00")),                          class = c("posixct", "posixt"), tzone = ""),        etime = structure(c(as.posixct("2014-01-15 09:30:00"),                            as.posixct("2014-01-15 11:00:00"),                            as.posixct("2014-01-15 10:00:00"),                             as.posixct("2014-01-15 09:30:00"),                            as.posixct("2014-01-15 12:30:00"),                            as.posixct("2014-01-15 13:00:00")),                           class = c("posixct", "posixt"), tzone = "")   ),   .names = c("id", "stime", "etime"),   sorted = c("id", "stime", "etime"),   class = c("data.table", "data.frame"),   row.names = c(na,-6l) )

i expect interval 8:30 - 10:00 "glued" onto 10:00 - 11:00, not case. result was:

   idx id               stime               etime 1:   4  1 2014-01-15 08:00:00 2014-01-15 10:00:00 2:   3  1 2014-01-15 10:00:00 2014-01-15 11:00:00 3:   6  1 2014-01-15 11:30:00 2014-01-15 13:00:00

the following data set provides more thorough testing:

# numbers represent seconds 1970-01-01 01:00:01 dat <- structure(   list(id = c(1l, 1l, 1l, 1l, 1l, 1l, 2l, 2l, 2l, 2l, 2l, 2l, 2l),         stime = structure(c(as.posixct("2014-01-15 08:00:00"),                            as.posixct("2014-01-15 10:00:00"),                            as.posixct("2014-01-15 08:30:00"),                            as.posixct("2014-01-15 09:00:00"),                            as.posixct("2014-01-15 11:30:00"),                            as.posixct("2014-01-15 12:00:00"),                            as.posixct("2014-01-15 07:30:00"),                            as.posixct("2014-01-15 08:00:00"),                            as.posixct("2014-01-15 08:30:00"),                            as.posixct("2014-01-15 09:00:00"),                            as.posixct("2014-01-15 09:00:00"),                            as.posixct("2014-01-15 09:30:00"),                            as.posixct("2014-01-15 10:00:00")                            ),                          class = c("posixct", "posixt"), tzone = ""),        etime = structure(c(as.posixct("2014-01-15 09:30:00"),                            as.posixct("2014-01-15 11:00:00"),                            as.posixct("2014-01-15 10:00:00"),                             as.posixct("2014-01-15 09:30:00"),                            as.posixct("2014-01-15 12:30:00"),                            as.posixct("2014-01-15 13:00:00"),                            as.posixct("2014-01-15 08:30:00"),                            as.posixct("2014-01-15 09:00:00"),                            as.posixct("2014-01-15 09:30:00"),                            as.posixct("2014-01-15 10:00:00"),                            as.posixct("2014-01-15 10:00:00"),                            as.posixct("2014-01-15 10:30:00"),                            as.posixct("2014-01-15 11:00:00")                            ),                           class = c("posixct", "posixt"), tzone = "")   ),   .names = c("id", "stime", "etime"),   sorted = c("id", "stime", "etime"),   class = c("data.table", "data.frame"),   row.names = c(na,-6l) )

so our result is:

   idx id               stime               etime 1:   4  1 2014-01-15 08:00:00 2014-01-15 10:00:00 2:   3  1 2014-01-15 10:00:00 2014-01-15 11:00:00 3:   6  1 2014-01-15 11:30:00 2014-01-15 13:00:00 4:  12  2 2014-01-15 07:30:00 2014-01-15 09:30:00 5:  13  2 2014-01-15 09:00:00 2014-01-15 11:00:00

now respondent id=2, see intervals overlapping, not reported 1 interval. correct solution be:

   idx id               stime               etime 1:   ?  1 2014-01-15 08:00:00 2014-01-15 11:00:00 3:   ?  1 2014-01-15 11:30:00 2014-01-15 13:00:00 4:  ??  2 2014-01-15 07:30:00 2014-01-15 11:00:00

update - benchmarks , testing , large datasets

i have following dataset 1000 users, each having 500 durations, giving 0.5 million rows. can download dataset @ google drive, including solution in google drive.

sql server 2014 on laptop of 8gb ram, 64-bit, i5-4210u cpu @ 1.70ghz - 2.39ghz takes 5 seconds using solution provided itzik ben-gan in sql. 5 seconds excluding process of creating function. in addition, no indices created table whatsoever.

ps: use library(lubridate);

here's simple idea. order start time, find cumulative max of end time. once you've done that, overlap groups next start time still less or equal current cumulative max end time (all done id):

setorder(dat, id, stime) # ordering id unnecessary, it's prettier  dat[, etime.max := as.posixct(cummax(as.numeric(etime)), origin = '1970-01-01'), = id]  # find grouping of intervals (1:.n hack avoid warnings when .n=1) dat[, grp := cumsum(c(false, stime[2:.n] > etime.max[1:(.n-1)]))[1:.n], = id]  dat[, .(stime = min(stime), etime = max(etime)), = .(id, grp)][, grp := null][] #   id               stime               etime #1:  1 2014-01-15 08:00:00 2014-01-15 11:00:00 #2:  1 2014-01-15 11:30:00 2014-01-15 13:00:00 #3:  2 2014-01-15 07:30:00 2014-01-15 11:00:00

since doesn't need find possible overlaps, it's fast. on simulated data set matches op's description it's instantaneous me (< 0.2s).

Search This Blog

Script

data.table - How to solve gaps and island problems in R and performance vs SQL? -

Comments

Post a Comment