substr - Identify a continuously occurring stretch of specific letters in a string using R
I want to identify whether the string column in the data frame below repeats the letters "v" or "g" at least 5 times in a row within the first 20 characters of the string.
Sample data:
data = data.frame(class = c('a','b','c'), string = c("asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj", "aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd", "grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer"))
For example, the string in the first row has "vvvvg" within the first 20 character positions. The string in the third row has "vvggv".
data
#   class                                                  string
# 1     a asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj
# 2     b      aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd
# 3     c       grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer
The desired output should look like this:
#   class                                                  string result
# 1     a asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj   TRUE
# 2     b      aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd  FALSE
# 3     c       grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer   TRUE
Similar to akrun's answer:
transform(data, result=grepl("[vg]{5,}", substr(string, 1, 20)))
Produces:
  class                                                  string result
1     a asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj   TRUE
2     b      aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd  FALSE
3     c       grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer   TRUE
Here we use grepl combined with a character class that matches either "g" or "v" ([vg]) repeated 5 or more times ({5,}). transform creates a new data frame with either added or modified columns.
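To see which run the pattern actually picks up, here is a small sketch on the three sample strings from the question. It only uses base R (substr, grepl, regexpr, regmatches); the variable names strings, first20, has_run and runs are made up for illustration.

```r
# Sample strings from the question
strings <- c(
  "asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj",
  "aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd",
  "grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer"
)

# Keep only the first 20 characters of each string
first20 <- substr(strings, 1, 20)

# TRUE where a run of 5 or more "v"/"g" characters occurs in that prefix
has_run <- grepl("[vg]{5,}", first20)

# Extract the first matching run; NA where there is no match.
# regmatches() with regexpr() returns only the matched elements,
# so they are assigned back by logical index.
runs <- rep(NA_character_, length(first20))
runs[has_run] <- regmatches(first20, regexpr("[vg]{5,}", first20))

has_run  # TRUE FALSE TRUE
runs     # "vvvvgvggg" NA "vvggvvvgg"
```

The second string does contain "gggggg", but that run starts at position 22, so it falls outside the 20-character prefix and is not counted.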
Edit: benchmarks against Matthew's creative answer:
set.seed(1)
# 1e5 random strings of length 20 to 300 drawn from "v", "g", "a", "s"
string <- vapply(
  replicate(1e5, sample(c("v", "g", "a", "s"), sample(20:300, 1), rep=TRUE)),
  paste0, character(1L), collapse=""
)
library(microbenchmark)
microbenchmark(
  grepl("[vg]{5,}", substr(string, 1, 20)),
  grepl("^.{,15}[vg]{5,}", string),
  times=10
)
Produces:
Unit: milliseconds
                                     expr      min       lq     mean
 grepl("[vg]{5,}", substr(string, 1, 20)) 131.6668 131.8343 133.6644
         grepl("^.{,15}[vg]{5,}", string) 299.7326 300.4416 302.5065
I wasn't entirely sure what to expect, but I guess this makes sense since substr is simple to apply. The times are close if the pattern has 5 repeats near the front of the string.
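Before comparing timings it is worth confirming that the two expressions agree. The sketch below does that on a smaller random sample built the same way as the benchmark data. The seed, the sample size of 1000 and the names small, by_substr and by_anchor are arbitrary choices for illustration, and the quantifier is spelled {0,15} rather than {,15}, on the assumption that the explicit form is handled the same way by more regex engines.

```r
set.seed(42)  # arbitrary seed, only so this sketch is reproducible

# 1000 random strings of length 20 to 300 drawn from "v", "g", "a", "s"
small <- vapply(
  replicate(1000,
            sample(c("v", "g", "a", "s"), sample(20:300, 1), replace = TRUE),
            simplify = FALSE),
  paste0, character(1L), collapse = ""
)

# Approach 1: cut to the first 20 characters, then look for the run
by_substr <- grepl("[vg]{5,}", substr(small, 1, 20))

# Approach 2: anchor at the start and allow at most 15 characters before the run
by_anchor <- grepl("^.{0,15}[vg]{5,}", small)

# Both approaches should flag exactly the same strings
identical(by_substr, by_anchor)
```

The two should agree because a run of five that fits inside the first 20 characters has to start at position 16 or earlier, which is the same as having at most 15 characters in front of it.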