substr - Identify a continuously occurring stretch of specific letters in a string using R


I want to identify whether the string column in the data frame below repeats the letters "v" or "g" at least 5 times in a row within the first 20 characters of the string.

Sample data:

    data = data.frame(
      class = c('a', 'b', 'c'),
      string = c("asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj",
                 "aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd",
                 "grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer")
    )

For example, the string in the first row has "vvvvg" within the first 20 character positions. The string in the third row has "vvggv".

    data
    #  class                                                  string
    #1     a asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj
    #2     b      aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd
    #3     c       grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer

The desired output should look like this:

    #   class                                                  string result
    # 1     a asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj   TRUE
    # 2     b      aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd  FALSE
    # 3     c       grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer   TRUE

Similar to akrun's answer:

    transform(data, result=grepl("[vg]{5,}", substr(string, 1, 20)))

This produces:

      class                                                  string result
    1     a asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj   TRUE
    2     b      aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd  FALSE
    3     c       grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer   TRUE

Here we use grepl combined with a character class that matches either "g" or "v" ([vg]), repeated 5 or more times ({5,}). substr(string, 1, 20) limits the search to the first 20 characters. transform creates a new data frame with columns either added or modified.
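To make the pattern's behaviour concrete, here is a minimal sketch on a few made-up strings (the vectors `x` and `late` are hypothetical and not part of the question's data). It shows that a mixed run such as "vvgvg" counts, that two separate runs of four do not, and why the substr call matters when the run starts late in the string.

```r
# Hypothetical test strings, not taken from the question's data
x <- c("aavvgvgaa",   # one mixed run of five: "vvgvg"
       "vgvgaavgvg",  # two runs of four, separated by "aa"
       "ggggg")       # one run of exactly five
hits <- grepl("[vg]{5,}", x)
hits

# A run that starts at position 19 lies mostly outside the first 20 characters
late <- paste0(strrep("a", 18), "vvvvv")
whole_string <- grepl("[vg]{5,}", late)                  # searches all 23 characters
first_twenty <- grepl("[vg]{5,}", substr(late, 1, 20))   # searches characters 1 to 20 only
c(whole_string = whole_string, first_twenty = first_twenty)
```

Without substr the late run would be reported as a match, which is not what the question asks for.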


Edit: benchmarks against Matthew's creative answer:

    set.seed(1)
    # 1e5 random strings of length 20 to 300 over the letters v, g, a, s
    string <- vapply(
      replicate(1e5, sample(c("v", "g", "a", "s"), sample(20:300, 1), rep=TRUE)),
      paste0, character(1L), collapse=""
    )
    library(microbenchmark)
    microbenchmark(
      grepl("[vg]{5,}", substr(string, 1, 20)),
      grepl("^.{,15}[vg]{5,}", string),
      times=10
    )

This produces:

    Unit: milliseconds
                                         expr      min       lq     mean
     grepl("[vg]{5,}", substr(string, 1, 20)) 131.6668 131.8343 133.6644
             grepl("^.{,15}[vg]{5,}", string) 299.7326 300.4416 302.5065

I wasn't entirely sure what to expect, but I guess this makes sense since substr is simple to apply. The times are close if the pattern has 5 repeats near the front of the string.
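As a sanity check that the two benchmarked expressions answer the same question, the sketch below compares them on the three sample strings from the question. It writes the anchored pattern with an explicit lower bound, `{0,15}`, instead of `{,15}`; that substitution is an assumption made here for portability, since not every regex engine accepts an omitted lower bound. The idea is that a run of five ending by position 20 must start at position 16 or earlier, so at most 15 characters can precede it.

```r
# The three sample strings from the question
strings <- c(
  "asadsasavvvvgvgggsdasssdddfgdfghfghfgggggddffddfgdfgtyj",
  "aweertgvthrgefgdfsdfsggggggdawsdfaasdadaadwerweqwd",
  "grtvvggvvvggswergervgegddfasdggvqweqweqwereryryer"
)

# Approach 1: cut the string to 20 characters, then look for the run
via_substr <- grepl("[vg]{5,}", substr(strings, 1, 20))

# Approach 2: allow at most 15 leading characters before the run
# ({0,15} is used here in place of the {,15} from the benchmark)
via_anchor <- grepl("^.{0,15}[vg]{5,}", strings)

identical(via_substr, via_anchor)
```

On these strings both approaches flag the first and third rows and not the second, matching the desired output.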

