



R: fastest way to extract all substrings contained between two substrings

I am looking for an efficient way to extract all matches between two substrings in a character string. For example, say I want to extract all substrings contained between the string

start="strt"

and

stop="stp"

in the string

x="strt111stpblablastrt222stp"

I would like to get the vector

"111" "222"

What is the most efficient way to do this in R? Using a regular expression perhaps? Or are there better ways?


For something this simple, base R works just fine.

You can switch on PCRE with perl=TRUE and use lookaround assertions.

x <- 'strt111stpblablastrt222stp'
regmatches(x, gregexpr('(?<=strt).*?(?=stp)', x, perl=TRUE))[[1]]
# [1] "111" "222"

Explanation:

(?<=          # look behind to see if there is:
  strt        #   'strt'
)             # end of look-behind
.*?           # any character except \n (0 or more times, as few as possible)
(?=           # look ahead to see if there is:
  stp         #   'stp'
)             # end of look-ahead
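
The pattern above hard-codes the two delimiters. A minimal sketch of how it could be wrapped in a reusable function follows. `extract_between` is a hypothetical name, not a base R function, and quoting the delimiters with \Q...\E is an assumption added here so that delimiters containing regex metacharacters are matched literally. It would still break on a delimiter that itself contains \E.

```r
# Hypothetical helper (not part of base R): return everything found
# between the literal delimiters `start` and `stop`, one list element
# per input string.
extract_between <- function(x, start, stop) {
  # \Q...\E tells PCRE to treat the enclosed text literally
  pattern <- paste0("(?<=\\Q", start, "\\E).*?(?=\\Q", stop, "\\E)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

extract_between("strt111stpblablastrt222stp", "strt", "stp")[[1]]
# [1] "111" "222"

# Delimiters that are regex metacharacters are matched literally
extract_between("a[1]b[22]", "[", "]")[[1]]
# [1] "1"  "22"
```

Because gregexpr is vectorised over x, the same helper should also work on a character vector, returning one element per input string.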

EDIT: Updated the answers below to the new syntax.

You may also consider using the stringi package.

library(stringi)
x <- 'strt111stpblablastrt222stp'
stri_extract_all_regex(x, '(?<=strt).*?(?=stp)')[[1]]
# [1] "111" "222"

Another option is rm_between from the qdapRegex package.

library(qdapRegex)
x <- 'strt111stpblablastrt222stp'
rm_between(x, 'strt', 'stp', extract=TRUE)[[1]]
# [1] "111" "222"

If speed of string processing in R is what matters, the package to reach for is stringi:

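 library(stringi)
 library(stringr)
 library(microbenchmark)

 x <- "strt111stpblablastrt222stp"
 # base R: lazy match between the two lookarounds
 hwnd <- function(x1) regmatches(x1, gregexpr('(?<=strt).*?(?=stp)', x1, perl=TRUE))
 # base R: consume characters as long as 'stp' does not start at the current position
 Tim <- function(x1) regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE))
 # older stringr releases wrapped the pattern in perl(); current ones take it directly
 stringr <- function(x1) str_extract_all(x1, '(?<=strt).*?(?=stp)')
 # requires qdap; defined for completeness but not benchmarked here
 akrun <- function(x1) genXtract(x1, "strt", "stp")
 stringi <- function(x1) stri_extract_all_regex(x1, '(?<=strt).*?(?=stp)')

 microbenchmark(stringi(x), hwnd(x), Tim(x), stringr(x))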
Unit: microseconds
       expr     min       lq  median       uq     max neval
 stringi(x)  46.778  58.1030  64.017  67.3485 123.398   100
    hwnd(x)  61.498  73.1095  79.084  85.5190 111.757   100
     Tim(x)  60.243  74.6830  80.755  86.3370 102.678   100
 stringr(x) 236.081 261.9425 272.115 279.6750 440.036   100

Unfortunately I couldn't test @akrun's solution because the qdap package failed to install for me. His is the only solution that looks like it could beat stringi.
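
The two base R patterns benchmarked above (hwnd and Tim) return the same result on this input, but they are not interchangeable in general. The sketch below, using base R only, illustrates one difference. The variable names and the unterminated test string x2 are made up for illustration.

```r
x <- "strt111stpblablastrt222stp"

lazy     <- "(?<=strt).*?(?=stp)"      # lazy match, requires a closing 'stp'
tempered <- "(?<=strt)(?:(?!stp).)*"   # stops before 'stp' OR at end of string

res_lazy <- regmatches(x, gregexpr(lazy, x, perl = TRUE))[[1]]
res_temp <- regmatches(x, gregexpr(tempered, x, perl = TRUE))[[1]]
identical(res_lazy, res_temp)
# [1] TRUE

# With no closing delimiter the two patterns disagree
x2 <- "strt111"
regmatches(x2, gregexpr(lazy, x2, perl = TRUE))[[1]]
# character(0)
regmatches(x2, gregexpr(tempered, x2, perl = TRUE))[[1]]
# [1] "111"
```

If unterminated sections must be ignored, the lazy pattern is the safer of the two. Any speed difference between them is best measured on your own data.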


You may also consider:

library(qdap)
unname(genXtract(x, "strt", "stp"))
#[1] "111" "222"

Speed comparison

 library(stringr)
 library(qdap)

 x1 <- rep(x, 1e5)
 system.time(res1 <- regmatches(x1, gregexpr('(?<=strt).*?(?=stp)', x1, perl=TRUE)))
 #   user  system elapsed
 #  2.187   0.000   2.015

 system.time(res2 <- regmatches(x1, gregexpr("(?<=strt)(?:(?!stp).)*", x1, perl=TRUE)))
 #   user  system elapsed
 #  1.902   0.000   1.780

 # timed with the older perl('...') wrapper; current stringr takes the pattern directly
 system.time(res3 <- str_extract_all(x1, '(?<=strt).*?(?=stp)'))
 #   user  system elapsed
 #  6.990   0.000   6.636

 system.time(res4 <- genXtract(x1, "strt", "stp")) ## setNames(genXtract(...), NULL) is a bit slower
 #   user  system elapsed
 #  1.457   0.000   1.414

 names(res4) <- NULL
 identical(res1, res4)
 # [1] TRUE



