regexpr, gregexpr

The regexpr function identifies where a pattern occurs within a character vector. Each element is searched separately, and only the first match in each element is reported. The gregexpr function reports every match in each element, so it returns a list (one entry per element) rather than a vector. Both functions return the start position of each match and, in the match.length attribute, its length, which is enough to extract the matched text with substr. When the pattern is not found, both the position and the match length are -1.
regexpr(pattern, text, ignore.case=FALSE)
  • pattern – A regular expression pattern.
  • text – The character vector to be searched, where each element is searched separately.
  • ignore.case – Whether to ignore case in the search.
gregexpr(pattern, text, ignore.case=FALSE)
  • pattern – A regular expression pattern.
  • text – The character vector to be searched, where each element is searched separately.
  • ignore.case – Whether to ignore case in the search.
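
To make the return values concrete, here is a minimal sketch on a made-up vector (the vector s is not part of the original examples). It shows the first-match behavior of regexpr, the all-matches behavior of gregexpr, the -1 result when nothing matches, and the effect of ignore.case.

```r
# Hypothetical input, chosen only for illustration.
s <- c("banana", "Cherry", "kiwi")

# regexpr: position of the FIRST match in each element (-1 if none).
first <- regexpr("an", s)
as.integer(first)                # 2 -1 -1
attr(first, "match.length")      # 2 -1 -1

# gregexpr: positions of ALL matches, one list entry per element.
all_hits <- gregexpr("an", s)
as.integer(all_hits[[1]])        # 2 4
as.integer(all_hits[[2]])        # -1

# ignore.case = TRUE lets a lower-case pattern match upper-case text.
as.integer(regexpr("c", s, ignore.case = TRUE))   # -1 1 -1
```

The as.integer calls only strip the attributes so that the positions print on their own.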

Example. Two HTML expressions are searched for information below. In the first example, I retrieve just the link text (Home). Notice that I adjust the start and stop arguments in substr to drop the > and < symbols. In the second example, I extract three pieces of information: the link, the alternative text, and the link text. Each pattern uses a bracket expression, which matches any one of the characters listed inside it. A dash must come first or last in the list if it is to be matched literally, since otherwise it denotes a range such as A-Z, and a few other characters also have special meanings inside brackets. The braces that follow give the minimum and maximum number of such characters allowed, so {1,50} accepts between 1 and 50 of them. The third argument, TRUE, is ignore.case, which lets A-Z match lower-case letters as well.
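The last example below compares regexpr and gregexpr. Here the results agree because no element contains more than one D, so the only visible difference is that gregexpr returns a list rather than a vector. Had an element contained several matches, gregexpr would have listed all of them, while regexpr would have reported only the first.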
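
Before the full transcript, here is a small sketch of the bracket and brace syntax on its own. The strings in x are invented for illustration. The pattern asks for id=, then 1 to 5 characters that are each an upper-case letter, a digit, or a dash, then a semicolon.

```r
# Hypothetical inputs, not taken from the HTML examples.
x <- c("id=A-17;", "id=zz;", "id=;")

# [A-Z0-9-] : any one upper-case letter, digit, or literal dash
#             (the dash is last, so it is not read as a range).
# {1,5}     : between 1 and 5 such characters in a row.
m <- regexpr("id=[A-Z0-9-]{1,5};", x)
as.integer(m)                    # 1 -1 -1
attr(m, "match.length")          # 8 -1 -1

# With ignore.case = TRUE the lower-case "zz" is accepted too;
# the third string still fails because at least one character is required.
m2 <- regexpr("id=[A-Z0-9-]{1,5};", x, ignore.case = TRUE)
as.integer(m2)                   # 1 1 -1
attr(m2, "match.length")         # 8 6 -1
```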
> x <- '<a href="index.php">Home</a>'
> y <- regexpr('>[A-Z0-9 ]{1,50}<', x, TRUE)
> y
[1] 20
attr(,"match.length")
[1] 6
> 
> z <- y + attr(y, "match.length")-1
> substr(x, y, z)
[1] ">Home<"
> 
> substr(x, y+1, z-1)
[1] "Home"
> 
> 
> x  <- '<a href="code/120221.R" alt="regexpr">Download Code</a>'
> y1 <- regexpr(">[A-Z0-9 ]{1,50}<", x, TRUE)
> z1 <- y1 + attr(y1, "match.length")-1
> substr(x, y1, z1)
[1] ">Download Code<"
> substr(x, y1+1, z1-1)
[1] "Download Code"
> 
> y2 <- regexpr('href="[A-Z0-9/._ -]{1,50}"', x, TRUE)
> z2 <- y2 + attr(y2, "match.length")-1
> substr(x, y2, z2)
[1] "href=\"code/120221.R\""
> substr(x, y2+6, z2-1)
[1] "code/120221.R"
> 
> y3 <- regexpr('alt="[A-Z0-9/._ -]{1,50}"', x, TRUE)
> z3 <- y3 + attr(y3, "match.length")-1
> substr(x, y3, z3)
[1] "alt=\"regexpr\""
> substr(x, y3+5, z3-1)
[1] "regexpr"
> 
> 
> x <- c("ABCDE", "CDEFG", "FGHIJ")
> regexpr("D", x)
[1]  4  2 -1
attr(,"match.length")
[1]  1  1 -1
> 
> gregexpr("D", x)
[[1]]
[1] 4
attr(,"match.length")
[1] 1

[[2]]
[1] 2
attr(,"match.length")
[1] 1

[[3]]
[1] -1
attr(,"match.length")
[1] -1
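
As an alternative to the position arithmetic used above, base R also provides regmatches, which extracts the matched text directly from a regexpr or gregexpr result. The sketch below assumes an R version that includes regmatches (2.14.0 or later). The vector v is a made-up variation on the last example in which one element contains two matches, so that the difference between the two functions is visible.

```r
x <- '<a href="code/120221.R" alt="regexpr">Download Code</a>'

# regmatches() returns the matched text for a regexpr() result.
m <- regexpr('href="[A-Z0-9/._ -]{1,50}"', x, ignore.case = TRUE)
regmatches(x, m)                 # "href=\"code/120221.R\""

# Hypothetical vector: the second element contains "D" twice.
v <- c("ABCDE", "DAD", "FGHIJ")
as.integer(regexpr("D", v))      # 4 1 -1  (first match only)

g <- gregexpr("D", v)
as.integer(g[[2]])               # 1 3     (every match)

# For a gregexpr() result, regmatches() returns one character vector
# per element; elements with no match give character(0).
hits <- regmatches(v, g)
```

Note that regmatches returns the whole match, so surrounding characters such as href=" still have to be trimmed afterwards, for instance with substr as in the examples above.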
Tip. A helpful regular expressions guide may be found here. If you have an alternative recommendation, please send an email or post a link below, especially for ground-up tutorials.
