The regexpr() function in R finds the first match of a regular expression in each element of a character vector. For each string it returns the position at which the match begins and the length of the match, or -1 if the string contains no match. regexpr() reports only the first match in each string (reading left to right); gregexpr() returns all of the matches in a string when there is more than one.
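For example, the following code uses regexpr() to find the first match of the regular expression <dd>[F|f]ound(.*?)</dd> in each of the first 10 strings in the homicides dataset. (Inside a character class, | is a literal character, so [F|f] matches F, f, or |; [Ff] is the stricter way to accept either case.)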
> regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:10])
[1] 177 178 188 189 178 182 178 187 182 183
attr(,"match.length")
[1] 33 33 33 33 33 33 33 33 33 33
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
The value returned by regexpr() is an integer vector in which each element is the character position where the match begins in the corresponding string (-1 if there is no match). The length of each match is attached to that vector as the match.length attribute.
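To make the two parts of the return value concrete, here is a minimal sketch on a made-up character vector. The vector x and the pattern are illustrative assumptions and are not part of the homicides data. It reads the start positions and the match.length attribute, then recovers the matched text by hand with substr().

```r
# Toy input, invented for illustration (not from the homicides dataset)
x <- c("abc123def", "no digits here", "7 up")

m <- regexpr("[0-9]+", x)

start <- as.integer(m)           # where each first match begins, -1 if none
len <- attr(m, "match.length")   # length of each match, -1 if none

# Recover the matched text by hand, only for the strings that matched
hit <- start > 0
matched <- substr(x[hit], start[hit], start[hit] + len[hit] - 1)
print(matched)
```

The second string has no digits, so its start position and match length are both -1 and it is skipped by the hit filter.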
The ? placed after .* makes the * quantifier “lazy”, so it matches as few characters as possible and the match ends at the first </dd> tag. This is in contrast to the previous pattern, which was too greedy and matched too much of the string.
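The difference between the greedy and lazy forms is easiest to see on a string with more than one closing tag. The sketch below uses an invented string s (an assumption for illustration, not a record from the homicides dataset) and runs both forms of the pattern against it.

```r
# Toy string with two <dd> elements, invented for illustration
s <- "<dd>Found on January 1, 2007</dd><dd>Victim died at scene</dd>"

# [Ff] accepts either case of the first letter
greedy <- regexpr("<dd>[Ff]ound(.*)</dd>", s)   # .* runs on to the last </dd>
lazy <- regexpr("<dd>[Ff]ound(.*?)</dd>", s)    # .*? stops at the first </dd>

greedy_text <- regmatches(s, greedy)
lazy_text <- regmatches(s, lazy)

print(greedy_text)   # the whole string, both <dd> elements
print(lazy_text)     # only the first <dd> element
```

Both matches start at position 1; only the match length differs.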
The regmatches() function takes the original strings and the match data returned by regexpr() and extracts the matched text, so you do not have to call substr() yourself. For example, the following code extracts the matches from the first 5 strings in the homicides dataset:
> r <- regexpr("<dd>[F|f]ound(.*?)</dd>", homicides[1:5])
> regmatches(homicides[1:5], r)
[1] "<dd>Found on January 1, 2007</dd>" "<dd>Found on January 2, 2007</dd>"
[3] "<dd>Found on January 2, 2007</dd>" "<dd>Found on January 3, 2007</dd>"
[5] "<dd>Found on January 5, 2007</dd>"
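Given match data from regexpr(), regmatches() returns a character vector containing one match for each string that matched; strings without a match are dropped. Given match data from gregexpr(), it returns a list with one character vector per string, containing all of the matches in that string.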
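The shape of the result depends on which function produced the match data. The following sketch, on a made-up vector x (an illustrative assumption), compares the two cases side by side.

```r
# Toy input, invented for illustration
x <- c("a1b22", "none", "333")

# regexpr(): one match per matching string, non-matching strings dropped
first <- regmatches(x, regexpr("[0-9]+", x))

# gregexpr(): a list with one element per input string
all_m <- regmatches(x, gregexpr("[0-9]+", x))

print(first)
print(all_m)
```

Because the regexpr() form drops non-matching strings, its result can be shorter than the input; the gregexpr() form always has the same length as the input, with an empty character vector where nothing matched.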