The regexec() function in R is similar to the regexpr() function, but it also returns the indices for parenthesized sub-expressions. This can be useful for extracting more detailed information from a regular expression match.
For example, the following code uses regexec() to locate the "Found on" date in the first record of the Baltimore homicides data:
> regexec("<dd>[Ff]ound on (.*?)</dd>", homicides[1])
[[1]]
[1] 177 190
attr(,"match.length")
[1] 33 15
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE
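regexec() returns a list with one element per input string. Here that element is an integer vector of length two: the first value (177) is the starting position of the overall match, and the second (190) is the starting position of the parenthesized sub-expression, which captures the date of the homicide. The match.length attribute gives the corresponding lengths in characters (33 and 15).
We can use the substr() function, together with the start position and length of the sub-expression, to extract the date from the string: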
> substr(homicides[1], 190, 190 + 15 - 1)
[1] "January 1, 2007"
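Rather than typing the position and length by hand, the same values can be read from the object that regexec() returns. The following is a minimal sketch of that idea. The string x is a made-up stand-in for a single record (it is not taken from the homicides data), and the pattern is written with the character class [Ff].

```r
# Hypothetical stand-in for one record; not the real homicides data.
x <- "<dl><dd>Found on January 1, 2007</dd><dd>Victim died at scene</dd></dl>"

m <- regexec("<dd>[Ff]ound on (.*?)</dd>", x)

# m is a list with one element per input string. Each element holds the
# start of the overall match followed by the start of each capture group.
starts <- m[[1]]
lens   <- attr(m[[1]], "match.length")

# The second entry corresponds to the first parenthesized sub-expression.
date_str <- substr(x, starts[2], starts[2] + lens[2] - 1)
print(date_str)
```

Reading the positions from the match object keeps the extraction correct when the record changes, because nothing depends on fixed numbers such as 190 or 15.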
The regmatches() function can also extract the date. It takes the match data from regexec() directly, so no positions need to be hard-coded. The following code uses regmatches() to extract the date from the first two homicides in the data set:
> r <- regexec("<dd>[Ff]ound on (.*?)</dd>", homicides[1:2])
> regmatches(homicides[1:2], r)
[[1]]
[1] "<dd>Found on January 1, 2007</dd>" "January 1, 2007"
[[2]]
[1] "<dd>Found on January 2, 2007</dd>" "January 2, 2007"
regmatches() returns a list with one element per homicide record. Each element is a character vector of length two: the first entry is the overall match, and the second is the text captured by the parenthesized sub-expression, which here is the date of the homicide.
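In practice we usually want only the captured text, as a plain character vector aligned with the input. One possible way to get that is sketched below. The vector recs is a hypothetical set of records made up for illustration, including one with no "Found on" entry, and the helper logic is an assumption rather than part of the original analysis.

```r
# Hypothetical records standing in for the homicides vector.
recs <- c(
  "<dd>Found on January 1, 2007</dd><dd>Victim died at scene</dd>",
  "<dd>found on January 2, 2007</dd>",
  "<dd>No date recorded</dd>"
)

r <- regexec("<dd>[Ff]ound on (.*?)</dd>", recs)
m <- regmatches(recs, r)

# Keep only the capture group. Records with no match yield character(0),
# so map those to NA to keep the result the same length as the input.
dates_chr <- vapply(
  m,
  function(v) if (length(v) >= 2) v[2] else NA_character_,
  character(1)
)
print(dates_chr)
```

Using vapply() with character(1) guarantees one string per record, so the result can be stored as a column alongside the original data. The strings could then be passed to as.Date() with a format such as "%B %d, %Y", although parsing month names depends on the locale.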
Used together, regexec() and regmatches() make it straightforward to pull specific fields, such as the dates here, out of semi-structured text using regular expressions.