A regular expression, or regex, is a character string that describes a pattern to match, and possibly replace, in another text string of interest.
Every major editor or programming language has its own implementation of a regex engine, also called a flavour. PCRE (Perl[5] Compatible Regular Expression) is the one used by most of them; older flavours are BRE, ERE, EMACS, and VIM, while PSIX is the most recent, developed for Perl 6. You can find a list and a comparison at this page
-
Literal Matches: the search pattern is compared character for character with the source/target text, so it matches only where that exact text occurs.
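As a minimal sketch of a literal match in base R, the example below compares a literal search with a regex search for the same pattern. The sample strings and variable names are hypothetical, chosen only for illustration; `fixed = TRUE` is the base R argument that turns off regex interpretation.

```r
# Hypothetical sample data, not taken from the original text
x <- c("price: 3.50", "price: 3x50", "no price")

# fixed = TRUE treats the pattern as a plain literal string
literal_hits <- grepl("3.50", x, fixed = TRUE)   # TRUE FALSE FALSE

# Without fixed = TRUE the dot is a metacharacter matching any character
regex_hits <- grepl("3.50", x)                   # TRUE TRUE FALSE
```

The second call also matches "3x50" because the dot is no longer read literally.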
-
Character Classes: a set of characters wrapped in square brackets, which matches any single one of those characters.
Ranges are built using a dash between two ASCII characters, to mean any character between the two according to the ASCII table, endpoints included.
Negation is expressed with a caret as the first character inside the brackets.
Some particular ranges are predefined (POSIX classes, themselves used inside a bracket expression, as in [[:digit:]]):
- [:digit:]
- [:alpha:]
- [::]
- [::]
- [::]
- [::]
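The rules for character classes can be sketched in base R as follows. The word list and variable names are hypothetical; the anchors `^` and `$` are used only to force the whole string to match.

```r
# Hypothetical sample data
words <- c("cat", "cot", "cut", "c9t", "c-t")

# Explicit set: second character must be a or o
set_hits     <- grepl("^c[ao]t$", words)          # TRUE TRUE FALSE FALSE FALSE

# Range: any lowercase ASCII letter
range_hits   <- grepl("^c[a-z]t$", words)         # TRUE TRUE TRUE FALSE FALSE

# Negation: anything except a lowercase letter
negated_hits <- grepl("^c[^a-z]t$", words)        # FALSE FALSE FALSE TRUE TRUE

# POSIX class, written inside a bracket expression
posix_hits   <- grepl("^c[[:digit:]]t$", words)   # FALSE FALSE FALSE TRUE FALSE

# A dash placed last in the class is taken literally
dash_hits    <- grepl("^c[a-z-]t$", words)        # TRUE TRUE TRUE FALSE TRUE
```

The R documentation for regular expressions notes that the interpretation of ranges can depend on the locale, so POSIX classes such as [[:lower:]] are often the safer choice.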
-
Escape sequences: a sequence in a string that starts with a backslash \ is called an escape sequence. It lets us include in a string special characters, or characters that would otherwise have another meaning (usually a single-character metacharacter, see below):
- \\ backslash
- \r carriage return (ASCII code 0x0D)
- \n line feed, or newline (ASCII code 0x0A)
- \t tab
- \0 null
- \ (backslash followed by a space) space
- \" double quote
- \Ux, where x is an 8-digit hex code, denotes a particular Unicode character
- \. literal dot
- \* literal asterisk
- \? literal question mark
- \[ literal opening square bracket
- \] literal closing square bracket
- \- dash, or just put the dash at the end of the class itself
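In R, a backslash is also the escape character of string literals, so a regex escape normally has to be written with a doubled backslash. The sketch below illustrates this with hypothetical file names and variable names.

```r
# Hypothetical sample data; "C:\\temp" is the R literal for C:\temp
files <- c("report.txt", "report_txt", "a*b", "C:\\temp")

# "\\." in R source is the regex \. (a literal dot)
dot_hits       <- grepl("\\.txt$", files)   # TRUE FALSE FALSE FALSE

# "\\*" is the regex \* (a literal asterisk)
star_hits      <- grepl("a\\*b", files)     # FALSE FALSE TRUE FALSE

# Four backslashes in R source give the regex \\ (one literal backslash)
backslash_hits <- grepl("\\\\", files)      # FALSE FALSE FALSE TRUE

# "\t" is resolved by R itself into a tab character before the regex engine sees it
tab_split      <- strsplit("a\tb", "\t")[[1]]   # "a" "b"
```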
-
Metacharacters are shortcuts for an entire character class:
- . any single character
- \w ≡ [a-zA-Z0-9_] any word character (alphanum + underscore)
- \W ≡ [^a-zA-Z0-9_] any non-word character
- \ ≡ [a-zA-Z]
- \ ≡ [^a-zA-Z]
- \d ≡ [0-9] any digit
- \D ≡ [^0-9] any non-digit
- \s ≡ [ \t\n] any whitespace character: space, tab or newline
- \S ≡ [^ \t\n] any non-whitespace character, anything other than a space, tab or newline
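A short sketch of these shortcuts in base R, using hypothetical tokens and variable names. R's default engine accepts \d, \s, \w and their negations as extensions, and each backslash is doubled inside the R string.

```r
# Hypothetical sample data
tokens <- c("abc", "a_1", "a b", "42")

# \w+ : only word characters (letters, digits, underscore)
all_word   <- grepl("^\\w+$", tokens)   # TRUE TRUE FALSE TRUE

# \s : contains at least one whitespace character
has_space  <- grepl("\\s", tokens)      # FALSE FALSE TRUE FALSE

# \d+ : only digits
all_digits <- grepl("^\\d+$", tokens)   # FALSE FALSE FALSE TRUE

# . : any single character between "a" and "1"
any_char   <- grepl("^a.1$", tokens)    # FALSE TRUE FALSE FALSE

# \D : strip every non-digit
only_digits <- gsub("\\D", "", "tel: 555-0199")   # "5550199"
```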
-
Quantifiers: used with the above literal matches and character classes to address multiple characters at once (the first two entries are anchors, which match a position rather than a repetition):
- ^x start of string (outside brackets)
- x$ end of string (outside brackets)
- x? zero or one of x
- x* zero, one or more of x. In particular, .* means any number of any characters.
- x+ one or more of x
- x{n} exactly n of x. In particular, .{n} means any n characters.
- x{n,} n or more of x
- x{n1,n2} between n1 and n2 of x
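The quantifiers can be compared side by side on one small example. This is only a sketch; the strings and variable names are hypothetical.

```r
# Hypothetical sample data
x <- c("color", "colour", "colouur", "clr")

optional_u <- grepl("^colou?r$", x)      # zero or one u:  TRUE TRUE FALSE FALSE
star_u     <- grepl("^colou*r$", x)      # zero or more u: TRUE TRUE TRUE FALSE
plus_u     <- grepl("^colou+r$", x)      # one or more u:  FALSE TRUE TRUE FALSE
exact_two  <- grepl("^colou{2}r$", x)    # exactly two u:  FALSE FALSE TRUE FALSE
two_plus   <- grepl("^colou{2,}r$", x)   # two or more u:  FALSE FALSE TRUE FALSE
zero_one   <- grepl("^colou{0,1}r$", x)  # between 0 and 1 u: TRUE TRUE FALSE FALSE

# .{5} anchored at both ends: strings of exactly five characters
five_chars <- grepl("^.{5}$", x)         # TRUE FALSE FALSE FALSE
```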
-
Alternations: alternatives separated by a vertical bar and grouped with parentheses, tried starting with the first on the left and proceeding to the right until one pattern has been found
(first|second|third|...)
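A minimal sketch of alternation in base R, with hypothetical data and variable names. The left-to-right behaviour is shown with `perl = TRUE`, which selects the PCRE engine; R's default engine follows POSIX rules and may resolve overlapping alternatives differently.

```r
# Hypothetical sample data
pets <- c("cat", "dog", "bird", "catfish")

# Parentheses group the alternatives so the anchors apply to both
alt_hits <- grepl("^(cat|dog)$", pets)   # TRUE TRUE FALSE FALSE

# With PCRE the first alternative that succeeds wins, even if a later one is longer
m <- regexpr("cat|catfish", "catfish", perl = TRUE)
first_alt <- regmatches("catfish", m)    # "cat"
```

Putting the longer alternative first ("catfish|cat") is the usual way to make it take precedence under PCRE.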
Base R provides several functions that apply a regex; they differ in the format of, and the amount of detail in, their results:
-
grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)
-
grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
-
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
-
gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
-
regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
-
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
-
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
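To see how the results of these functions differ, the sketch below runs each of them on the same small input. The vector and variable names are hypothetical; only the default arguments listed above are used, apart from `value = TRUE`.

```r
# Hypothetical sample data
x <- c("apple pie", "banana", "cherry pie")

idx   <- grep("pie", x)                 # indices of matching elements: 1 3
vals  <- grep("pie", x, value = TRUE)   # the matching elements themselves
flags <- grepl("pie", x)                # one logical per element: TRUE FALSE TRUE

# regexpr: position of the first match in each element (-1 if none),
# with the match length stored in the "match.length" attribute
first_pos <- regexpr("p", x)            # 2 -1 8

# gregexpr: positions of all matches, returned as a list (one entry per element)
all_pos <- gregexpr("p", x[1])[[1]]     # 2 3 7

# regexec: full match plus capture groups, extracted here with regmatches()
parts <- regmatches(x[1], regexec("(\\w+) (\\w+)", x[1]))[[1]]
# "apple pie" "apple" "pie"

first_replaced <- sub("p", "P", x[1])   # first match only: "aPple pie"
all_replaced   <- gsub("p", "P", x[1])  # every match:      "aPPle Pie"
```

`regmatches()` is the base R companion that turns the positions returned by `regexpr`, `gregexpr` and `regexec` into the matched substrings.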
-