regex.md


        
A regular expression, or regex, is a character string that describes a pattern for matching, and possibly replacing, text within another string of interest.
Every major editor or programming language has its own implementation of a regex engine, also called a flavour. PCRE (Perl Compatible Regular Expressions, modelled on Perl 5) is the one used by most of them; older flavours are BRE, ERE, EMACS, and VIM, while PSIX, developed for Perl 6, is the most recent. You can find a list and a comparison at this page


Literal Matches: the search pattern is a plain sequence of characters and matches wherever exactly that sequence occurs in the source/target text.
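
A minimal sketch of a literal match in base R. The `fruits` vector is made-up sample data, and `fixed = TRUE` is shown as one way to make sure every character of the pattern is taken literally:

```r
# Hypothetical sample data, used only for illustration
fruits <- c("apple", "pineapple", "banana")

# A literal pattern matches wherever the same characters occur
has_apple <- grepl("apple", fruits)
print(has_apple)    # TRUE TRUE FALSE

# fixed = TRUE treats the pattern as plain text, so "." is not a wildcard
literal_dot <- grepl("a.c", c("a.c", "abc"), fixed = TRUE)
print(literal_dot)  # TRUE FALSE
```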


Character Classes: wrapped in square brackets, they match any single character among those listed.

Ranges are built using a dash between two ASCII characters, to mean any character between the two according to the ASCII table, the two endpoints included.

Negations are expressed with a caret as the first character inside the brackets.
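
The three forms above (explicit list, range, negation) can be sketched in base R as follows; the `words` vector is hypothetical sample data:

```r
# Hypothetical sample data, used only for illustration
words <- c("cat", "cot", "cut", "c4t")

# Explicit list: the middle character must be "a" or "o"
in_list <- grepl("c[ao]t", words)
print(in_list)      # TRUE TRUE FALSE FALSE

# Range: the middle character must be a lowercase letter
in_range <- grepl("c[a-z]t", words)
print(in_range)     # TRUE TRUE TRUE FALSE

# Negation: the middle character must NOT be a lowercase letter
negated <- grepl("c[^a-z]t", words)
print(negated)      # FALSE FALSE FALSE TRUE
```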

Some particular ranges are predefined as follows:

[:digit:]
[:alpha:]
[::]
[::]
[::]
[::]
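
In R, these POSIX class names only work inside an outer pair of brackets, for example `[[:digit:]]`. A short sketch, with `x` as made-up sample data:

```r
# Hypothetical sample data, used only for illustration
x <- c("abc", "123", "a1")

# The class name sits inside a bracket expression: [[:digit:]], not [:digit:]
only_digits <- grepl("^[[:digit:]]+$", x)
print(only_digits)  # FALSE TRUE FALSE

only_alpha <- grepl("^[[:alpha:]]+$", x)
print(only_alpha)   # TRUE FALSE FALSE
```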


Escape sequences (a sequence in a string that starts with a backslash \ is called an escape sequence; it lets us include in any string special characters, or characters that would otherwise have another meaning. It usually applies to a single-character metacharacter, see below):


\\ backslash


\r carriage return (ASCII Code 0x0D)


\n line feed, or newline (ASCII Code 0x0A)


\t tab


\0 null


\ space


\" double quote


\Ux where x is an 8-digit hexadecimal number, denotes a particular Unicode character.


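\. literal dot


\* literal asterisk


\? literal question mark


\[ literal opening square bracket


\] literal closing square bracket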


\- dash, or just put the dash at the end of the class itself
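
One R-specific point worth a sketch: inside an R string literal the backslash is itself an escape character, so a regex escape such as `\.` has to be typed as `"\\."`. The `files` vector below is made-up sample data:

```r
# Hypothetical sample data, used only for illustration
files <- c("report.txt", "report_txt", "a*b")

# Escaped dot: matches only a literal "."
escaped <- grepl("report\\.txt", files)
print(escaped)      # TRUE FALSE FALSE

# Unescaped dot: matches any single character, including "_"
unescaped <- grepl("report.txt", files)
print(unescaped)    # TRUE TRUE FALSE

# Escaped asterisk: matches a literal "*"
star <- grepl("a\\*b", files)
print(star)         # FALSE FALSE TRUE

# "\\." is two characters once R has parsed the string: backslash and dot
pattern_length <- nchar("\\.")
print(pattern_length)  # 2

# "\t" is turned into a real tab character by R before the regex engine sees it
has_tab <- grepl("\t", "a\tb")
print(has_tab)      # TRUE
```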


Metacharacters are shortcuts for an entire character class:

. any single character
\w ≡ [a-zA-Z0-9_] any word character (alphanum + underscore)
\W ≡ [^a-zA-Z0-9_] any non-word character
\ ≡ [a-zA-Z]
\ ≡ [^a-zA-Z]
\d ≡ [0-9] any digit
\D ≡ [^0-9] any non-digit
\s ≡ [ \t\r\n\f\v] any whitespace character: space, tab, carriage return, newline, form feed or vertical tab
\S ≡ [^ \t\r\n\f\v] any non-whitespace character, anything other than the characters above
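
A small sketch of these shortcuts in base R, where each backslash is doubled inside the string literal. The sample string `s` is made up for illustration:

```r
# Hypothetical sample string, used only for illustration
s <- "Order 66, item_7!"

# \D: drop every non-digit, keeping only the digits
digits <- gsub("\\D", "", s)
print(digits)       # "667"

# \s: drop every whitespace character
no_space <- gsub("\\s", "", s)
print(no_space)     # "Order66,item_7!"

# \w+: extract runs of word characters (the underscore counts as one)
tokens <- regmatches(s, gregexpr("\\w+", s))[[1]]
print(tokens)       # "Order" "66" "item_7"
```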


Anchors and quantifiers: anchors tie a match to a position in the string, while quantifiers follow one of the elements above to match several characters at once

^x x at the start of the string (caret outside brackets)
x$ x at the end of the string (dollar sign outside brackets)
x? zero or one of x
x* zero or more of x. In particular, .* means any number of any characters.
x+ one or more of x
x{n} exactly n of x. In particular, .{n} means any n characters.
x{n,} n or more of x
x{n1,n2} between n1 and n2 of x, both included (no space after the comma)
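
The list above can be sketched in base R by anchoring each pattern with `^` and `$` and varying only the quantifier on `b`. The vector `x` is made-up sample data:

```r
# Hypothetical sample data: "a", then zero to three "b", then "c"
x <- c("ac", "abc", "abbc", "abbbc")

zero_or_one  <- grepl("^ab?c$", x)      # TRUE  TRUE  FALSE FALSE
zero_or_more <- grepl("^ab*c$", x)      # TRUE  TRUE  TRUE  TRUE
one_or_more  <- grepl("^ab+c$", x)      # FALSE TRUE  TRUE  TRUE
exactly_two  <- grepl("^ab{2}c$", x)    # FALSE FALSE TRUE  FALSE
two_or_more  <- grepl("^ab{2,}c$", x)   # FALSE FALSE TRUE  TRUE
one_to_two   <- grepl("^ab{1,2}c$", x)  # FALSE TRUE  TRUE  FALSE

print(rbind(zero_or_one, zero_or_more, one_or_more,
            exactly_two, two_or_more, one_to_two))
```

Without the anchors, `grepl("b{2}", "abbbc")` would also be `TRUE`, because the pattern may then match anywhere inside the string.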


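Alternations: alternatives separated by a vertical bar are tried starting with the first on the left and proceeding to the right until one pattern has been found. They are grouped with parentheses, since square brackets would define a character class instead
(first|second|third|...)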
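
A minimal sketch of alternation in base R, with the alternatives grouped in parentheses. The data is made up, and `perl = TRUE` is used in the second call because the left-to-right order described above is how the PCRE engine proceeds:

```r
# Hypothetical sample data, used only for illustration
x <- c("cat", "dog", "bird")

# The whole string must be one of the listed alternatives
is_pet <- grepl("^(cat|dog)$", x)
print(is_pet)       # TRUE TRUE FALSE

# With PCRE the leftmost alternative that succeeds wins, even if a later
# alternative would have matched a longer piece of text
first_hit <- regmatches("catalog",
                        regexpr("cat|catalog", "catalog", perl = TRUE))
print(first_hit)    # "cat"
```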


Base R provides several functions that apply a regex; they differ in the format of, and the amount of detail in, their results:


grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE)


grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)


regexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)


gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)


regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)


sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)


gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
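
To make the differences between these functions concrete, here is a rough side-by-side sketch; all input strings are made-up sample data:

```r
# Hypothetical sample data, used only for illustration
x <- c("apple pie", "banana", "grape")

idx   <- grep("ap", x)                 # indices of matching elements: 1 3
vals  <- grep("ap", x, value = TRUE)   # the elements themselves
flags <- grepl("ap", x)                # one logical per element

first_a <- regexpr("a", "banana")          # position of the first match only
all_a   <- gregexpr("a", "banana")[[1]]    # positions of every match

# regexec also reports the parenthesised sub-matches
m <- regexec("([a-z]+)=([a-z]+)", "key=value")
parts <- regmatches("key=value", m)[[1]]   # full match, then each group

once  <- sub("a", "A", "banana")    # replaces the first match only
every <- gsub("a", "A", "banana")   # replaces all matches

print(list(idx = idx, vals = vals, flags = flags,
           first_a = as.integer(first_a), all_a = as.integer(all_a),
           parts = parts, once = once, every = every))
```

`regexpr`, `gregexpr` and `regexec` return positions and match lengths rather than text, which is why they are paired here with `regmatches` to pull out the matched substrings.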