Introduction

Regular expressions specify text by its characteristics rather than by its exact characters. The syntax used to write them is borrowed from tools such as GREP, LEX and YACC.

The following syntax definition uses EBNF notation.

Informal description

A regular expression is composed of a sequence of sub-expressions, each taking one of the forms listed in the operators table below. The entire expression may be preceded by ^ to indicate that it matches only at the start of a line, or followed by $ to indicate that it matches only at the end of a line.
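The effect of the ^ and $ anchors can be tried out with a short Python sketch. It assumes Python's re module rather than the tool this document describes, and the sample text is invented for illustration. In Python the re.MULTILINE flag is needed for ^ and $ to apply to each line instead of the whole string.

```python
import re

# Hypothetical sample text with three lines.
text = "error: disk full\nno error here\nfatal error"

# ^ anchors the match to the start of a line.
starts = re.findall(r"^error.*", text, flags=re.MULTILINE)

# $ anchors the match to the end of a line.
ends = re.findall(r"^.*error$", text, flags=re.MULTILINE)

print(starts)
print(ends)
```

Only the first line begins with "error" and only the last line ends with it, so each list holds a single line.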

EBNF syntax specification

EBNF (Extended Backus-Naur Form) is a style of specification used for formal syntax descriptions. Within the syntax description the following metacharacters are used:

a ::= b  Construct a is defined by construct b
[a]      Indicates that construct a is optional
(abc)    Indicates that constructs a, b and c are taken as a single construct
a|b      Indicates either construct a or construct b
'abc'    The literal characters 'a' followed by 'b' followed by 'c'
<a>      Single syntax construct a defined in the specification  

Regular expression operators

a+      One or more occurrences of a
a*      Zero or more occurrences of a
a?      Zero or one (i.e. optional) occurrence of a
a{n}    Exactly n occurrences of a
a{n,}   n or more occurrences of a
a{,m}   Between zero and m occurrences of a
a{n,m}  At least n but not more than m occurrences of a
a|b     Either a or b
a||b    a or b or both a and b in any order
abc     a followed by b followed by c
[abc]   A single character, one of a or b or c

[a-b]   A single character, ranging in value from a to b inclusive
[^abc]  A single character, any except a, b or c
(abc)   a followed by b followed by c
"abc"   The letters a followed by b followed by c with no special significance attached to a, b or c
.       Any character except a newline
\a      The letter a, with no special significance attached to a, except for the special forms below:

\t      The tab character
\n      The newline character
\r      The return character
\f      The formfeed character
\b      The backspace character
\xNN    The hex character NN
\0ooo   The octal character ooo
\w      A single character, one of [a-zA-Z0-9_]
\W      Any single character not matching \w
\d      A single character [0-9]
\D      A single character not matching \d

\s      A whitespace character [\t\r\n\f\b\ ]
\S      A single character not matching \s   
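Several of the operators above can be checked with a small Python sketch. This assumes Python's re module, whose syntax overlaps with the table but is not identical: it has no "abc" quoting and no a||b operator, and its \s class differs slightly (it does not include backspace). The helper name full and the test strings are illustrative only.

```python
import re

def full(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# (pattern, input, expected result) triples for operators from the table.
checks = [
    (r"a+", "aaa", True),      # one or more
    (r"a+", "", False),
    (r"a*", "", True),         # zero or more
    (r"a?", "aa", False),      # zero or one
    (r"a{2}", "aa", True),     # exactly n
    (r"a{2,}", "a", False),    # n or more
    (r"a{,2}", "aaa", False),  # at most m
    (r"a{1,3}", "aaa", True),  # between n and m
    (r"[^abc]", "d", True),    # any character except a, b or c
    (r"\w+", "snake_case9", True),
    (r"\d\D", "7x", True),
    (r"\s", "\t", True),
]

results = [full(p, s) == expected for p, s, expected in checks]
print(all(results))
```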

Examples

Expression             Matches              Does not match
-----------------------------------------------------------
"this"|"that"          this                 This
                       that                 That

\d{2}\.\d{2}           23.45                2.4
                       03.22                0.1

[a-zA-Z_]\w*           Identifier           2Identifiers

\(\*[\x01-\x7F]+\*\)   (* a comment *)      ( No comment *)
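The table can be verified with a Python sketch, under the assumption that the expressions are translated into Python's re syntax: the quoted literals "this" and "that" become plain text, and the other expressions carry over unchanged. The function name check is hypothetical. Whole-string matching (fullmatch) is used, since a substring search would find "Identifiers" inside "2Identifiers".

```python
import re

# pattern -> (strings that should match, strings that should not)
examples = {
    r"this|that": (["this", "that"], ["This", "That"]),
    r"\d{2}\.\d{2}": (["23.45", "03.22"], ["2.4", "0.1"]),
    r"[a-zA-Z_]\w*": (["Identifier"], ["2Identifiers"]),
    r"\(\*[\x01-\x7F]+\*\)": (["(* a comment *)"], ["( No comment *)"]),
}

def check(pattern, matching, non_matching):
    """True if every 'matching' string matches in full and no 'non_matching' one does."""
    rx = re.compile(pattern)
    return (all(rx.fullmatch(s) for s in matching)
            and not any(rx.fullmatch(s) for s in non_matching))

for pattern, (yes, no) in examples.items():
    print(pattern, check(pattern, yes, no))
```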

Regular expression examples

1. Locate Internet references

  ("http://"|"https://"|"mailto:"|"ftp://")[^ \n\r\"\<\\]+

This expression detects Internet references that start with 'http://', 'https://', 'mailto:' or 'ftp://'.

In English, the expression reads:

"Find all occurrences of text that start with 'http://', 'https://', 'mailto:' or 'ftp://' and are followed by at least one character that is not a space, a newline (\n), a carriage return (\r), a double quote (\"), a left angle bracket (\<), or a backslash (\\)."
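A rough Python equivalent of this expression is sketched below. It assumes Python's re module, so the quoted literals are written as plain text and the group is made non-capturing; the sample text and addresses are invented for illustration.

```python
import re

# Translation of the expression above into Python's re syntax (an assumption,
# not the syntax of the tool described in this document).
link_pattern = re.compile(r'(?:http://|https://|mailto:|ftp://)[^ \n\r"<\\]+')

# Hypothetical input text.
text = ('See <a href="http://example.com/docs">docs</a> '
        'or write to mailto:someone@example.com today')

links = link_pattern.findall(text)
print(links)
```

The first match stops at the closing double quote and the second at the following space, because both characters are excluded by the character class.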

2. Locate all H1 HTML Tags

  (<[hH]1>)(.+)(</[hH]1>)

In English, the expression reads:

"Find all occurrences of text that start with an open tag bracket followed by 'h1' or 'H1' and a close tag bracket, followed by one or more characters of any kind (except a newline), followed by an open tag bracket, a slash, 'h1' or 'H1', and a close tag bracket."
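The same search can be sketched in Python, assuming Python's re module. The character class is written [hH] here because in Python a | inside square brackets is a literal character, not an alternation. The sample HTML is invented for illustration.

```python
import re

# Three groups: opening tag, heading text, closing tag.
h1_pattern = re.compile(r"(<[hH]1>)(.+)(</[hH]1>)")

# Hypothetical HTML fragment.
html = "<h1>Title</h1>\n<p>text</p>\n<H1>Second</H1>"

headings = h1_pattern.findall(html)
print(headings)
```

Because .+ is greedy, two headings on the same line would be returned as a single match; here each heading sits on its own line and . does not cross a newline.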

3. Locate all HTML Hex Colors

  #{1}([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})

In English, the expression reads:

Find a # character followed by six hex characters (A-F, a-f, 0-9) or three hex characters (A-F, a-f, 0-9).
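A Python sketch of the colour search follows, assuming Python's re module. The redundant {1} after # is dropped, which does not change what is matched. The sample style text is invented for illustration.

```python
import re

# Six hex digits are tried before three, as in the expression above.
color_pattern = re.compile(r"#([A-Fa-f0-9]{6}|[A-Fa-f0-9]{3})")

# Hypothetical style text; "#12" is too short to be a colour.
css = "color: #FFF; background: #1a2B3c; border: #12;"

colors = [m.group(0) for m in color_pattern.finditer(css)]
print(colors)
```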

4. Locate all HTML Entities

  &([^;\s])+;

In English, the expression reads:

Find an & character followed by one or more characters other than ";" and whitespace, then followed by a ";" character.
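The entity search can be sketched in Python as follows, assuming Python's re module. The group is dropped in this translation because a repeated capturing group in Python keeps only its last repetition. The sample text is invented for illustration.

```python
import re

# An ampersand, one or more characters that are neither ';' nor whitespace, then ';'.
entity_pattern = re.compile(r"&[^;\s]+;")

# Hypothetical input; the bare "& " is not an entity because a space follows it.
text = "Fish &amp; chips cost &lt; &#163;5 & that is all;"

entities = entity_pattern.findall(text)
print(entities)
```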