A regular expression (regex or regexp for short) is a text string that describes a search pattern. You can think of regular expressions as wildcards on steroids. You are probably familiar with simple wildcard notations such as *.txt in Windows Explorer.
You can do much more with regular expressions than you can with simple wildcards. For example, you could use the regular expression \b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b to search for an email address. Applied case-insensitively, it finds most common addresses, although the {2,4} at the end limits the top-level domain to between two and four letters.
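As a rough illustration, the pattern above can be tried in Python's re module. The choice of Python, the sample sentence and the variable names are assumptions made for this sketch. The IGNORECASE flag is needed because the pattern lists only upper-case letters.

```python
import re

# Pattern copied from the text; IGNORECASE lets A-Z also match a-z.
EMAIL = re.compile(r"\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

# Hypothetical sample text, used only for illustration.
text = "Contact jane.doe@example.com or sales@example.org for details."

emails = EMAIL.findall(text)
print(emails)  # ['jane.doe@example.com', 'sales@example.org']
```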
The power of regular expressions comes with a learning curve - regular expressions can be difficult to master and complex to read. But in terms of search (and replace) functionality, they are difficult to beat and can turn simple Find and Replace functionality into a powerful content manipulation tool.
This topic contains just a few simple examples of regular expressions and is intended only as an introduction. More resources on regular expressions, including tutorials, can be found through a simple web search on the term "Regular Expression".
The simplest regular expressions contain just literal characters, e.g. a. A regular expression containing literal characters will match the first occurrence of the specified character or characters.
There are 11 characters (metacharacters) with special meanings in a regular expression - [ \ ^ $ . | ? * + ( ). If you want to use any of these characters as a literal in a regex, escape them with a backslash. For example, to match the literal string Does 1 + 1 = 2? the correct regex would be Does 1 \+ 1 = 2\?.
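The escaping rule can be checked with a short sketch, here assuming Python's re module. The variable names are illustrative. re.escape is a standard-library helper that adds the backslashes for you, which is often safer than escaping by hand.

```python
import re

literal = "Does 1 + 1 = 2?"

# Escaped by hand, exactly as in the text.
hand_escaped = r"Does 1 \+ 1 = 2\?"
# Escaped automatically by the standard library.
auto_escaped = re.escape(literal)

# Without escaping, + and ? act as quantifiers, so the literal text is not found.
unescaped_match = re.search(literal, literal)
hand_match = re.search(hand_escaped, literal)
auto_match = re.search(auto_escaped, literal)

print(unescaped_match)     # None
print(hand_match.group())  # Does 1 + 1 = 2?
print(auto_match.group())  # Does 1 + 1 = 2?
```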
A character class is a special construct that matches one out of several specified characters. For example, st[oe]p would match either stop or step. The order of characters in the character class is not important. A character class matches only a single character so, extending the previous example, st[oe]p would not match steep or stoop.
You can use a hyphen to specify a range of characters. [0-9] matches a single digit between 0 and 9. You can use multiple ranges. [a-zA-Z] matches any single letter. You can combine ranges and single characters. [a-zA-Z1] matches a letter or the digit 1.
Typing a ^ character at the start of the character class (inside the opening square bracket) will negate the class - the class will then match any single character that is not listed in it. ch[^a]p would match chip and chop, but not chap.
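A small sketch of the character class example, assuming Python's re module. The word list is made up for illustration. Matching is case-sensitive by default, so a capitalised word only matches if you ask for case-insensitive matching.

```python
import re

pattern = re.compile(r"st[oe]p")
words = ["stop", "step", "steep", "stoop", "Stop"]

# The class [oe] stands for exactly one character, so steep and stoop fail.
matched = [w for w in words if pattern.search(w)]
print(matched)  # ['stop', 'step']

# With IGNORECASE the capitalised form is accepted as well.
matched_any_case = [w for w in words if re.search(r"st[oe]p", w, re.IGNORECASE)]
print(matched_any_case)  # ['stop', 'step', 'Stop']
```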
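Ranges and negated classes can be tried the same way. This is a minimal sketch assuming Python's re module, with sample strings invented for the purpose.

```python
import re

# A single range: every digit in the sample is found on its own.
digits = re.findall(r"[0-9]", "a1b22")
print(digits)  # ['1', '2', '2']

# Two ranges plus the single character 1.
letters_or_one = re.findall(r"[a-zA-Z1]", "x1-2Y")
print(letters_or_one)  # ['x', '1', 'Y']

# A negated class: any single character except a.
not_a = [w for w in ["chip", "chop", "chap"] if re.search(r"ch[^a]p", w)]
print(not_a)  # ['chip', 'chop']
```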
There are a number of pre-defined shorthand character classes available for use in a regexp.
A regexp can contain character sequences to identify non-printable characters.
Non-printable characters can be used directly in the regexp, or in character classes.
The dot matches any single character, except line break characters.
st.p matches step, stop, st%p etc.
Use the dot carefully. Often a character class (or negated character class) is more precise.
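The difference between the dot and a character class can be seen in a short sketch, assuming Python's re module. The candidate strings and the narrower class [a-z] are illustrative choices.

```python
import re

candidates = ["step", "stop", "st%p", "st\np"]

# The dot accepts any character except a line break.
dot = re.compile(r"st.p")
dot_hits = [c for c in candidates if dot.fullmatch(c)]
print(dot_hits)  # ['step', 'stop', 'st%p']

# A character class states exactly what is allowed in that position.
precise = re.compile(r"st[a-z]p")
precise_hits = [c for c in candidates if precise.fullmatch(c)]
print(precise_hits)  # ['step', 'stop']
```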
Anchors are used to match the start or end of the string: ^ matches at the start and $ matches at the end.
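A brief sketch of anchors, assuming Python's re module, where ^ marks the start of the string and $ marks the end. The sample strings are made up.

```python
import re

# ^ ties the match to the start of the string.
starts_with_rain = re.search(r"^rain", "rain today") is not None
not_at_start = re.search(r"^rain", "heavy rain") is not None

# $ ties the match to the end of the string.
ends_with_rain = re.search(r"rain$", "heavy rain") is not None

print(starts_with_rain, not_at_start, ends_with_rain)  # True False True
```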
Alternation is the regular expression "or" functionality. rain|shine will match rain in Do you think it will rain today? Perhaps the sun will shine. If the regexp is applied again, it will match shine.
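The alternation example can be reproduced with a minimal sketch, assuming Python's re module. Applying the regexp again is modelled here with finditer, which resumes the search after each match.

```python
import re

text = "Do you think it will rain today? Perhaps the sun will shine."

# finditer yields successive matches, left to right.
found = [m.group() for m in re.finditer(r"rain|shine", text)]
print(found)  # ['rain', 'shine']
```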
The question mark makes the preceding token optional. E.g. stee?p matches step or steep.
The asterisk will match the preceding token zero or more times. The plus matches the preceding token once or more. <[A-Za-z][A-Za-z0-9]*> matches an HTML tag without any attributes. <[A-Za-z0-9]+> is easier to write but matches invalid tags such as <1>.
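The optional, zero-or-more and one-or-more quantifiers can be compared in a short sketch, assuming Python's re module. The sample tags and word list are invented for illustration.

```python
import re

# ? makes the second e optional.
optional_hits = [w for w in ["step", "steep", "steeep"] if re.fullmatch(r"stee?p", w)]
print(optional_hits)  # ['step', 'steep']

sample = "<b> <h1> <1>"

# One letter, then zero or more letters or digits.
strict_tags = re.findall(r"<[A-Za-z][A-Za-z0-9]*>", sample)
print(strict_tags)  # ['<b>', '<h1>']

# One or more letters or digits: shorter, but it also accepts <1>.
loose_tags = re.findall(r"<[A-Za-z0-9]+>", sample)
print(loose_tags)  # ['<b>', '<h1>', '<1>']
```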
Use curly braces to specify an exact number of repetitions. Use \b[1-9][0-9]{3}\b to match a number between 1000 and 9999.
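A quick check of the four-digit pattern, assuming Python's re module. The list of numbers is a made-up sample chosen to include values just outside the range.

```python
import re

four_digit = re.compile(r"\b[1-9][0-9]{3}\b")

# 999 is too short, 10000 is too long, and 0123 starts with a zero.
numbers = four_digit.findall("999 1000 4821 9999 10000 0123")
print(numbers)  # ['1000', '4821', '9999']
```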
Repetition operators are "greedy". They will expand the match as far as possible. <.+> will match <STRONG>True</STRONG> in If the value is <STRONG>True</STRONG>.
Place a question mark after the quantifier to make it "lazy". <.+?> will match <STRONG> in the previous example.
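Greedy and lazy matching can be compared side by side. This is a minimal sketch assuming Python's re module, using the sentence from the text.

```python
import re

text = "If the value is <STRONG>True</STRONG>."

# Greedy: .+ runs to the end of the line, then backs up to the last >.
greedy = re.search(r"<.+>", text).group()
print(greedy)  # <STRONG>True</STRONG>

# Lazy: .+? stops at the first > it can reach.
lazy = re.search(r"<.+?>", text).group()
print(lazy)  # <STRONG>
```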
Place parentheses (round brackets) around multiple tokens to group them together. You can then apply a quantifier to the whole group. E.g. Fruit(fly)? matches Fruit or Fruitfly.
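A short sketch of grouping, assuming Python's re module. In that module a group also captures the text it matched, which the example reads back with group(1).

```python
import re

pattern = re.compile(r"Fruit(fly)?")

# The ? applies to the whole group, so "fly" is present or absent as a unit.
with_group = pattern.fullmatch("Fruitfly")
without_group = pattern.fullmatch("Fruit")
partial = pattern.fullmatch("Fruitf")

print(with_group.group(1))     # fly
print(without_group.group(1))  # None, the optional group did not take part
print(partial)                 # None, "f" alone is not the group
```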
A lookaround is a special kind of group. The tokens inside the group are matched normally, but the text they match is not included in the overall match result. A lookaround therefore matches a position, similar to anchors.
st(?=e) matches the st in steer but not in stop. This is called a positive lookahead.
st(?!e) matches the st in stop but not the st in steer. This is called a negative lookahead.
To look backwards, use a lookbehind. (?<=t)op matches the op in top but not in pop.
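The three lookaround forms above (usually called positive lookahead, negative lookahead and lookbehind) can be verified with a small sketch, assuming Python's re module. The helper name found is hypothetical. In each case the matched text excludes the characters inside the lookaround.

```python
import re

def found(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group() if m else None

# Looking ahead for an e: the e is checked but not consumed.
ahead_steer = found(r"st(?=e)", "steer")
ahead_stop = found(r"st(?=e)", "stop")

# Looking ahead for anything but an e.
not_ahead_stop = found(r"st(?!e)", "stop")
not_ahead_steer = found(r"st(?!e)", "steer")

# Looking backwards for a t.
behind_top = found(r"(?<=t)op", "top")
behind_pop = found(r"(?<=t)op", "pop")

print(ahead_steer, ahead_stop)          # st None
print(not_ahead_stop, not_ahead_steer)  # st None
print(behind_top, behind_pop)           # op None
```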