Help:Regexp

From Talk2000.NL

Jump to: navigation, search

externe links


Two special characters are used in almost all regular expression tools to mark the beginning and end of a line: caret (^) and dollarsign ($). To match a caret or dollarsign as a literal character, you must escape it (i.e. precede it by a backslash "\").

An interesting thing about the caret and dollarsign is that they match zero-width patterns. That is the length of the string matched by a caret or dollarsign by itself is zero (but the rest of the regular expression can still depend on the zero-width match). Many regular expression tools provide another zero-width pattern for word-boundary (\b). Words might be divided by whitespace like spaces, tabs, newlines, or other characters like nulls; the word-boundary pattern matches the actual point where a word starts or ends, not the particular whitespace characters.


In older Unix-oriented tools like grep, subexpressions must be grouped with escaped parentheses, e.g. /\(Mary\)/. In Perl and most more recent tools (including egrep), grouping is done with bare parentheses, but matching a literal parenthesis requires escaping it in the pattern (the example to the side follows Perl).


/[^a-z]a/

The caret symbol can actually have two different meanings in regular expressions. Most of the time, it means to match the zero-length pattern for line beginnings. But if it is used at the beginning of a character class, it reverses the meaning of the character class. Everything not included in the listed character set is matched.


The pipe character in a regular expression indicates an alternation between everything in the group enclosing it. What this means is that even if there are several groups to the left and right of a pipe character, the alternation greedily asks for everything on both sides. To select the scope of the alternation, you must define a group that encompasses the patterns that may match.


There is only one quantifier included with "basic" regular expression syntax, the asterisk ("*"); in English this has the meaning "some or none" or "zero or more." If you want to specify that any number of an atom may occur as part of a pattern, follow the atom by an asterisk.


Extended regular expressions (which most tools support) add a few other useful numbers to "once exactly" and "zero or more times." The plus-sign ("+") means "one or more times" and the question-mark ("?") means "zero or one times."


Non-greedy quantifiers have the same syntax as regular greedy ones, except with the quantifier followed by a question-mark. For example, a non-greedy pattern might look like: "/A[A-Z]*?B/". In English, this means "match an A, followed by only as many capital letters as are needed to find a B."


As a little mnemonic, it is nice to remember the word "gismo" (it even seems somehow appropriate). The most frequent modifiers are:

   * g - Match globally
   * i - Case-insensitive match
   * s - Treat string as single line
   * m - Treat string as multiple lines
   * o - Only compile pattern once

A group that should not also be treated as a back reference has a question-mark colon at the beginning of the group, as in "(?:pattern)." In fact, you can use this syntax even when your backreferences are in the search pattern itself.


There are two kinds of lookahead assertions: positive and negative. As you would expect, a positive assertion specifies that something does come next, and a negative one specifies that something does not come next. Emphasizing their connection with non-backreferenced groups, the syntax for lookahead assertions is similar: (?=pattern) for positive assertions, and (?!pattern) for negative assertions.

Personal tools