Help:Regexp
From Talk2000.NL
|
Two special characters are used in almost all regular expression tools to mark the beginning and end of a line: the caret (^) and the dollar sign ($). To match a caret or dollar sign as a literal character, you must escape it (i.e. precede it with a backslash, "\"). An interesting thing about the caret and dollar sign is that they match zero-width patterns. That is, the length of the string matched by a caret or dollar sign by itself is zero (but the rest of the regular expression can still depend on the zero-width match).

Many regular expression tools provide another zero-width pattern for the word boundary (\b). Words might be divided by whitespace such as spaces, tabs, and newlines, or by other characters such as nulls; the word-boundary pattern matches the actual point where a word starts or ends, not the particular characters that divide the words.

In older Unix-oriented tools like grep, subexpressions must be grouped with escaped parentheses, e.g. /\(Mary\)/. In Perl and most more recent tools (including egrep), grouping is done with bare parentheses, and matching a literal parenthesis requires escaping it in the pattern (the example to the side follows Perl).
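The anchors, the word boundary, and Perl-style grouping described above can be tried out in a short sketch. It uses Python's `re` module, which is an assumption on our part (the text itself follows Perl); the sample strings and variable names are made up for illustration.

```python
import re

text = "Mary had a little lamb\nAnd Mary sang"

# "^" matches a position, not a character, so the matched string is empty.
caret_match = re.search(r"^", text).group()

# The rest of the pattern still depends on that position: "^Mary" only
# matches the "Mary" at the start, not the one in the middle of line two.
mary_at_line_start = [m.start() for m in re.finditer(r"^Mary", text)]

# In Python, re.MULTILINE lets "$" match at the end of every line.
lamb_at_line_end = re.search(r"lamb$", text, flags=re.MULTILINE) is not None

# A backslash turns "$" into a literal dollar sign.
prices = re.findall(r"\$\d+", "cost: $15, not 15$")

# \b matches the point where a word starts or ends, so "cat" inside
# "concat" or "category" is not a whole-word match.
whole_words = re.findall(r"\bcat\b", "cat concat category")

# Perl-style grouping: bare parentheses group, escaped ones are literal.
grouped = re.search(r"(Mary) had", text).group(1)
literal = re.search(r"\(Mary\)", "call (Mary) now").group()
```

Older grep-style tools would write the grouping the other way around, as the paragraph above notes, so the last two patterns are specific to the Perl-like dialect.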
The caret symbol can actually have two different meanings in regular expressions. Most of the time, it matches the zero-length pattern for line beginnings. But if it appears at the beginning of a character class, it reverses the meaning of the character class: everything not included in the listed character set is matched.

The pipe character in a regular expression indicates an alternation between everything in the group enclosing it. This means that even if there are several groups to the left and right of a pipe character, the alternation greedily takes in everything on both sides. To limit the scope of the alternation, you must define a group that encompasses only the patterns that may match.

There is only one quantifier included with "basic" regular expression syntax, the asterisk ("*"); in English this means "some or none" or "zero or more." If you want to specify that any number of an atom may occur as part of a pattern, follow the atom with an asterisk. Extended regular expressions (which most tools support) add a few other useful quantities to "once exactly" and "zero or more times." The plus sign ("+") means "one or more times" and the question mark ("?") means "zero or one times."

Non-greedy quantifiers have the same syntax as regular greedy ones, except that the quantifier is followed by a question mark. For example, a non-greedy pattern might look like: "/A[A-Z]*?B/". In English, this means "match an A, followed by only as many capital letters as are needed to find a B."

As a little mnemonic for pattern modifiers, it is nice to remember the word "gismo" (it even seems somehow appropriate). The most frequent modifiers are:
* g - Match globally
* i - Case-insensitive match
* s - Treat string as single line
* m - Treat string as multiple lines
* o - Only compile pattern once

A group that should not also be treated as a back reference has a question mark and colon at the beginning of the group, as in "(?:pattern)".
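A minimal sketch of the points above, again assuming Python's `re` module rather than Perl. The flags `re.I`, `re.S`, and `re.M` are used here as rough counterparts of the i, s, and m modifiers, and `re.findall` stands in for g; the o modifier has no direct counterpart in this sketch. All sample strings and variable names are illustrative.

```python
import re

# A caret at the start of a character class negates the class.
non_vowels = re.findall(r"[^aeiou]", "regex")

# Alternation takes in everything on both sides of the pipe...
wide = re.fullmatch(r"cat|dog food", "cat") is not None
# ...unless a group limits its scope.
narrow_bare = re.fullmatch(r"(cat|dog) food", "cat") is not None
narrow_full = re.fullmatch(r"(cat|dog) food", "cat food") is not None

# Quantifiers: "*" zero or more, "+" one or more, "?" zero or one.
star = re.fullmatch(r"ab*c", "ac") is not None
plus = re.fullmatch(r"ab+c", "ac") is not None
optional = re.fullmatch(r"ab?c", "abbc") is not None

# Greedy versus non-greedy matching on the same input.
sample = "AXXBXXB"
greedy = re.search(r"A[A-Z]*B", sample).group()
lazy = re.search(r"A[A-Z]*?B", sample).group()

# Modifier counterparts: case-insensitive global search, and "single
# line" mode in which "." also matches a newline.
any_case = re.findall(r"mary", "Mary MARY", flags=re.I)
dot_all = re.search(r"a.b", "a\nb", flags=re.S) is not None

# (?:...) groups without creating a back reference, so only the name
# is captured here.
captured = re.search(r"(?:Mr|Ms)\. (\w+)", "Ms. Lovelace").groups()
```

Here the greedy pattern runs to the last B in the sample while the non-greedy one stops at the first, which is the difference the "only as many capital letters as are needed" wording describes.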
In fact, you can use this syntax even when your backreferences are in the search pattern itself. There are two kinds of lookahead assertions: positive and negative. As you would expect, a positive assertion specifies that something does come next, and a negative one specifies that something does not come next. Emphasizing their connection with non-backreferenced groups, the syntax for lookahead assertions is similar: (?=pattern) for positive assertions, and (?!pattern) for negative assertions. |
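The backreference and lookahead behaviour described above can be sketched as follows. This assumes Python's `re` module, where a backreference inside the search pattern is written `\1`; the sample sentences and variable names are invented for illustration.

```python
import re

# A backreference used inside the search pattern itself: \1 must repeat
# whatever the first group matched, which finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine").group()

# Positive lookahead: match a word only if "!" comes next. The "!" is
# checked but not consumed, so it is not part of the result.
shouted = re.findall(r"\w+(?=!)", "stop! go wait!")

# Negative lookahead: match a whole word only if "!" does not come next.
calm = re.findall(r"\b\w+\b(?!!)", "stop! go wait!")
```

Like the caret and dollar sign, both assertions are zero-width: they constrain where a match may occur without adding anything to the matched text.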
