Last post we discussed the eagerness of regular expression (regex) engines with respect to the "|" metacharacter. Today we will look at another personification of the operation of regex engines, namely that of "greediness".
As a disclaimer, keep in mind that this post is covering regex-directed engines, not text-directed engines.
Regex-directed engines are "greedy" in that they will return the largest-possible regex match any given regex can afford the engine to return. In other words, the engine is "greedy" to provide us with the largest-possible match that a regex provided. The way this works out in practice, is that the right-most valid match for a token will be returned whenever greedy metacharacters are involved. Take for example the regex
".+"that is matched to the String 'I said, "Hello." In return, she said "Hi!"'. The returned match will be '"Hello." In return, she said "Hi!"', even though it would appear that the intention would have been to return either '"Hello."' or '"Hi!"'. The engine was greedy to return the largest match the supplied regex could muster given the provided String, therefore, moving left-to-right, as soon as the engine found a proper match of the second """regex token, the engine attempted to go even farther along the String in order to find yet another """ match. Only when the engine could not find another instance of """ in the String did it back-track in order to return the most recent (and greediest) match.
The above example is one of "repetition", or the use of the "+" character, informing the regex engine to match the preceding token one or more times. The preceding token was the "." special character, which means "any character except \n". Given this context, the "+" metacharacter will greedily match the largest String that is contained within two double-quote characters, including the double-quote characters.
Other repetition metacharacters include "*" and "?". The "*" metacharacter informs the regex engine to match the preceding token zero or more times. Notice the slight nuance between "*" and "+"; one will match zero or more times while the other will match one or more times, respectively. Meanwhile the "?" metacharacter will match the preceding token one time or not at all. While this is sometimes referred to as an optional character, it is technically still a form of repetition.
There are a number of ways to defeat the greedy nature of "*", "+", and "?".
- A "?" after a greedy metacharacter will override the greediness of the operator
- ".+?" will evaluate ""Hello."" before returning the initial evaluation
- Modify the token preceding the repetition metacharacter to be the negation of the intended stopping point
- "[^"]+" will also evaluate as ""Hello""