Tuesday, August 13, 2013

Regex-Directed Regular Expression Engine Eagerness (say that 5 times fast)

We may not typically think of our computer software as "eager", however the personification is very useful in understanding how regular expression (regex) engines process regular expressions (regexes).  For a very brief introduction to regex, see my previous post on the underpinnings of "$" and "^" metacharacters.

As a disclaimer, keep in mind that this post is covering regex-directed engines, not text-directed engines.

Regex-directed engines are "eager" in that they will return the first possible regex match any given regex can afford the engine to return.  In other words, the engine is "eager" to provide us with the first possible match that a regex provided.  The way this works out in practice, is that the left-most tokens in a regex that constitute a valid match will be returned before any other tokens of that regex are considered.  Take for example the regex
Ok|Okay
that is matched to the String "Okay".  The returned match will be "Ok", even though it would appear that the second token of the regex, Okay, would have been a better fit.  The engine was eager to match the regex to the provided String, therefore, moving left-to-right, as soon as the engine found a proper match of the regex token Ok with the first two letters of "Okay", the engine returned a match!

The above example is one of "alternation", or the use of the "|" character, informing the regex engine that one alternative or the other is an acceptable match for the regex.  In the case of alternation, there are two ways to protect against the engine being "overly eager" with your matches.

  1. Order your alternatives in such a way that defeats the engines eager nature
    1. Okay|Ok will evaluate "Okay" before ever evaluating "Ok"
  2. Utilize the "optional item" metacharacter "?", with grouping as necessary
    1. Ok(ay)? will first evaluate "Okay", and if it is not found, evaluate a match for "Ok" next.
As an important side-note with utilizing the alternation metacharacter, it is often helpful to consider that Strings outside of our alternatives may still match a given String.  For the above example, "Okaydokay" will still match "Okay".  To protect against this, it is useful to implement the word boundary operator, \b in such a way so as to protect our alternatives, such as: \bOk(ay)?\b

With eagerness covered, in my next post, I will turn my attention to another personification of regex-directed regular expression engines: greediness.

No comments:

Post a Comment