Friday, August 9, 2013

"^", "$" and Regular Expressions

My work in edX's Software as a Service (SaaS) CS169.1x has required me to investigate Regular Expressions (regex or regexp for short).  Unfortunately the course does not cover regular expressions very extensively, so I have taken to the interwebs in order to gain a deeper understanding of these finicky expressions.  My understanding is based on the online resource regular-expressions.info.

First off, what is a regex? A regex is a pattern that describes a certain amount of text.  This pattern is built from a series of rules and symbols that are an entire study unto themselves.  At a basic level, a regex is composed of "tokens".  Each token describes a character or set of characters that makes up part (or all) of the full text.  A regex engine attempts to match the sequence of tokens against a piece of text and returns the matches within the text, if any.
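
To make the idea of tokens a little more concrete, here is a minimal sketch.  I am assuming Python and its built-in re module purely for illustration (nothing here comes from the course), and the variable names are my own.

```python
import re

text = "Hello, World!"

# The pattern "Wor" is three tokens, each matching one literal character.
first = re.search(r"Wor", text)

# findall returns every non-overlapping match of the single token "l".
all_l = re.findall(r"l", text)

print(first.group(), first.span())  # Wor (7, 10)
print(all_l)                        # ['l', 'l', 'l']
```

The span (7, 10) says the match starts at position 7 and ends just before position 10, which ties in with the idea of string positions discussed further down.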

Tokens come in many shapes, sizes, and flavors that I will almost certainly touch on in future blog posts.  Today I will be focusing on the "^" (or caret) and "$" tokens, which are more specifically known as "special characters" or "metacharacters".

Although many are used to the "^" in the context of expressing exponents in mathematics (e.g. 10^2 representing ten to the second power), the "^" in the context of regex represents "before the first character of a string of text".  Therefore, given the string of text,
Hello, World!
the "^" position would be just before the "H".

Even more familiar than the caret's meaning in mathematics, the "$" is most commonly understood as the dollar sign used to represent money in text (e.g. $100.00).  However, the "$" in the context of regex represents "after the last character of a string of text".  Again, visiting our string of text,
Hello, World!
the "$" position would be just after the "!".
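
One way to see that "^" and "$" refer to positions rather than characters is to substitute something at those positions.  This is a small sketch, again assuming Python's re module; the bracket characters are arbitrary markers I picked for the example.

```python
import re

text = "Hello, World!"

# "^" and "$" match positions, not characters, so substituting them
# inserts text without removing anything from the original string.
with_start = re.sub(r"^", "[", text)
with_end = re.sub(r"$", "]", text)

print(with_start)  # [Hello, World!
print(with_end)    # Hello, World!]
```

Nothing is deleted in either case, because neither metacharacter consumes a character of the text.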

To better understand how the regex engine is able to work on the text in this way, let us consider how a computer understands a string of text.  A string of text contains not only the characters in the text (e.g. the characters: "H","e","l","l","o",","," ","W","o","r","l","d","!" from our previous example), but also includes the INDEXES of those characters.

An index is an integer value representing the placement of any given character in a string.  For almost all computer languages, a string's starting point is the 0 index, and that index is before the first character.  The indexes then increment upwards from there (i.e. index 0, first character, then index 1, second character, then index 2, third character, and so on).  So a more accurate way to view our example phrase, with the indexes included, is:
0H1e2l3l4o5,6 7W8o9r10l11d12!13

Given this understanding of the fundamental nature of strings of text as understood by a computer program, we can have a greater appreciation for what is happening as a regex is evaluating "^" and "$".  As we learned earlier, "^" refers to the position before the first character in a string of text.  We can understand the "^" of our example phrase as the zeroth index of the string.  Similarly, we can understand the "$", which refers to the position after the last character in a string of text, as index 13, the otherwise empty position just after the "!".
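
The positions described above can be checked directly.  The following is a rough sketch assuming Python's re module, where a match object reports its start and end positions through span(); the position after the last character is reported as 13, which is also the length of the string.

```python
import re

text = "Hello, World!"

# Print each character with the index of the position just before it.
for index, char in enumerate(text):
    print(index, repr(char))

# "^" and "$" on their own produce empty matches at a single position.
start = re.search(r"^", text)
end = re.search(r"$", text)

print(start.span())  # (0, 0)
print(end.span())    # (13, 13)
print(len(text))     # 13
```

Both matches are empty (the start and end of each span are equal), which is what we would expect from tokens that describe a position instead of a character.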

Let us now attempt to apply this understanding to evaluating regular expressions.  The regular expression ^H will attempt to match the character "H" at the beginning of a string of text.  Given the "Hello, World!" example string of text we have been working with, the regex ^H will match the "H" at the beginning of the string of text!  Alternatively, the regex ^e will attempt to match the character "e" at the beginning of a string of text.  Given our example, the regex will not return a match.  Although an "e" character exists in the string of text "Hello, World!", the "e" is not found at the BEGINNING of the string.
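
Here is how those two cases might look in practice.  This is only a sketch assuming Python's re module, where re.search returns None when there is no match; the variable names are mine.

```python
import re

text = "Hello, World!"

match_h = re.search(r"^H", text)          # "H" right after the start position
match_e = re.search(r"^e", text)          # "e" is not at the start
match_e_anywhere = re.search(r"e", text)  # without "^", the "e" is found

print(match_h.group())          # H
print(match_e)                  # None
print(match_e_anywhere.span())  # (1, 2)
```

The last line shows that the "e" is in the text at position 1; it is the "^" that rules it out.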

We can apply this understanding similarly with the "$" metacharacter.  The regex !$ will attempt to match the character "!" at the end of a string of text.  Given the "Hello, World!" example string of text we have been working with, the regex !$ will match the "!" at the end of the string of text!  Alternatively, the regex d$ will attempt to match the character "d" at the end of the string of text.  Given our example, the regex will not return a match.  Although a "d" character exists in the string of text "Hello, World!", the "d" is not found at the END of the string.
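
The same kind of check works for "$".  As before, this is a sketch assuming Python's re module rather than anything specific to the course.

```python
import re

text = "Hello, World!"

match_bang = re.search(r"!$", text)       # "!" right before the end position
match_d = re.search(r"d$", text)          # "d" is not the last character
match_d_anywhere = re.search(r"d", text)  # without "$", the "d" is found

print(match_bang.span())        # (12, 13)
print(match_d)                  # None
print(match_d_anywhere.span())  # (11, 12)
```

One caveat for this particular language: in Python, "$" also matches just before a trailing newline at the very end of a string.  Our example string has no newline, so that does not come into play here.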

Many more thorough and overly verbose explanations of regular expression inner workings, special characters, and applications to follow.
