Thursday, August 29, 2013

Another Tongue-Twister: Regex-Directed Regular Expression Engine Greediness


Last post we discussed the eagerness of regular expression (regex) engines with respect to the "|" metacharacter.  Today we will look at another personification of the operation of regex engines, namely that of "greediness".

As a disclaimer, keep in mind that this post is covering regex-directed engines, not text-directed engines.

Regex-directed engines are "greedy" in that they will return the largest-possible regex match any given regex can afford the engine to return.  In other words, the engine is "greedy" to provide us with the largest-possible match that a regex provided.  The way this works out in practice, is that the right-most valid match for a token will be returned whenever greedy metacharacters are involved.  Take for example the regex
".+"
that is matched to the String 'I said, "Hello."  In return, she said "Hi!"'.  The returned match will be '"Hello." In return, she said "Hi!"', even though it would appear that the intention would have been to return either '"Hello."' or '"Hi!"'.  The engine was greedy to return the largest match the supplied regex could muster given the provided String, therefore, moving left-to-right, as soon as the engine found a proper match of the second """regex token, the engine attempted to go even farther along the String in order to find yet another """ match.  Only when the engine could not find another instance of """ in the String did it back-track in order to return the most recent (and greediest) match.

The above example is one of "repetition", or the use of the "+" character, informing the regex engine to match the preceding token one or more times.  The preceding token was the "." special character, which means "any character except \n".  Given this context, the "+" metacharacter will greedily match the largest String that is contained within two double-quote characters, including the double-quote characters.

Other repetition metacharacters include "*" and "?".  The "*" metacharacter informs the regex engine to match the preceding token zero or more times.  Notice the slight nuance between "*" and "+"; one will match zero or more times while the other will match one or more times, respectively.  Meanwhile the "?" metacharacter will match the preceding token one time or not at all.  While this is sometimes referred to as an optional character, it is technically still a form of repetition.  

There are a number of ways to defeat the greedy nature of "*", "+", and "?".
  1. A "?" after a greedy metacharacter will override the greediness of the operator
    1. ".+?" will evaluate ""Hello."" before returning the initial evaluation
  2. Modify the token preceding the repetition metacharacter to be the negation of the intended stopping point
    1. "[^"]+" will also evaluate as ""Hello""
We have now covered the eager and greedy metacharacters of regex-directed regular expression engines.  These personified characteristics of regex engines are important to keep in mind when a regex is matching too little or too much when eagerness and greediness, respectively, are concerned.

Tuesday, August 13, 2013

Regex-Directed Regular Expression Engine Eagerness (say that 5 times fast)

We may not typically think of our computer software as "eager", however the personification is very useful in understanding how regular expression (regex) engines process regular expressions (regexes).  For a very brief introduction to regex, see my previous post on the underpinnings of "$" and "^" metacharacters.

As a disclaimer, keep in mind that this post is covering regex-directed engines, not text-directed engines.

Regex-directed engines are "eager" in that they will return the first possible regex match any given regex can afford the engine to return.  In other words, the engine is "eager" to provide us with the first possible match that a regex provided.  The way this works out in practice, is that the left-most tokens in a regex that constitute a valid match will be returned before any other tokens of that regex are considered.  Take for example the regex
Ok|Okay
that is matched to the String "Okay".  The returned match will be "Ok", even though it would appear that the second token of the regex, Okay, would have been a better fit.  The engine was eager to match the regex to the provided String, therefore, moving left-to-right, as soon as the engine found a proper match of the regex token Ok with the first two letters of "Okay", the engine returned a match!

The above example is one of "alternation", or the use of the "|" character, informing the regex engine that one alternative or the other is an acceptable match for the regex.  In the case of alternation, there are two ways to protect against the engine being "overly eager" with your matches.

  1. Order your alternatives in such a way that defeats the engines eager nature
    1. Okay|Ok will evaluate "Okay" before ever evaluating "Ok"
  2. Utilize the "optional item" metacharacter "?", with grouping as necessary
    1. Ok(ay)? will first evaluate "Okay", and if it is not found, evaluate a match for "Ok" next.
As an important side-note with utilizing the alternation metacharacter, it is often helpful to consider that Strings outside of our alternatives may still match a given String.  For the above example, "Okaydokay" will still match "Okay".  To protect against this, it is useful to implement the word boundary operator, \b in such a way so as to protect our alternatives, such as: \bOk(ay)?\b

With eagerness covered, in my next post, I will turn my attention to another personification of regex-directed regular expression engines: greediness.

Friday, August 9, 2013

"^", "$" and Regular Expressions

My work in edX's Software as a Service (Saas) CS169.1x has required my investigation into Regular Expressions (regex or regexp for short).  Unfortunately the course does not cover regular expressions very extensively, so I have taken to the interwebs in order to grasp a deeper understanding of these finicky expressions.  My understanding is based off of the online resource regular-expressions.info.

First off, what is a regex? A regex is a pattern that describes a certain amount of text.  This pattern is formulated off of a series of rules and symbols that are an entire study unto itself.  At a basic level, a regex is composed of "tokens".  Each token describes the characteristics for a character or set of characters that are a subset (or the entirety) of the full text.  Regexes attempt to match a sequence of tokens to a set of text and returns the matches within the text, if any.

Tokens come in many shapes, sizes, and flavors that I will almost certainly touch base on in future blog posts.  Today I will be focusing on the "^" (or caret) and "$" tokens, or more specifically "special characters" or "metacharacters".

Although many are used to the "^" in the context of expressing exponents in mathematics (e.g. 10^2 representing ten to the second power), the "^" in the context of regex represents "before the first character of a string of text".  Therefore, given the string of text,
Hello, World!
the "^" position would be just before the "H".

Even more prevalent than the caret's contextual meaning in mathematics, the "$" is most commonly understood as the dollar-sign representation of money in text (e.g. $100.00).  However the "$" in the context of regex represents "after the last character of a string of text".  Again, visiting our string of text,
Hello, World!
the "$" position would be just after the "!".

To better understand how the regex engine is able to work on the text in this way, let us consider how a computer understands a string of text.  A string of text contains not only the characters in the text (e.g. the characters: "H","e","l","l","o"," ","W","o","r","l","d","!" from our previous example), but also include the INDEXES of those characters.

An index is an integer value representing the placement of any given character in a string.  For almost all computer languages, a string's starting point is the 0 index, and that index is before the first character.  The indexes then increment upwards from there (i.e. index 0, first character, then index 1, second character, the index 2, third character, and so on).  So a more accurate viewpoint to understand our example phrase, with the indexes included, is:
0H1e2l3l4o5,6 7W8o9r10l11d12!
 Given this understanding of the fundamental nature of strings of text as understood by a computer program, we can have a greater appreciation for what is happening as a regex is evaluating "^" and "$".  As we learned earlier, "^" refers to the position before the first character in a string of text.  We can understand the "^" of our example phrase as the zeroth index of the string.  Similarly, we can understand the "$", which refers to the position after the last character in a string of text, as the void index after the "!".

Let us now attempt to apply this understanding to evaluating regular expressions.  The regular expression ^H will attempt to match the character "H" at the beginning of a string of text.  Given the "Hello, World!" example string of text we have been working with, the regex ^H will match the "H" at the beginning of the string of text!  Alternatively, the regex ^e will attempt to match the character "e" at the beginning of a string of text.  Given our example, the regex will not return a match.  Although an "e" character exists in the string of text "Hello World!", the "e" is not found at the BEGINNING of the string.

We can apply this understanding similarly with the "$" metacharacter.  The regex !$ will attempt to match the character "!" at the end of a string of text.  Given the "Hello, World!" example string of text we have been working with, the regex !$ will match the "!" at the end of the string of text!  Alternatively, the regex d$ will attempt to match the character "d" a tthe end of the string of text.  Given our example, the regex will not return a match.  Although a "d" character exists in the string of text "Hello World!", the "d" is not found at the END of the string.

Many more thorough and overly verbose explanations of regular expression inner workings, special characters, and applications to follow.

Friday, August 2, 2013

Udacity Introduction to Java Programming Complete: Complete Cheat Sheet

Yesterday I completed Udacity's Introduction to Programming course, which had a specific focus on the Java language.  Throughout the course, I kept notes for very specific concepts and language keywords that I believed would assist me on key understanding and on the completion of course assignments.  The below is a transcript of those notes as a reference and cheat sheet.

NOTE: These are my definitions and understanding.  My understanding may not be completely correct or complete.

Class - a set of objects that have the same properties
Object - a data structure with defined variables and methods
Variable - a defined data instance
Declaration - the initialization of a variable
Assignment - the process of declaring or changing a variable's value
Constructor - a specific method used when initializing a new object from a Class
Accessor - a method that leaves an object's variables unchanged
Mutator - a method that changes an object's variables
private - a method or variable accessible by the object only
public - a method or variable accessible through interfacing with the object
Instance Variable - the variables within each instance of an object
this - an undeclared variable reference to an object
Local Variable - the variables declared within the body of a method
final - used to define constants
Scanner: - a predefined class within Java, requires the import java.util.scanner;statement
               - declaration: Scanner var = new Scanner(System.in);
               - example method: int i = var.nextInt() //returns the next int input by user
               - double d = var.nextDouble() //returns the next double input by user
Sentinel Value - a value that is intended to terminate a user input loop (i.e. "-1" or "done")
NaN - "Not A Number", i.e. an undefined result sometimes thrown at runtime
Double.MIN_VALUE - the smallest positive number a double can represent
-Double.MAX_VALUE - the largest (vector) negative number a double can represent
Reading From Files: - File declaration: File inputFile = new File("input.txt");
                                 - Scanner declaration: Scanner in = new Scanner(inputFile);
ArrayLists: - declaration: ArrayList<TYPE> varName = new ArrayList<TYPE>();
                   - example method: varName.add(varPointingToTYPE);
                   - example method: varName.set(index,varPointingToTYPE);
                   - example method: TYPE foo = varName.get(index);
                   - example method: int size = varName.size();
                   - example method: TYPE foo = varName.remove(index);
                   - for loop on each object in ArrayList: for (TYPE foo : varName) {...}
Properties - instance variables that have get/set methods
Static Methods: - a method that is not called on any object
                         - Class definition:
public class Foo
{
     public static int addition (int a, int b)
         {
              return a+b;
         }
}
                         - call: Foo.addition(int1,int2); //requires Foo Class to be imported in file
Static Variables: - a variable that is not called on any object, used best for constants
                           - Class definition:
public class Foo     public static final int ONE = 1; }
                           - variable call: int a = Foo.ONE; //requires Foo Class be imported in file
Coupling: - the concept that one Class's compilation and execution is dependent upon another Class
                 - example: Class A is coupled to Class B if Class A instantiates an instance of Class B
Interface Type: - a type of structure that specifies behavior for any Class that implements it
                         - declaration: public interface Drawable { void draw(); }
                         - implementation:
public class Picture implements Drawable{     //Constructor...
     public void draw() {...} //MUST implement draw method from Drawable}
                          - use case: if Picture and Class Photo both implement Drawable, then
ArrayList<Drawable> elements = new ArrayList<Drawable>();
element.add(new Picture());
element.add(new Photo());
for (Drawable obj : elements) { obj.draw(); }
Polymorphism - the Object Oriented Programming concept that a single method call on a generic object variable can handle many different Object types (i.e. interfaces)
Encapsulation - the Object Oriented Programming concept that objects can inherit variables and methods from parent Classes
instanceof: - keyword to determine if an element is an instance of a given Class
                  - example: if (stringVar instanceof String) {..} else {...}
Inheritance: - a Class may extend the behavior of only one parent class
                   - Example:
public class Mammal
{
    private String name;
    public Mammal ()
    {
         name = "";
    }
}
public class Dog extends Mammal
{
    private int legs;
    public Dog ()
    {
         super();
         legs = 0;
    }
}
Superclass - the parent of a Class (e.g. Mammal is Dog's superclass)
Subclass - the child of a Class (e.g. Dog is Mammal's subclass)
Overloading - when two or more methods have the same name but different parameter types
Overriding - when a subclass has a different implementation of a method that is also in its superclass

Thursday, August 1, 2013

Approach to Understanding Legal vs Illegal Class-Type Variable Assignments to Subclasses and Superclasses

Whew, what a jargon-filled post title!

Before getting started in this post, let us understand the post title in greater detail so that we are all on the same page:

Approach to Understanding Legal vs Illegal Class-Type Variable Assignments to Subclasses and Superclasses

Focusing on "Class-Type Variable Assignments": this is the assignment of a variable that is of some object-type defined by a class, i.e.

Foo foo1 = new Foo();

In the above code segment, the variable foo1 is defined as having Class-type Foo and is assigned a new object instance of Class-type Foo

Moving back to the post title, we will now focus on the idea of "Class-Type Variable Assignments" -that we just covered- "to Subclasses and Superclasses".  To understand this, let's focus again on our example:


Foo foo1 = new Foo();

Here our Class-type variable (which is of type Foo) matches the instance assigned to it (again, it is of type Foo).  Since the Class-type variable and the instance share the same Class, as opposed to subclass or superclass to one-another, let's consider a different example.  In the following example, assume that Class Dog is a subclass of Class Mammal.

Mammal animal1 = new Dog();

In the above example, we see that the Class-type variable animal1 of type Mammal is being assigned to an instance of the Dog Class, which is a subclass of Mammal (i.e. Mammal is the superclass of Dog).  This happens to be a legal assignment, but I am getting ahead of myself.  Consider the converse example:

Dog animal2 = new Mammal();

In the above example, we see that the Class-type variable animal2 of type Dog is being assigned to an instance of the Mammal Class, which is a superclass of Dog.  This happens to be an illegal assignment, but again I am getting ahead of myself.  The take-away at this point we should now understand what is meant by "Class-Type Variable Assignments to Subclasses and Superclasses".

As already noted, some of these assignments are legal, and some are illegal.  Therefore, putting it all together, this post is intended to explain my approach to understanding which of these assignments are legal and which are illegal.  After that verbose intro, let's get started!


In order to explain my approach, we'll keep with my previous example of Dog and Mammal: Class Mammal is a superclass of Dog, i.e. Class Dog extends Mammal, i.e. Dog is a subclass of Mammal.

Given this hierarchy of inheretance, I was able to come up with the following approach that makes a lot of sense to me, and can hopefully assist others:

  • When creating an instance of the Mammal class, the resultant Object knows everything there is to know about being a Mammal and expects to point to one.
  • When creating an instance of the Dog class, the resultant Object knows everything there is to know about being a Dog AND a Mammal and expects to point to both.
    • This behavior is due to the Dog being a subclass of the Mammal Class (Dog extends Mammal)
Keeping these ideas in mind,
Given:
Mammal animal1 = new Mammal();
Dog animal2 = new Dog();

consider the following cases:

  • animal1 = animal2;
  • animal2 = animal1;
  • animal2 = new Mammal();
  • animal1 = new Dog();
Case: animal1 = animal2;
Here, we are assigning a Dog (animal2) to a Mammal (animal1), i.e. we are pointing the Mammal to an object that knows everything there is to know about being a Mammal AND a Dog. For animal1, it is satisfied because its expectation to point to a Mammal is fulfilled! Therefore, the operation is legal.
Case: animal2 = animal1;

Here, we are assigning a Mammal to a Dog, i.e. we are pointing the Dog to an object that knows everything there is to know about being a Mammal ONLY. For animal2, it is unsatisfied because its expectation to point to a Mammal AND a Dog is unfulfilled! Therefore, the operation is illegal.

Case: animal2 = new Mammal();

Here, we are assigning a Mammal to a Dog, i.e. we are pointing the Dog to an object that knows everything there is to know about being a Mammal ONLY. For animal2, it is unsatisfied because its expectation to point to a Mammal AND a Dog is unfulfilled! Therefore, the operation is illegal.

Case: animal1 = new Dog();

Here, we are assigning a Dog to a Mammal, i.e. we are pointing the Mammal to an object that knows everything there is to know about being a Mammal AND a Dog. For animal1, it is satisfied because its expectation to point to a Mammal is fulfilled! Therefore, the operation is legal.

In Summary:
To ensure a variable assignment to an instance of a Class is legal, ensure the Class-type of the variable is of the same Class or some superclass of the instance's Class.

An exception to this rule is in the case of the Class-type being an Interface, in which assignment legality is a function of whether or not the instance Class implements the given Interface.