Why Are Regular Expressions So Confusing?


Regular expressions (also known as regex) are confusing because their symbols mean different things depending on context or position.  Of everything I’ve seen that resembles a programming language, regular expressions have the most symbols that carry multiple meanings.  Regular expressions are not really a programming language.  But if they were, they would hold that title.

For those familiar with regular expressions, you know that the dot is a special wildcard character.  But it is a normal literal character when inside square brackets.  The dot also loses its special wildcard meaning when preceded by a backslash.  That is because the backslash is itself a special character that turns special characters into literals, including itself, as in backslash backslash.

The caret has special meaning as the first character of a regular expression.  It is the start-of-string anchor.  Anywhere else outside brackets it keeps that anchor meaning (which is only useful in a few spots, such as right after an alternation bar), so it does not become a literal unless preceded by a backslash.  The caret also has special meaning as the first character inside square brackets.  There it means “not”.  But it behaves just like any other literal character when it is not in the first position of a bracketed expression.

For those not familiar with regular expression, that can sound confusing.  So let’s start from the beginning…

Using Regular Expression in Javascript

Start with a simple example. This regular expression …

/ran/

will result in a match when tested on the string “he ran and biked”.

In Javascript, you would use it like this…

[Image: regular expression in JavaScript]
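Since the original screenshot is not available here, the following is only a sketch of what such a call might look like. The pattern and the test string come from the text above and `myArray` is the name used later in this article; `myRegex` is an assumed name.

```javascript
// Assumed reconstruction: myRegex is a made-up name for illustration.
const myRegex = /ran/;
const myArray = myRegex.exec("he ran and biked");

console.log(myArray[0]);    // the matched fragment: "ran"
console.log(myArray.index); // where the match starts: 3
console.log(myArray.input); // the string that was searched
```

If nothing matches, `exec` returns `null` instead of an array, so real code would usually check for that first.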

And it would output into the browser console an array of the matched fragment, the index where it matched the string, and the input string…

[Image: JavaScript output]

Regular Expression enclosed by slashes

What is unusual is that regular expressions are enclosed by forward slashes / instead of the quotes typically used for strings.  That means you can have a regular expression that matches literal quotes …

[Image: matches quotes]

This will result in a match.   Quotes in a regular expression are literals.
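As a rough illustration (the pattern and test string here are assumptions, not the ones from the original screenshot), a double quote can appear in a regex literal without any escaping:

```javascript
// Hypothetical example: the quotes inside the slashes are plain literals.
const quoteRegex = /"ran"/;
const quoteMatch = quoteRegex.exec('he "ran" and biked');

console.log(quoteMatch[0]);    // "ran" with the quotes included
console.log(quoteMatch.index); // 3
```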

Dot Wild Character in Regular Expression

But a dot in regular expression is not a literal, it is a wild card character.  So the regular expression…

/r.n/

will match “ran”, “run”, and “ron”.  But it will not match “Ron”, because regular expressions are case sensitive unless you add the “i” modifier.

[Image: wildcard matching]

The “myArray” will be null indicating no match.
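A small sketch of the wildcard behavior described above. The word list and the variable names other than `myArray` are assumptions made for illustration.

```javascript
const wildcard = /r.n/;

// Test the same pattern against several candidate words.
const candidates = ["ran", "run", "ron", "Ron"];
const outcomes = candidates.map((word) => wildcard.test(word));
console.log(outcomes); // [true, true, true, false]

// exec returns null when nothing matches.
const myArray = wildcard.exec("Ron");
console.log(myArray); // null
```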

Regular Expression Modifier

Adding the “case insensitive” modifier “i” after the closing slash makes the same pattern match “Ron” as well …

[Image: case insensitive modifier]

Another modifier is “g”, which performs a global match so that it finds all matches rather than stopping after the first match.  However, the exec method in JavaScript still returns only one match per call.  With “g” set, each further call continues from where the previous match ended.

The “m” modifier performs multiline matching, so that ^ and $ also match at the start and end of each line.
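The three modifiers can be tried side by side. This is only a sketch with made-up test strings and variable names:

```javascript
// i: ignore case
const caseInsensitive = /r.n/i.test("Ron"); // true

// g: String.prototype.match returns every match at once
const allMatches = "ran run ron".match(/r.n/g); // ["ran", "run", "ron"]

// g with exec: one match per call, resuming after the previous match
const globalRegex = /r.n/g;
const firstCall = globalRegex.exec("ran run");  // "ran" at index 0
const secondCall = globalRegex.exec("ran run"); // "run" at index 4

// m: ^ also matches right after a line break
const withM = "ran\nrun".match(/^r.n/gm);   // ["ran", "run"]
const withoutM = "ran\nrun".match(/^r.n/g); // ["ran"]

console.log(caseInsensitive, allMatches, firstCall[0], secondCall[0], withM, withoutM);
```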

Literal Escaping in Regular Expression

If you want the dot to be a literal instead of a wild card, you escape it with a backslash …

/r\.n/

This will match only strings containing “r.n” and will not match “ran” and “run” like before.

The Double Meaning of Backslash in Regular Expression

You just saw that a backslash in front of a special character (like the wildcard character) turns it into a literal character.   But a backslash in front of a non-special character can sometimes give it special meaning.

The documentation at Mozilla writes it like this …

“A backslash that precedes a non-special character indicates that the next character is special and is not to be interpreted literally. … A backslash that precedes a special character indicates that the next character is not special and should be interpreted literally.”

Confused yet?   For example, \b has a special meaning of “word boundary”.

/\bran\b/

will match “he ran and biked”, but will not match “he random and biked”.

The backslash is a special character.  So putting another backslash in front of it will make it non-special.  This is how you can match a literal backslash …

/c:\\temp/

matches “c:\temp”.
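The escaping rules above can be checked with a few quick tests. The variable names are assumptions for illustration. Note that a JavaScript string literal also needs its backslash doubled, so "c:\\temp" below holds the text c:\temp.

```javascript
// Escaped dot: only a literal "." matches.
const literalDot = /r\.n/;
const dotResults = [literalDot.test("r.n"), literalDot.test("ran")]; // [true, false]

// \b gives the backslash-letter pair a special meaning: word boundary.
const wholeWord = /\bran\b/;
const wordResults = [
  wholeWord.test("he ran and biked"),    // true
  wholeWord.test("he random and biked"), // false
];

// Escaped backslash: matches one literal backslash.
const windowsPath = /c:\\temp/;
const pathResult = windowsPath.test("c:\\temp"); // true

console.log(dotResults, wordResults, pathResult);
```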

Shorthand Characters in Regular Expression

The backslash is also used for shorthand characters …

\d is “digit character” and shorthand for [0-9]

\w is “word character” and shorthand for [A-Za-z0-9_]    It includes alphanumeric characters and the underscore, but not the hyphen.  In certain flavors of regular expression, it may include other characters.

\b is “word boundary”.

\s is a “white space” character.  Exactly what counts as a “white space character” depends on the regular expression flavor.  But typically, it includes space, tab, line break, and form feed.  So it is roughly shorthand for [ \t\r\n\f]

To match line breaks, use \n, \r, and \r\n

Don’t get confused with \s and \S

\s is a “whitespace” character.

\S is a “non-whitespace” character

Similarly, \D is a non-digit and \W is a non-word character.  And \B is “not a word boundary”.
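Here is a short sketch of these shorthands in JavaScript. The sample strings and variable names are made up for illustration.

```javascript
const sample = "Room 42-b is free";

const digits = sample.match(/\d/g);        // ["4", "2"]
const words = sample.match(/\w+/g);        // ["Room", "42", "b", "is", "free"] (hyphen is not a word character)
const spaces = sample.match(/\s/g).length; // 3
const chunks = sample.match(/\S+/g);       // ["Room", "42-b", "is", "free"]
const digitsOnly = "a1b2".replace(/\D/g, ""); // "12"

// Splitting on any of the three line break styles; \r\n is listed first
// so that a Windows line break counts as one break, not two.
const lines = "one\r\ntwo\nthree\rfour".split(/\r\n|\r|\n/);

console.log(digits, words, spaces, chunks, digitsOnly, lines);
```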

In certain flavors of regular expression, there are \A and \Z.  But they are not shorthand characters.  They are metacharacters.

There is also \u for Unicode.  But some flavors use \x.

Regular Expression Meta Characters

The metacharacter ^ indicates the start of string or line

$ indicates the end of string/line.

These are known as the start and end anchors.

\A and \Z mean something similar, but not exactly the same.  The difference has to do with line breaks and multiline mode.
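A minimal sketch of the ^ and $ anchors in JavaScript follows. JavaScript does not provide \A and \Z, so they are left out; the test strings and names are assumptions.

```javascript
const startsWithRan = /^ran/;
const endsWithBiked = /biked$/;
const exactlyRan = /^ran$/;

const anchorResults = [
  startsWithRan.test("ran and biked"),    // true
  startsWithRan.test("he ran"),           // false
  endsWithBiked.test("he ran and biked"), // true
  exactlyRan.test("ran"),                 // true
  exactlyRan.test("he ran and biked"),    // false
];

// In multiline mode, ^ also matches at the start of each line.
const secondLine = [/^run/.test("ran\nrun"), /^run/m.test("ran\nrun")]; // [false, true]

console.log(anchorResults, secondLine);
```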

Other metacharacters are …

* means the thing before it can occur zero or more times.

+ means the thing before it can occur one or more times.

?  means the thing before it can occur zero or one time.

Quantified Repetition

While * and + allow unlimited repetition of the item preceding them, the metacharacters { and } form a quantified repetition.

/a{2,4}/ matches the letter ‘a’ occurring 2, 3, or 4 times.

/b{0,2}/ matches the letter ‘b’ occurring 0, 1, or 2 times.

/\d{3}/ matches a three digit number

/c{3,}/ matches the letter ‘c’ occurring 3 or more times (a missing second parameter is taken as infinity).

So the * is equivalent to {0,}

The + is equivalent to {1,}

The ? is equivalent to {0,1}
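The equivalences above can be checked with a quick sketch. The helper `testAll` and the word lists are hypothetical, and the patterns are anchored with ^ and $ so that the whole word has to fit.

```javascript
// Hypothetical helper: test one pattern against a list of words.
const testAll = (regex, list) => list.map((word) => regex.test(word));

const samples = ["ac", "abc", "abbc"];

const star = testAll(/^ab*c$/, samples);          // [true, true, true]
const starBraces = testAll(/^ab{0,}c$/, samples); // same as star
const plus = testAll(/^ab+c$/, samples);          // [false, true, true]
const plusBraces = testAll(/^ab{1,}c$/, samples); // same as plus
const optional = testAll(/^ab?c$/, samples);      // [true, true, false]
const optionalBraces = testAll(/^ab{0,1}c$/, samples); // same as optional

// Bounded repetition: 2 to 4 occurrences of "a".
const twoToFour = testAll(/^a{2,4}$/, ["a", "aa", "aaaa", "aaaaa"]); // [false, true, true, false]

console.log(star, plus, optional, twoToFour);
```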

Lazy Expressions

Quantifiers are “greedy” by default, meaning they match as much as possible.  Make one “lazy”, so that it matches as little as possible, by adding ? after *, +, ?, or {min,max} like …

*?

+?

??

{min,max}?

The difference …

[Image: greedy match]

[Image: lazy match]
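The original screenshots are not available here, so the following is an assumed example of the difference, using a made-up HTML fragment:

```javascript
const html = "<b>one</b> and <b>two</b>";

// Greedy: .* grabs as much as it can, so the match runs to the last </b>.
const greedy = html.match(/<b>.*<\/b>/)[0];

// Lazy: .*? grabs as little as it can, so the match stops at the first </b>.
const lazy = html.match(/<b>.*?<\/b>/)[0];

console.log(greedy); // "<b>one</b> and <b>two</b>"
console.log(lazy);   // "<b>one</b>"
```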

What Do Square Brackets Mean in Regular Expression

Square brackets define a “character class”, or “character set”.  For example,

/r[au]n/

will match “ran” or “run”.  The brackets represent a single character that can be any one of those listed inside the brackets.

For a more complex example…

/r[\d\s]n/

matches “r8n” and “r n”.

Whereas …

/r\d\sn/

matches “r8 n”.

You can test it out in regexpal.com …

[Image: regexpal]

But when entering the regular expression there, don’t put the enclosing forward slashes.

The hyphen inside the brackets is special when it sits between two characters.  It represents a character range.

[0-9] is equivalent to [0123456789]

A dot inside brackets is a literal dot.  A dot outside brackets is a special wildcard character.  The dot wildcard character can be thought of as a character set of all characters (in most flavors, all except line breaks by default).
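As a sketch, the character class examples from this section can be verified like this (variable names are assumptions):

```javascript
const eitherVowel = /r[au]n/;
const digitOrSpace = /r[\d\s]n/;
const digitThenSpace = /r\d\sn/;
const dotInBrackets = /r[.]n/;

const classResults = {
  ran: eitherVowel.test("ran"),           // true
  ron: eitherVowel.test("ron"),           // false
  r8n: digitOrSpace.test("r8n"),          // true
  rSpaceN: digitOrSpace.test("r n"),      // true
  oneOnly: digitOrSpace.test("r8 n"),     // false, the class stands for exactly one character
  sequence: digitThenSpace.test("r8 n"),  // true
  literalDot: dotInBrackets.test("r.n"),  // true
  notWildcard: dotInBrackets.test("ran"), // false
};

// A range and its spelled-out form find the same digit.
const sameDigit = "a7".match(/[0-9]/)[0] === "a7".match(/[0123456789]/)[0];

console.log(classResults, sameDigit);
```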

Negative character sets in Regular Expression

If the ^ symbol is the first character inside a character set, it negates the whole character set.  So for example,

/[^aeiou]/

matches any single character that is not a, e, i, o, or u.  That covers every non-vowel, including digits, spaces, and punctuation (and uppercase vowels too, since matching is case sensitive).  Note that this is not the same as the ^ start anchor metacharacter that is used at the start of a regular expression and outside the brackets.
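The three roles of the caret can be compared in a short sketch. The test strings and names are made up for illustration.

```javascript
// First inside brackets: negation.
const nonVowels = "regex".match(/[^aeiou]/g); // ["r", "g", "x"]
const upperCaseE = /[^aeiou]/.test("E");      // true, "E" is not in the lowercase set

// Outside brackets: start anchor.
const startsWithVowel = [/^[aeiou]/.test("apple"), /^[aeiou]/.test("pear")]; // [true, false]

// Inside brackets but not first: a literal caret.
const literalCaret = [/[a^]/.test("2^3"), /[a^]/.test("2*3")]; // [true, false]

console.log(nonVowels, upperCaseE, startsWithVowel, literalCaret);
```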

Grouping Metacharacters and Back References

You’ve seen {} and [].  There is also (), known as the grouping metacharacters.  They work as expected.  They group things, like parentheses in math.

The data (not the expression) that is in the grouping character is saved so that it can be back referenced with \1, \2, \3, \4, \5, \6, \7, \8, and \9.

For example

/<(b|strong)>.*?<\/\1>/

will match

<strong>hi</strong>

and

<b>world</b>

but not

<strong>test</b>

However, certain flavors of regex (such as the search-and-replace features in some text editors) use $1, $2, … , $9 as back references.
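In JavaScript both notations show up: \1 inside the pattern, and $1 in the replacement string passed to `replace`. The sketch below uses assumed variable names, and the forward slash of the closing tag is escaped because it would otherwise end the regex literal.

```javascript
const pairedTag = /<(b|strong)>.*?<\/\1>/;

const tagResults = [
  pairedTag.test("<strong>hi</strong>"), // true
  pairedTag.test("<b>world</b>"),        // true
  pairedTag.test("<strong>test</b>"),    // false, closing tag differs from group 1
];

// In the replacement string, $1 and $2 refer to the captured groups.
const rewritten = "<b>world</b>".replace(/<(b|strong)>(.*?)<\/\1>/, "[$1: $2]");

console.log(tagResults, rewritten); // rewritten is "[b: world]"
```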

In some cases, you don’t want a group () to be captured and back referenced.  In those cases, put ?: as the first two characters of the group.  Such as …

<(?:b|strong)>.*?</\1>

Now \1 has nothing to refer to, because the group in the parentheses is no longer captured.  We’ve now seen the ? symbol used in a third context.
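A sketch of the difference, with assumed names. The result array from `exec` gets one extra slot per capturing group, so its length shows whether anything was captured. The last line relies on how JavaScript treats \1 when there is no group 1 and the `u` flag is not set: it no longer refers to the tag name, so the paired tag is not found.

```javascript
const capturing = /<(b|strong)>/.exec("<strong>");
const nonCapturing = /<(?:b|strong)>/.exec("<strong>");

console.log(capturing.length, capturing[1]); // 2 "strong"
console.log(nonCapturing.length);            // 1, nothing was captured

// With nothing captured, \1 cannot stand for "b" or "strong" any more.
const brokenBackReference = /<(?:b|strong)>.*?<\/\1>/.test("<b>world</b>");
console.log(brokenBackReference); // false
```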

Look Ahead Assertions

But more useful are the Positive Look Ahead Assertions, denoted by ?= as the first two characters of a group, and the Negative Look Ahead Assertions, denoted by ?! as the first two characters of a group.

Suppose we want to match passwords that are exactly 6 characters long and contain at least one digit.  The ^ and $ anchors keep longer strings from matching …

^(?=.*\d).{6}$

Or passwords 6 characters long that must have at least one digit and one letter …

^(?=.*\d)(?=.*[A-Za-z]).{6}$

The look-ahead group must succeed at the current position, without consuming any characters, before the rest of the regular expression can make a match.
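A sketch of the password checks in JavaScript follows. The sample passwords and names are made up, and the patterns here are anchored with ^ and $ so that the length is checked against the whole string. The last line shows what an unanchored version would accept.

```javascript
const sixWithDigit = /^(?=.*\d).{6}$/;
const sixWithDigitAndLetter = /^(?=.*\d)(?=.*[A-Za-z]).{6}$/;

const digitResults = [
  sixWithDigit.test("abc1ef"),  // true
  sixWithDigit.test("abcdef"),  // false, no digit
  sixWithDigit.test("abc1efg"), // false, 7 characters
];

const mixedResults = [
  sixWithDigitAndLetter.test("12345a"), // true
  sixWithDigitAndLetter.test("123456"), // false, no letter
];

// Without anchors, any 6 characters somewhere in a longer string are enough.
const unanchored = /(?=.*\d).{6}/.test("abcdefg1"); // true

console.log(digitResults, mixedResults, unanchored);
```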

To match words that are followed by a period …

\b\w+\b(?=\.)

An example of Negative Look Ahead Assertion is the following to find all words not followed by a period…

\b\w+\b(?!\.)
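Both patterns can be tried on a sample sentence (the sentence and names are assumptions). Note that the period itself is never part of the match, because a look-ahead only checks what follows.

```javascript
const sentence = "He ran. She biked home.";

const beforePeriod = sentence.match(/\b\w+\b(?=\.)/g);    // ["ran", "home"]
const notBeforePeriod = sentence.match(/\b\w+\b(?!\.)/g); // ["He", "She", "biked"]

console.log(beforePeriod, notBeforePeriod);
```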

Look Behind Assertion

Positive Look Behind Assertions are denoted by the three characters ?<= in the front of a group …

(?<=regex)

Negative Look Behind Assertions are denoted by the three characters ?<! as in …

(?<!regex)

They are less widely supported than look-aheads.  JavaScript only gained them in ECMAScript 2018, so older browsers and engines do not accept them.
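JavaScript engines that implement ECMAScript 2018 or later do accept look-behind syntax. The sketch below assumes such an engine; older ones will report a syntax error on these patterns. The sample string and names are made up.

```javascript
const order = "cost: $42, qty: 7";

// Positive look-behind: digits that come right after a dollar sign.
const afterDollar = order.match(/(?<=\$)\d+/g); // ["42"]

// Negative look-behind: whole numbers that do not come right after a dollar sign.
const notAfterDollar = order.match(/(?<!\$)\b\d+/g); // ["7"]

console.log(afterDollar, notAfterDollar);
```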

Alternation Metacharacters

The | is the alternation metacharacter and works like the “OR” operator.

/dog|cat|bird/

matches “dog”, “cat”, and “bird”

/wheatgrass|germ/

matches “wheatgrass” and “germ”, but it does not match “wheatgerm” as a whole.  In “wheatgerm” it only finds the “germ” part.

/wheat(grass|germ)/

matches “wheatgrass” and “wheatgerm”.
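A sketch showing what each alternation actually picks out of “wheatgerm” (names and the extra word “fish” are assumptions):

```javascript
const pets = ["dog", "cat", "bird", "fish"].map((word) => /dog|cat|bird/.test(word));
// [true, true, true, false]

// Without grouping, the alternatives are the whole words "wheatgrass" and "germ".
const ungrouped = "wheatgerm".match(/wheatgrass|germ/)[0]; // "germ"

// With grouping, "wheat" is shared and only the ending alternates.
const grouped = "wheatgerm".match(/wheat(grass|germ)/)[0]; // "wheatgerm"

console.log(pets, ungrouped, grouped);
```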

Some Examples

Possible way to match email addresses: /^[\w.%+\-]+@[\w.\-]+\.[A-Za-z]{2,3}$/

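Some invalid email addresses (such as bademail@xxx.xx) will still produce a match, and addresses whose last part is longer than three letters (such as .info) will not.  But it is usually good enough for a quick check.  Recall that \w is [A-Za-z0-9_].  Here is the break-down…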

[Image: email regular expression]
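The email pattern above can be tried on a few addresses. The sample addresses other than bademail@xxx.xx are made up for illustration, and this is a rough check rather than a full validator.

```javascript
const emailRegex = /^[\w.%+\-]+@[\w.\-]+\.[A-Za-z]{2,3}$/;

const emailResults = {
  typical: emailRegex.test("someone@example.com"),       // true
  invalidButMatches: emailRegex.test("bademail@xxx.xx"), // true
  noAtSign: emailRegex.test("no-at-sign.example.com"),   // false
  longEnding: emailRegex.test("someone@example.info"),   // false, ending has 4 letters
};

console.log(emailResults);
```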

Similarly, here is a possible regex to match URLs…

 

[Image: url regular expression]

 

 

 

