
Jay

Le geai revient!

*
New Member

9


July 2006
Okay, I finally finished it; an observation I've been picking at little by little each day. ;D
Originally posted at: G101 Design

Deeply Define your RegEx. An Observation by Jay.

Ahh, you might ask yourself: Strings and RegExp. ??? Where do I start?

Not here. This observation involves discussion for the more adept programmer, one who knows the general ropes of RegEx and how Regular Expressions are used. There will be (what I call) general discussions, coupled with advanced techniques, however. Never be discouraged, though. My beginner RegEx tutorials found in the Index will lead you in the right direction.


The Preface:
Regular Expressions are commonly used amongst programmers in all types of coding. They are most often used to match and manipulate a string of characters without specifying them exactly. These manipulations can be helpful to obtain information about cookies, forms, and modifications of certain complicated URL strings (just to name a few).

I see a lot of people here at Proboards use Regular Expressions, both poorly and very well. In this lesson (or observation), we'll be discussing how RegEx is used with Javascript. I want to show you how to define your RegEx more deeply so that it's never a hit-or-miss situation. You'll have more confidence and more knowledge when defining your next Regular Expression.


Digging In:
Not to waste any time, let's go ahead and start with a fairly easy (and commonly used) example. You might want to manipulate or use the PM area text.

Your string contains: Hey, Jay, you have <a href="/index.cgi?action=pm">11 messages</a>, 0 are new.
You might use: /Hey, (\w+), you have <.*>(\d+) messages<.*>, (\d+) .* new\./;

Okay, nothing new there. Once the pattern has matched, you've stored the username in RegExp.$1, the total number of messages in RegExp.$2, and the number of new messages in RegExp.$3.
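
As a rough sketch of the capture step, the snippet below runs the pattern with exec() and reads the groups from the returned array instead of RegExp.$1 through RegExp.$3, which are legacy static properties. The variable names (pmText, pmPattern, and so on) are made up for illustration.

```javascript
// Sample PM-area text from the example above.
var pmText = 'Hey, Jay, you have <a href="/index.cgi?action=pm">11 messages</a>, 0 are new.';
var pmPattern = /Hey, (\w+), you have <.*>(\d+) messages<.*>, (\d+) .* new\./;

// exec() returns null when nothing matches, otherwise an array:
// [whole match, group 1, group 2, group 3]
var pmMatch = pmPattern.exec(pmText);

var userName = pmMatch ? pmMatch[1] : null;            // "Jay"
var totalMessages = pmMatch ? Number(pmMatch[2]) : 0;  // 11
var newMessages = pmMatch ? Number(pmMatch[3]) : 0;    // 0

console.log(userName, totalMessages, newMessages);
```

Checking for null first keeps the script from throwing an error on pages where the PM text is missing or worded differently.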

You will match correctly, but to deeper define, you might use:
/Hey, ([a-zA-Z0-9]*), you have <.*>([0-9]+) messages<.*>, ([0-9]+) .* new\./;

I'm now going to explain for those who have never seen the above syntax before.

Notice we used [a-zA-Z0-9]* to snatch the username. We're testing the string for any alphanumeric character: a-z defines the lowercase letters, A-Z defines the uppercase letters, and 0-9 defines the digits. Unlike \w, this class does not accept an underscore. It is followed by a * quantifier, which matches the previous item zero or more times (use + instead if an empty username should never match).

With that in mind, you can use [0-9]+ to get the PM numbers. You're only asking the string if it contains a numeric character. We follow up with a + quantifier to match the previous item one or more times. The range operator inside the brackets is the dash. If you're only looking for letters a-q, you would use [a-qA-Q]*, and numbers 2-6 would be [2-6]+. To match any digit, use the shorthand \d+. Let's move on.
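
To see the ranges and the two quantifiers in action, here is a small sketch. The ^ and $ anchors are my own addition so that the whole string has to fit the class; the sample strings are made up.

```javascript
// [a-qA-Q]* : zero or more letters from a through q, either case
var lettersAtoQ = /^[a-qA-Q]*$/;
// [2-6]+ : one or more digits from 2 through 6
var digits2to6 = /^[2-6]+$/;

console.log(lettersAtoQ.test('Jade'));  // true
console.log(lettersAtoQ.test('Jay'));   // false, "y" is outside a-q
console.log(lettersAtoQ.test(''));      // true, * allows zero characters
console.log(digits2to6.test('2635'));   // true
console.log(digits2to6.test('2735'));   // false, 7 is outside 2-6
console.log(digits2to6.test(''));       // false, + needs at least one digit

// \w accepts an underscore, [a-zA-Z0-9] does not
console.log(/^\w+$/.test('Jay_1'));          // true
console.log(/^[a-zA-Z0-9]+$/.test('Jay_1')); // false
```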


Character Classes:
Certain situations might ask that you use character classes to deeper define your RegEx. A character class is nothing more than a set of characters inside [] brackets; it matches exactly one character from that set. Take the following pattern for example:

/[jrmf]ailed/;

You might have guessed that this will match "jailed", "railed", "mailed", and "failed". However many characters you list inside the brackets, a character class only ever matches one of them at a time. Let's look at an example that does not do what was intended:

/[chthrst]ill/;

Here you are trying to match "chill", "thrill", and "still", but the class cannot treat "ch", "thr", or "st" as a unit; it only picks one character from the set (and listing "h" or "t" twice changes nothing). Instead you will match "cill", "hill", "till", "rill", and "sill", so inside "chill" only the "hill" part is matched; something you probably weren't after.
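
A quick sketch of what the two classes actually grab. The test words are my own picks; match() returns the matched text in slot 0.

```javascript
var ailed = /[jrmf]ailed/;
var ill = /[chthrst]ill/;

console.log(ailed.test('mailed'));     // true
console.log(ailed.test('nailed'));     // false, "n" is not in the class

// The class consumes a single character, so only part of each word is matched.
console.log('chill'.match(ill)[0]);    // "hill"
console.log('thrill'.match(ill)[0]);   // "rill"
console.log('still'.match(ill)[0]);    // "till"
```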

The classic phone number example:

Let's say we're validating a standard 10-digit phone number. We're checking to see if it's in proper syntax (nnn-nnn-nnnn): area code, prefix, and suffix, with dashes in their proper places. Our pattern could look like:

/[0-9]{3}-[0-9]{3}-[0-9]{4}/;

We've asked that it contain a number from 0-9, three times ({3}), followed by a dash, followed by a number from 0-9, three times ({3}), followed by a dash, followed by a number from 0-9, four times ({4}). Curly braces are quantifiers that indicate how many times the previous item must match. Because the pattern has no ^ or $ anchors, it will also match a phone number that appears inside a longer string.

To further validate our phone number example, I've written a simple function below to return true in the case that the number is in valid format, and false if it isn't.

<script type="text/javascript">
<!--

function do_validate(the_num) {

    // Valid phone number Regular Expression.
    // ^ and $ anchor the pattern so nothing may come before or after the number.
    var valid_phone_number = /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/;

    // The pattern already requires a '-' at positions 3 and 7,
    // so there is no need to loop through the characters by hand.
    // Returns true for a valid phone number, false otherwise.
    return valid_phone_number.test(the_num);
}

document.write('Is this a valid phone number? ' + do_validate('434-664-4543'));

//-->
</script>
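
Without ^ and $, the pattern /[0-9]{3}-[0-9]{3}-[0-9]{4}/ accepts any string that merely contains a phone number. Here is a minimal sketch of the difference, assuming you want the whole string to be the number; the names loosePhone and strictPhone are only illustrative.

```javascript
var loosePhone = /[0-9]{3}-[0-9]{3}-[0-9]{4}/;
var strictPhone = /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/;

console.log(loosePhone.test('434-664-4543'));             // true
console.log(loosePhone.test('call 434-664-45431 now'));   // true, found inside the text
console.log(strictPhone.test('434-664-4543'));            // true
console.log(strictPhone.test('call 434-664-45431 now'));  // false, extra characters
console.log(strictPhone.test('4346644543'));              // false, dashes missing
```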


Moving on.

Quantifiers:
To specify the number of times an item or character may occur in a string, we use quantifiers. A quantifier can be a simple asterisk *, noting that the previous item may occur zero or more times. Take a look at the pattern below:

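/^[A-G][a-sA-D0-1_-]*$/;

Can you guess what I tried to match? "G101-Design". This pattern will also match "G101_Design" and "G101Design", but not "_G101-Design", nor "G101 Design". The $ at the end matters: without it the pattern would stop at the space and still match the "G101" in "G101 Design". Note how the dash is placed last in the class, right after the underscore, so it is read as a literal dash rather than a range operator. This allows you to check the string for underscores as well as dashes. Let's look at one more example: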
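
A quick way to check these claims is to run the candidates through the pattern, both with and without a trailing $ anchor. This is only a sketch; the variable names are made up here.

```javascript
var withoutEnd = /^[A-G][a-sA-D0-1_-]*/;
var withEnd = /^[A-G][a-sA-D0-1_-]*$/;

var names = ['G101-Design', 'G101_Design', 'G101Design', '_G101-Design', 'G101 Design'];

for (var i = 0; i < names.length; i++) {
    console.log(names[i], withoutEnd.test(names[i]), withEnd.test(names[i]));
}

// Without $, the match simply stops at the first character outside the class.
var partial = withoutEnd.exec('G101 Design')[0];  // "G101"
console.log(partial);
```

Only the anchored version rejects "G101 Design"; the unanchored one is satisfied by the leading "G101".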

/[^a-zA-Z0-9]+/;

Notice the caret (^) placed in the character class just before a-z. It negates the class, so the pattern requires the string to contain at least one non-alphanumeric character. A negated character class matches any single character that is not listed inside the brackets. "G101", "SSD", and "ILikePizza" would not match, whereas "G101!", "SSD_", and "Pizza!!1" would. "I like Pizza" would match too, because a space is not alphanumeric.
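
Here is a small sketch of the negated class at work. Keep in mind that spaces and underscores count as non-alphanumeric; the sample strings are just for illustration.

```javascript
var hasNonAlnum = /[^a-zA-Z0-9]+/;

console.log(hasNonAlnum.test('G101'));          // false
console.log(hasNonAlnum.test('SSD'));           // false
console.log(hasNonAlnum.test('G101!'));         // true
console.log(hasNonAlnum.test('SSD_'));          // true, underscore is not in the class
console.log(hasNonAlnum.test('I like Pizza'));  // true, the spaces match

// The + grabs the whole run of non-alphanumeric characters.
console.log('Pizza!!1'.match(hasNonAlnum)[0]);  // "!!"
```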


Quantifier Cheatsheet:
Below is a list of quantifiers you can use to match previous characters or items.

? - Matches the previous item zero or one times.
* - Matches the previous item zero or more times.
+ - Matches the previous item one or more times.
{n,x} - Matches the previous item at a minimum of n times, but no more than x times (no space after the comma).
{n,} - Matches the previous item n or more times.
{n} - Matches the previous item exactly n times.
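
The cheatsheet can be tried out with a few one-liners. This is only a sketch; the patterns are anchored with ^ and $ (my addition) so the counts apply to the whole string.

```javascript
var results = {
    optionalU:    /^colou?r$/.test('color') && /^colou?r$/.test('colour'), // ? : zero or one "u"
    tooManyU:     /^colou?r$/.test('colouur'),                             // two "u"s is too many
    starEmpty:    /^a*$/.test(''),        // * accepts zero items
    plusEmpty:    /^a+$/.test(''),        // + needs at least one
    rangeLow:     /^a{2,3}$/.test('a'),   // below the minimum
    rangeOk:      /^a{2,3}$/.test('aaa'), // within 2 to 3
    rangeHigh:    /^a{2,3}$/.test('aaaa'),// above the maximum
    atLeastTwo:   /^a{2,}$/.test('aaaaa'),// 2 or more
    exactlyTwo:   /^a{2}$/.test('aa'),    // exactly 2
    notExactly:   /^a{2}$/.test('aaa')    // 3 is not exactly 2
};

console.log(results);
```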


More Character Classes:
Character classes are not limited to just [] brackets. Shorthand classes and escape sequences let us match digits, spaces, line breaks, tabs...the list goes on.

. - Matches any single character except a line break.
\w - Matches any word character (a letter, a digit, or an underscore).
\W - Matches any non-word character.
\s - Matches any whitespace character (space, tab, line break, etc.).
\S - Matches any non-whitespace character.
\d - Matches any digit (0-9).
\D - Matches any non-digit character.
\n - Matches a line break.
\b - Matches a word boundary, the empty position between a \w and a \W character.
\B - Matches any position that is not a word boundary.
\t - Matches a tab.
\v - Matches a vertical tab (very rarely used).
\f - Matches a form feed.
\r - Matches a carriage return.
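
A few of these shorthands in use, as a rough sketch. The sample strings are made up for illustration.

```javascript
// \w : letters, digits, and underscore
var isWord = /^\w+$/.test('Jay_9');                     // true
// \s : split on runs of whitespace (spaces, tabs, line breaks)
var parts = 'a b\tc\nd'.split(/\s+/);                   // ['a', 'b', 'c', 'd']
// \d : pull out every run of digits
var numbers = '11 messages, 0 are new'.match(/\d+/g);   // ['11', '0']
// \b : whole-word match only, so the "cat" inside "concat" is skipped
var wholeWords = 'cat concat'.match(/\bcat\b/g);        // ['cat']
// \B : the opposite, "cat" must NOT start at a word boundary
var insideIndex = 'concat'.search(/\Bcat/);             // 3
// . : any character except a line break
var dotMatchesNewline = /^.$/.test('\n');               // false

console.log(isWord, parts, numbers, wholeWords, insideIndex, dotMatchesNewline);
```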


Escape Codes:
Escape codes always start with a backslash "\" and are generally used if your string contains slashes "/", asterisks, question marks, etc:

\/ - Matches a forward slash "/"
\\ - Matches a backslash "\"
\. - Matches a period.
\* - Matches an asterisk.
\+ - Matches a plus sign.
\? - Matches a question mark.
\| - Matches a vertical bar "|"
\( - Matches left parenthesis
\) - Matches right parenthesis
\[ - Matches left bracket "["
\] - Matches right bracket "]"
\{ - Matches left curly brace "{"
\} - Matches right curly brace "}"
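
To show why the escapes matter, here is a short sketch that matches the PM link from earlier literally. The variable names and the "/indexXcgi" test string are made up for illustration.

```javascript
var pmLink = '/index.cgi?action=pm';

// Slash, period, and question mark are all escaped, so they match literally.
var escaped = /^\/index\.cgi\?action=pm$/;
// Here the period is left unescaped, so it matches ANY character.
var unescapedDot = /^\/index.cgi\?action=pm$/;

console.log(escaped.test(pmLink));                        // true
console.log(unescapedDot.test('/indexXcgi?action=pm'));   // true, probably not wanted
console.log(escaped.test('/indexXcgi?action=pm'));        // false

// An unescaped + is a quantifier; escaped, it is a plain plus sign.
console.log(/1\+1/.test('1+1=2'));   // true
console.log(/1+1/.test('1+1=2'));    // false, this looks for "11", "111", ...

// A backslash in the string needs \\ in the pattern.
console.log(/\\/.test('C:\\temp'));  // true
```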


Summation:
In summary, I hope you learned something by reading this. I presented the information as clearly as possible because I believe presentation is key. Now it's your job to soak up the information, get your hands dirty, and explore the new possibilities you have learned. Comments, questions, or anything you might have to add are music to my ears. They are the building blocks that make tutorials like this possible. I value user input highly.

Jay :)


Eric




1,442


November 2005
A pretty good tutorial but there is one thing that needs to be fixed. You have a greedy regular expression up top that needs to be non-greedy.

<.*> should be <.*?>
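
A minimal sketch of the difference, using the link from the PM example. The variable names are made up here.

```javascript
var html = '<a href="/index.cgi?action=pm">11 messages</a>';

// Greedy: .* runs as far as it can, from the first "<" to the LAST ">".
var greedy = html.match(/<.*>/)[0];
// Non-greedy: .*? stops at the FIRST ">" it can.
var lazy = html.match(/<.*?>/)[0];

console.log(greedy);  // <a href="/index.cgi?action=pm">11 messages</a>
console.log(lazy);    // <a href="/index.cgi?action=pm">
```

In the tutorial's full pattern the greedy version still matches because the rest of the pattern forces it to back up, but the non-greedy form says what is meant: one tag and nothing more.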

Edit: To make the whole chill, thrill, still thing work you'd do this:
(ch|thr|st)ill
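
Wrapped in slashes, that alternation can be tried like so. This is a sketch only; the test words are my own picks.

```javascript
var illWords = /(ch|thr|st)ill/;

console.log(illWords.test('chill'));    // true
console.log(illWords.test('thrill'));   // true
console.log(illWords.test('still'));    // true
console.log(illWords.test('hill'));     // false, no "ch", "thr", or "st" in front
console.log(illWords.test('sill'));     // false

// The parentheses also capture which prefix was found.
var prefix = illWords.exec('thrill')[1];  // "thr"
console.log(prefix);
```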


Last Edit: Jul 19, 2006 15:51:08 GMT by Eric

Chris


******
Head Coder

19,519


June 2005
Didn't get anything out of it personally, but it's a very nice tutorial.... but maybe it should be in the Tutorials board? (sub-board of Code Database). If you want, I'll move it. Maybe you should challenge the users to write a RegExp to match any username they use?

Like, this one would match Chris, CD, CD-ROM, and CDDude229 (case is ignored of course)

/^C(hris|D(-ROM|Dude229)?)$/i;

Just something simple like that.
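
For anyone who wants to try the challenge, here is a quick sketch that checks the pattern above against a handful of names. The list of non-matching names is made up for illustration.

```javascript
var chrisNames = /^C(hris|D(-ROM|Dude229)?)$/i;

var shouldMatch = ['Chris', 'CD', 'CD-ROM', 'CDDude229', 'cddude229'];
var shouldFail = ['Christ', 'C', 'CD-R', 'Chris CD'];

for (var i = 0; i < shouldMatch.length; i++) {
    console.log(shouldMatch[i], chrisNames.test(shouldMatch[i]));  // true each time
}
for (var j = 0; j < shouldFail.length; j++) {
    console.log(shouldFail[j], chrisNames.test(shouldFail[j]));    // false each time
}
```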
