Home  |   Reports and Articles  |   Online Seminars  |   Donate  |   Blog  |   About Us

Tuesday, June 09, 2009

Regular (Expression) Magic

by Peter Campbell

Let's get a bit geeky. Many Idealware visitors come here for advice on purchasing and deploying data management systems, such as donor databases, constituent relation management systems and content management systems. And, more often than not, are replacing older systems with new ones, meaning that one of the trickiest tasks is data migration. If any of this work has ever fallen to you, then you might have found yourself doing tedious editing and corrections in Excel, pouring over data screens or rows in Access trying to formalize non-formalized data entry, and generally settling for some lost or incorrect data moving from old system to new.

Wouldn't it be great to have a magic wand that can instantly reformat the data to the proper format? Well, I have one for you. But, just as Harry Potter had to go to school before he could effectively wave his wand, mine comes with a lesson or two as well.

The wand in question is a search/replace language called regular expressions. Regular expressions are a set of terms that can be used, in supported software, to perform advanced search and replace functions. They were originally popularized in the Unix Stream Editor (SED), but are now standardly found in text editors, word processors, scripting languages (such as PHP) and other software, usually as an advanced option.

The reason to use them instead of a regular search and replace function is simple: they can search for things that regular search tools can't. For example:


  • the first three characters at the beginning of each line

  • the three at the end of each line

  • one or more spaces



Regular expressions can also do multiple replacements in one phrase, allowing you to either remove the first comma encountered in a sentence, or all commas. Here are the basics:

A regular expression takes the form of /Search Phrase/Replacement/. A simple search to replace all instances of the word "fish" with the word "bird" would look like:

/fish/bird/

But regular expressions only prove their worth when you learn their special characters:

. (any character)

* (one or more characters)

^ (the beginning of a line

$ (the end of a line)

() (parentheses surrounding characters in the search phrase can be recalled in the replacement)

$1, $2 (substitute in the replacement for characters saved by parentheses in the search phrase)

\ (backslashes treat the next character literally, even if it's a Regular expression special character)

[a-z], [0-9], [A-Za-z] (groupings search for all of the characters specified between the brackets, using dashes to identify ranges

Examples:

If you have a text printout of a document that you want to whittle into something more useful, like a CSV file, step one might be to remove any dead space.

/ */ /

will search for one or more spaces (the asterisk means "any number of the preceding character) and replace them with one space.

/^$/d

will remove all blank lines (lines with nothing between the beginning and the end of the line)

If you are moving data from one system to another, you might have to reformat dates for the new system. Say the old system exports dates as MM/DD/YYYY and the SQL database you're importing them to expects YYYY-MM-DD. This Regular Expression will convert all dates to the new format:

/([01][0-9])\/([0-3][0-9])\/([12][0-9][0-9][0-9])/$3-$1-$2/

Let's break this down:

/ - a slash starts the search phrase section.

( - parentheses surround things that we want to remember, so this starts a section we'll remember.

[01][0-9] - a month (MM) will be a number between 1 and 12, so, if your system is exporting dates with leading zeros (if not, you can do this with a series of regular expressions to get around that), then the [01] set will match either a leading zero or a one. The [0-9] set will match any digit following that one or zero.

) - this will be remembered in the replacement as $1, because it's the first thing we remembered.

\/ - since the slash is a regular expression special character (the delimiter), we precede it with a backslash, telling the parser to treat it a a slash, not a delimiter.

([0-3][0-9]) - this will find any pair of numbers between 01 and 39, which we know as the day, and remember it as $2, because it's enclosed in parentheses.

\/ - next slash

([12][0-9][0-9][0-9]) - this catches the year. You see how, right? It is specifying that the year will be in this millennia or the last by limiting the first character to one or two. We use parentheses to remember this as well.

/ - this slash signifies that the search phrase is done, and the replacement will follow.

$3-$1-$2 - this takes our three remembered phrases and reorders them from month, day, year to year ($3), month ($1), day ($2), placing dashes in-between them.

/ - finally, we close the command with a slash.

One of my standard uses is to take a list - which could be an Excel spreadsheet, or a database dump, or a Word table -- clean it up, and then format it into SQL statements that can then be pulled into a database. Most databases can import in CSV files, but Excel, while good at doing some reformatting, can't do the fancy cleanup tasks that my regular expression-enabled editor can. Once my specific clean-up chores are done, if I'm left with a tab-delimited file, I can do the following three simple searches to turn it into a SQL input file that can just be run in my SQL interpreter.

/\t/','/ -- searches for all tabs (\t is a symbol that means "tab") and replaces them with ','

/(.)$/$1');/ - searches for the last character in a line and replaces it with that character followed by a close quote, close parens and semi-colon.

/^(.)/insert into players (name, title, company) values ('$1/ - searches for the first character in any line and prepends the front end of the SQL statement.

If we had an input file with lines like this:

Joe Namath Quarterback Forty-niners

It would become

insert into players (name, title, company) values ('Joe Namath','Quarterback','Forty-niners');

There are plenty of excellent resources for learning about regular expressions on the web, but many of them are targeted at programmers, making them a bit thick to read through. For more friendly introductions, I recommend The regular-expressions.info quickstart. While many text-processing tools, including Microsoft Word, support regular expression search and replace, I recommend using a good text editor over a word processor, because it will likely include supporting functionality, such as block copying/pasting, and they'll handle very large files with far more speed and grace. I've been happy using TextPad and EditPlus on Windows, and TextMate and TextWrangler on the Mac. Wikipedia publishes an incomplete list of applications that include regular expression functionality.

Labels: , ,

3 Comments:

Blogger thomast said...

Great post. I love regular expressions. A couple comments:

1. Sadly, regular expression formats vary slightly from implementation to implementation, so make sure you read the documentation for whatever implementation you're using. The principles remain, and most of the wildcards and escape characters are consistent; I've found it's mostly whether or not a given program supports all regular expressions.

Also, speaking as a non-programmer who uses regular expressions only occasionally (once a month or less), but has been doing so for years, I find that I need to look up the details of the syntax almost every time I use them. I used to be frustrated by this, but have gotten over it. Once you understand the principles and power of regular expressions, even if you have to look up and experiment to get the right regular expression, you will still save huge amounts of time over making all the edits manually. The other thing that I've made myself not afraid to do is perform multiple passes - if I can quickly imagine how to get the data I need in a short series of regular expression search-and-replace operations, I'll do that, even if I know that it's possible to create a single super-regex that would do it all in one fell swoop.

2. I highly recommend Notepad++ for Windows, an open source, free as in beer and speech text editor with tabbed multi-document interface, regular expression support, and multi-level undo (which I use a lot when I'm working with regular expressions), as well as context highlighting and lots of other great stuff. Development is ongoing and new versions appear regularly. Both of the Windows editors you mention are shareware (though inexpensive at $30-35).

12:58 PM  
Anonymous Jason Lefkowitz said...

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems."

-- Jamie Zawinski

7:01 PM  
Blogger Waleed Al-Balooshi said...

When dealing with Regular Expressions I always recommend using tools that simplify writing and debugging regular expression statements.

One such tool is:

http://www.regexbuddy.com

It is not free, but it does provide you with an evaluation version.

7:48 AM  

Post a Comment

<< Home

The Idealware Blog

Thoughts and resources to help nonprofits choose software, from:

Subscribe to This Blog


Recent Posts