NAME
Lexing Without Grammars: When Regular Expressions Suck
ABSTRACT
Perl is famed for its text processing capabilities. However, sometimes the data you want to process is too complicated for regular expressions and you reach for a parser for your HTML, RTF, or other common format. This article discusses when you don't have a pre-defined parser but the text you need to work with is too complicated for regular expressions.
ORIGINAL PUBLICATION
This article was originally published by O'Reilly in January, 2006 at http://www.perl.com/pub/a/2006/01/05/parsing.html. Reproduced with permission.
s/(?<!SHOOTING YOURSELF IN THE )FOOT/HEAD/g
Most of us have tried at one time or another to use regular expressions to do things we shouldn't. Parse HTML, obfuscate code, wash dishes, etc. This is referred to by the technical term "showing off". I've done it too:
$html =~ s{
(<a\s(?:[^>](?!href))*href\s*)
(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)
([^>]+>)
}
{$1 . decode_entities($2) . $4}gsexi;
I was strutting like a peacock when I wrote that, followed quickly by eating crow when I ran it. I never did get that working right. I'm still not sure what I was trying to do. That was the regular expression which forced me to learn how to use HTML::TokeParser
. More importantly, that was the regular expression which taught me how difficult regular expressions can be.
The Problem with Regular Expressions
Let's look at that regex again:
/(<a\s(?:[^>](?!href))*href\s*)(&(&[^;]+;)?(?:.(?!\3))+(?:\3)?)([^>]+>)/
Do you know that matches? Exactly? Are you sure? Even if it works, how easily can you modify it?
If you don't know what it was trying to do (and to be fair, don't forget it's broken), how long did you spend trying to figure it out? When's the last time a single line of code gave you such fits?
The problem, of course, is that this regular expression is trying to do far more work than a single line of code is likely to do. When faced with a regular expression like that, there are a few things I like to do.
Document them carefully.
Use the /x switch so I can expand them over several lines.
Possibly encapsulate them in a subroutine.
Sometimes, though, there's a fourth option: lexing.
Lexing
When developing code, we typically take a problem and break it down into a series of smaller problems which are easier to solve. Regular expressions are code and they too can be broken down into a series of smaller problems which are easier to solve. One technique is to use lexing to facilitate this.
Lexing is the act of taking data, breaking it down into discreet tokens and assigning meaning to those tokens. There's a bit of fudging in that statement, but it pretty much covers the basics.
Lexing is typically followed by parsing whereby the tokens are then converted into something more useful. Parsing is frequently handled by having some tool which applies a well-defined grammar to the lexed tokens.
Sometimes well-defined grammars are not practical for extracting and reporting information. There might not be a grammar available for a company's ad-hoc log file format. Other times we might find it easier to process the tokens manually then to spend the time writing a grammar. And still other times we are only interested in part of the data you've lexed, not all of it. All three of these reasons apply to the following problem.
Parsing SQL
Recently on Perlmonks (http://perlmonks.org/index.pl?node_id=472684), someone had the following SQL:
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
when a=b then 'c'
else 'd'
end tough_one
from ...
where ...
What they needed from that SQL was the alias for each column. In this case, those would be "date", "months_old", "product", "year", "tough_one." Of course, they mentioned that this was only one example. There's actually plenty of generated SQL, all with subtle variations on how the columns are aliased so this is not a trivial task. What's interesting about this, though, is that we don't give a fig about anything except the column aliases. The rest of the text is merely there to help us find those aliases.
Our first thought might be to try and parse this with SQL::Statement
. As it turns out, this module does not handle CASE
statements. Thus, we're left with either trying to figure out how to patch SQL::Statement
, submit said patch, hope it gets accepted and released in a timely fashion. (Note that SQL::Statement
uses SQL::Parser
, so the latter is also not an option).
Second, a number of us have worked in environments where problems have to be solved in production now but we still have to wait three weeks for the necessary modules to be installed, if they'll be approved at all.
The most important reason, though, is even if SQL::Statement
could handle this problem, this would be an awfully short article if you used it instead of a lexer.
Lexing Basics
As mentioned earlier, lexing is essentially the task of analyzing data and breaking it down into a series of easy-to-use tokens. While the data may be in other forms, usually this means analyzing strings. To give a trivial example, consider the following:
x = (3 + 2) / y
When lexed, we might get a series of tokens like the following:
my @tokens = (
[ VAR => 'x' ],
[ OP => '=' ],
[ OP => '(' ],
[ INT => '3' ],
[ OP => '+' ],
[ INT => '2' ],
[ OP => ')' ],
[ OP => '/' ],
[ VAR => 'y' ],
);
With a proper grammar, we could then read this series of tokens and take actions based upon their values, such as build a simple language interpreter or translate this code into another programming language. Even without a grammar we can find these tokens useful as we'll see with the SQL example.
Identifying tokens
The first step in building a lexer is identifying the tokens you wish to parse. Let's take another look at the SQL.
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
when a=b then 'c'
else 'd'
end tough_one
from ...
where ...
We really don't care about anything after the from
keyword. In looking at this closer, we see that everything we do care about is immediately prior to a comma or the 'from' keyword. However, splitting on commas isn't enough as we have some commas embedded in function parentheses.
The first thing we need to do is to identify the various things we can match with simple regular expressions.
These "things" appear to be parentheses, commas, operators, keywords and random text. A first pass at it might look something like this:
my $lparen = qr/\(/;
my $rparen = qr/\)/;
my $keyword = qr/(?i:select|from|as)/; # this is all this problem needs
my $comma = qr/,/;
my $text = qr/(?:\w+|'\w+'|"\w+")/;
my $op = qr{[-=+*/<>]};
The text matching is somewhat naive and we might want Regexp::Common
for some of the regular expressions, but for now we'll keep this simple.
The operators are a bit more involved as we'll assume that some SQL might have math statements embedded in them.
Then we create the actual lexer. One way to do this is to make our own lexer. It might look something like this:
sub lexer {
my $sql = shift;
return sub {
LEXER: {
return ['KEYWORD', $1] if $sql =~ /\G ($keyword) /gcx;
return ['COMMA', ''] if $sql =~ /\G ($comma) /gcx;
return ['OP', $1] if $sql =~ /\G ($op) /gcx;
return ['PAREN', 1] if $sql =~ /\G $lparen /gcx;
return ['PAREN', -1] if $sql =~ /\G $rparen /gcx;
return ['TEXT', $1] if $sql =~ /\G ($text) /gcx;
redo LEXER if $sql =~ /\G \s+ /gcx;
}
};
}
my $lexer = lexer($sql);
while (defined (my $token = $lexer->())) {
# do something with the token
}
Without going into the detail of how that works, it's fair to say that this is not the best solution. By looking at the original post this came from, (http://perlmonks.org/index.pl?node_id=472701), we find that we need to make two passes through the data to extract what we want. Why this is the case is an exercise left for the reader.
To make this simpler, we're going to use the HOP::Lexer
module from the CPAN. This module, described by Mark Jason Dominus in his book "Higher Order Perl", makes creating lexers a rather trivial task and makes them a bit more powerful than what we have above. Here's our code:
use HOP::Lexer 'make_lexer';
my @sql = $sql;
my $lexer = make_lexer(
sub { shift @sql },
[ 'KEYWORD', qr/(?i:select|from|as)/ ],
[ 'COMMA', qr/,/ ],
[ 'OP', qr{[-=+*/]} ],
[ 'PAREN', qr/\(/, sub { [shift, 1] } ],
[ 'PAREN', qr/\)/, sub { [shift, -1] } ],
[ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text ],
[ 'SPACE', qr/\s*/, sub {} ],
);
sub text {
my ($label, $value) = @_;
$value =~ s/^["']//;
$value =~ s/["']$//;
return [ $label, $value ];
}
This certainly doesn't look any easier to read, but bear with me.
The make_lexer
subroutine takes as its first argument an iterator which returns the text to match every time it's called. In our case, we only have one snippet of text to match, so we merely shift it off of an array. If we were reading lines from a log file, the iterator would be quite handy.
After the first argument, we have a series of array references. Each reference takes two mandatory and one optional argument:
[ $label, $pattern, $optional_subroutine ]
The $label
will be used as the name of the token. The pattern should match whatever the label is identifying. The third argument, a subroutine reference, takes as arguments the label and the text the label matched and returns whatever you wish for a token. We'll get to that in a moment. First, let's consider how we typically use the make_lexer
subroutine.
[ 'KEYWORD', qr/(?i:select|from|as)/ ],
An example of how we might transform the data before making the token is as follows:
[ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text ],
As mentioned previously, our regular expression might be naive, but we'll leave that for now and focus on the &text
subroutine.
sub text {
my ($label, $value) = @_;
$value =~ s/^["']//;
$value =~ s/["']$//;
return [ $label, $value ];
}
This says "take the label and the value, strip leading and trailing quotes from the value and return them in an array reference".
To strip the whitespace, something we don't care about, we simply return nothing:
[ 'SPACE', qr/\s*/, sub {} ],
Now that we have our lexer, let's put it to work. Remember that we had decided that column aliases were the TEXT
not in parentheses but immediately prior to commas or the from
keyword. But how do we know if we're inside of parentheses? We're going to cheat a little bit:
[ 'PAREN', qr/\(/, sub { [shift, 1] } ],
[ 'PAREN', qr/\)/, sub { [shift, -1] } ],
With that, we can add a one whenever we get to an opening parenthesis and subtract it when we get to a closing one. Whenever the result is zero, we know that we're outside of parentheses.
We can get the tokens by repeatedly calling the $lexer
iterator.
while ( defined (my $token = $lexer->() ) { ... }
And the tokens would look like this:
[ 'KEYWORD', 'select' ]
[ 'TEXT', 'the_date' ]
[ 'KEYWORD', 'as' ]
[ 'TEXT', 'date' ]
[ 'COMMA', ',' ]
[ 'TEXT', 'round' ]
[ 'PAREN', 1 ]
[ 'TEXT', 'months_between' ]
[ 'PAREN', 1 ]
And so on ...
Here's how we process the tokens:
1: my $inside_parens = 0;
2: while ( defined (my $token = $lexer->()) ) {
3: my ($label, $value) = @$token;
4: $inside_parens += $value if 'PAREN' eq $label;
5: next if $inside_parens || 'TEXT' ne $label;
6: if (defined (my $next = $lexer->('peek'))) {
7: my ($next_label, $next_value) = @$next;
8: if ('COMMA' eq $next_label) {
9: print "$value\n";
10: }
11: elsif ('KEYWORD' eq $next_label && 'from' eq $next_value) {
12: print "$value\n";
13: last; # we're done
14: }
15: }
16: }
This is pretty straightforward, but there are some tricky bits. Each token is a two element array reference, so line three makes the label and value fairly explicit. Lines four and five use the "cheat" we mentioned for handling parentheses. Five also skips anything which isn't text and therefore cannot be a column alias.
Line six is a bit odd. In the HOP::Lexer
, passing the string peek
to the lexer will return the next token without actually advancing the $lexer
iterator. From there, it's straightforward logic to find out if the value we have is a column alias which matches our criteria.
Putting all of this together we have:
#!/usr/bin/perl
use strict;
use warnings;
use HOP::Lexer 'make_lexer';
my $sql = <<END_SQL;
select the_date as "date",
round(months_between(first_date,second_date),0) months_old
,product,extract(year from the_date) year
,case
when a=b then 'c'
else 'd'
end tough_one
from XXX
END_SQL
my @sql = $sql;
my $lexer = make_lexer(
sub { shift @sql },
[ 'KEYWORD', qr/(?i:select|from|as)/ ],
[ 'COMMA', qr/,/ ],
[ 'OP', qr{[-=+*/]} ],
[ 'PAREN', qr/\(/, sub { [shift, 1] } ],
[ 'PAREN', qr/\)/, sub { [shift, -1] } ],
[ 'TEXT', qr/(?:\w+|'\w+'|"\w+")/, \&text ],
[ 'SPACE', qr/\s*/, sub {} ],
);
sub text {
my ( $label, $value ) = @_;
$value =~ s/^["']//;
$value =~ s/["']$//;
return [ $label, $value ];
}
my $inside_parens = 0;
while ( defined ( my $token = $lexer->() ) ) {
my ( $label, $value ) = @$token;
$inside_parens += $value if 'PAREN' eq $label;
next if $inside_parens || 'TEXT' ne $label;
if ( defined ( my $next = $lexer->('peek') ) ) {
my ( $next_label, $next_value ) = @$next;
if ( 'COMMA' eq $next_label ) {
print "$value\n";
}
elsif ( 'KEYWORD' eq $next_label && 'from' eq $next_value ) {
print "$value\n";
last; # we're done
}
}
}
And that prints out the column aliases.
date
months_old
product
year
tough_one
So are we done? No, probably not. What we really need now are many other examples of the SQL generated in the first problem statement. Maybe the &text
subroutine is naive. Maybe there are other operators we forgot. Maybe there are floating point numbers embedded in the SQL. When we are forced to lex data by hand, fine-tuning the lexer to match your actual data can take a few tries.
It's also important to note that precedence is very important here. Each array reference passed to &make_lexer
is evaluated in the order it's passed. If we passed the 'TEXT' array reference before the 'KEYWORD' array reference, the 'TEXT' regular expression would match keywords before the 'KEYWORD' could match, thus generating spurious results.
Happy lexing!