NAME

Gnaw - Define parse grammars using perl subroutine calls. No intermediate grammar languages.

VERSION

Version 0.11

SYNOPSIS

Gnaw is a perl module which implements full regular expressions and full text parsing grammars using nothing but pure perl code limited to subroutine closures, exception trapping via eval, and basic perl variables such as scalars, hashes, and arrays.

You write your grammar in pure perl. There is no intermediate "parser language" that then gets interpreted into something executable.

When you do a "use Gnaw", the Gnaw module will import a number of functions directly into your namespace. Yes, this is completely bad form for normal modules. But this is not a normal module. The imported subroutines include regular expression and parsing equivalents for matching, quantifiers, literals, alternations, character classes, and so on. You build up your grammar by calling these functions. The final call will return a code reference. This code reference is your grammar.

When you dereference that grammar, if it is a "match" grammar (i.e. $string =~ m//) then you pass in the string you want to parse.

use Gnaw;

# create the grammar
my $grammar = match(lit('hello'));

# apply the grammar to a string
if($grammar->('hello world')) {
	print "match\n";
} else {
	print "no match";
}

Please note that this is a beta release. This is more of a proof of concept than something ready for production code or for massive grammars. The interfaces may change completely in the future. When the interfaces have settled, I will release this as a version 1.0+ module. Until then, please do not use this to develop some gigantic parser when the grammar may have to completely change.

EXPORT

Currently, EVERYTHING is exported into the callers namespace. Yes, I know, bad programmer. No biscuit.

The problem is I did not want grammars to be weighed down with a bunch of package prefixes at every step.

my $grammar = Gnaw::match(Gnaw::alternation(Gnaw::lit('hello'),Gnaw::lit('howdy'));

Gets rather tedious.

Hopefully, no one will have collisions with subroutine names. If they do, then it might make sense to create a new package, declare the grammar in there, then simply pull out the code reference and use that in the rest of your code.

Again, everything may change tomorrow as I sort this out.

FUNCTIONS

Most functions in Gnaw can be categorized as "inner" or "outer" functions.

The "inner" functions are functions that are used inside a grammar to define that grammar. This would include basic functions like "lit" and "set" and more advanced functions like "altenation" and "quanifier". These inner functions may or may not take any parameters to them ("thing" for example takes no parameters. "quantifier" takes several and some are optional.) The inner functions return code refs which are then used as part of a larger grammar. An "inner" grammar function may be an input to another "inner" function. For example, the "quantifier" function must recieve an inner function to which it will apply the quantifier. The "quantifier" will then return a coderef which can be used as part of a larger grammar or may be passed to an outer function.

The "outer" functions are functions that are the outer command of a grammar. They take one or more "inner" commands as input parameters, and they return a coderef which defines the whole grammar. This coderef can then be dereferenced to execute the grammar and apply it to a string. When the coderef from the outer function is dereferenced, it may take the string it is operating on as its input parameter. The "match" function is an example of this behaviour.

Inner Functions

The "inner" functions are functions that are used inside a grammar. They may recieve other "inner" functions and their return value may be used by other "inner" functions or used by an "outer" function.

series

The "series" function is a gnaw grammar component which takes a series of other grammar components. This is the only way to define a grammar with one component occurring after another. The "series" function takes a series of other grammar components and returns a coderef to that portion of the grammar.

The "series" function returns a coderef that is used in part of a larger grammar.

# look for a series of two literals, "hello" followed by "world"
my $grammar = match( series(lit('hello'), lit('world')) );

# apply the grammar to a string
if($grammar->('hello world')) {
	print "match\n";
} else {
	print "no match";	
}	

lit

The "lit" function is a gnaw grammar component which applies a literal string value to the string being parsed. The literal value may be a single character or more than one character.

The "lit" function returns a coderef that is used in part of a larger grammar.

# look for the literal string "hello" in the string being parsed
my $grammar = match(lit('hello'));

# apply the grammar to a string
if($grammar->('hello world')) {
	print "match\n";
} else {
	print "no match";
}

set

The "set" function is a gnaw grammar component which applies a character class or character set to the string being parsed. Since "class" is a perl reserved word, gnaw uses the word "set" for character set. The "set" function takes a string which describes the character class. The "set" function parses one metacharacter within the string and that is a '-' character. This is used to define a range of charcters. All digits can be described as "0-9". All letters can be described as "a-zA-Z". If you want the "-" character to be part of the set itself, make it the first character in the string you pass into set.

The "set" function returns a coderef that is used in part of a larger grammar.

# look for an x, y, or z within the string being parsed
my $grammar = match(set("xyz"));

# apply the grammar to a string
if($grammar->('hello world')) {
	print "match\n";
} else {
	print "no match";
}

SET

The "SET" function is a gnaw grammar component which applies a NEGATIVE character class or NEGATIVE character set to the string being parsed. The "SET" function takes a string which describes the character class. The "SET" function parses one metacharacter within the string and that is a '-' character. This is used to define a range of charcters. All digits can be described as "0-9". All letters can be described as "a-zA-Z". If you want the "-" character to be part of the set itself, make it the first character in the string you pass into set.

The "SET" function returns a coderef that is used in part of a larger grammar.

# look for anything other than an x, y, or z in the string being parsed.
my $grammar = match(SET("xyz"));

# apply the grammar to a string
if($grammar->('hello world')) {
	print "match\n";
} else {
	print "no match";
}

set_digit

The "set_digit" is a shortcut equivalent to set('0-9').

SET_DIGIT

The "SET_DIGIT" is a shortcut equivalent to SET('0-9'). i.e. [^0-9]

set_whitespace

The "set_whitespace" is a shortcut equivalent to set("\t\n\r\f").

SET_WHITESPACE

The "SET_WHITESPACE" is a shortcut equivalent to SET("\t\n\r\f"). i.e. [^\t\n\r\f]

set_identifier

The "set_identifier" is a shortcut equivalent to set('a-zA-Z0-9_').

SET_IDENTIFIER

The "SET_IDENTIFIER" is a shortcut equivalent to SET('a-zA-Z0-9_'). i.e. [^a-zA-Z0-9_]

thing

The "thing" function is a gnaw grammar component which matches any single character in the string being parsed. It is equivalent to the '.' operator in normal regular expression format. I would have called it "character" but that is a bit long and "char" is usually a reserved word.

my $grammar = match(lit('b'), thing, lit('b'));

# these will all match.
$grammar->("bob");
$grammar->("bib");
$grammar->("bub");

# this will fail to match.
$grammar->("bb");

The "thing" function returns a coderef that is used in part of a larger grammar.

alternation

The "alternation" function is a gnaw grammar component which applies one of several possible alternatives to the string being parsed. The "alternation" function will attempt each possible alternative in the order it is passed into the function as a parameter. Each alternative must be a single command.

# look for people we know
my $grammar = match(alternation(lit('alice'), lit('bob')));

# apply the grammar to a string
if($grammar->('hello alice')) {
	print "match\n";
} else {
	print "no match";
}

If an alternative needs to be made of more than one grammar command, either bundle them together using a "series" function, or create a named subroutine that will act as a named rule for your grammar, and call that subroutine as one of your alternates.

# one alternative will be a series of two literals, 'hello' followed by 'world'.
# create a subroutine that will contain this rule.
sub greet_all { series(lit('hello'), lit('world'));}

# another alternative will be a series of two literals, "howdy" followed by "partner"
sub greet_one { series(lit('howdy'), lit('partner'));}

# look for either greeting	
my $grammar = match(alternation(greet_all, greet_one));

# apply the grammar to a string
if($grammar->('hello world')) {
	print "match\n";
} else {
	print "no match";
}

The "alternation" function returns a coderef that is used in part of a larger grammar.

quantifier

The "quantifier" function is a gnaw grammar component which receives a single grammar component and attempts to apply that command repeatedly to the string being parsed. The parameters passed into the "quantifier" function are defined as follows:

quantifier( thrifty_or_greedy , single_grammar_component , minimum , maximum );

The thrifty_or_greedy parameter is a 't' or a 'g' to indicate whether the single command is applied as a thrifty or greedy quantifier. It is an optional parameter and may be skipped. If no thrifty_or_greedy parameter is provided, the quantifier will assume greedy.

The single_grammar_component is a single gnaw grammar component. If the quantifier is to be applied to more than one command in a series, either bundle those commands up in a series() function or bundle them up in a named subroutine that can act as a separate rule.

The minimum parameter is the minimum number of times the component must successfully apply to the string being parsed for the grammar to be considered successful. The smallest legal value for "minimum" is zero. You must provide a minimum value of some kind.

The maximum parameter is the maximum number of times the component must successfully apply to the string being parsed for the grammar to be considered successful. The maximum value is optional. if no maximum value is provided, the parser will try as many as possible.

The "quantifier" function also allows some shortcuts instead of the numeric "minimum" and "maximum" parameters.

'*' means "0 or more"

's' and '+' means "1 or more"

# look for 3 to 7 letter 'a'. Use a thrifty search.
my $grammar = match(quantifier('t', lit('a'), 3,7));

# look for 3 or more letter 'a'. still thrifty.
my $grammar = match(quantifier('t', lit('a'), 3));

# look for 3 or more letter 'a'. greedy search.
my $grammar = match(quantifier('g', lit('a'), 3));

# look for 3 or more letter 'a'. greedy search.
my $grammar = match(quantifier(lit('a'), 3));

# look for 1 or more letter 'a'. greedy search.
my $grammar = match(quantifier(lit('a'), 's'));

# look for 1 or more letter 'a'. greedy search.
my $grammar = match(quantifier(lit('a'), '+'));

# look for 0 or more letter 'a'. greedy search.
my $grammar = match(quantifier(lit('a'), '*'));

The "quantifier" function returns a coderef that is used in part of a larger grammar.

thrifty

The "thrifty" function is a shortcut to the "quantifier" function with the thrifty/greedy parameter forced to thrifty.

greedy

The "greedy" function is a shortcut to the "quantifier" function with the thrifty/greedy parameter forced to greedy.

some

The "some" function is a shortcut to a "quantifier" function set to greedy, and the quantity set to "1 or more".

any

The "any" function is a shortcut to a "quantifier" function set to greedy, and the quantity set to "0 or more".

something

The "something" function is a shortcut to a "quantifier" function set to greedy, the quantity set to "1 or more", and the command set to "thing". This is equivalent to the '.+' operator in the usual regular expression syntax.

anything

The "anything" function is a shortcut to a "quantifier" function set to greedy, the quantity set to "0 or more", and the command set to "thing". This is equivalent to the '.+' operator in the usual regular expression syntax.

callback

If you want to call a user-defined callback any time the parser hits a specific point of the grammar, simply insert a reference to a subroutine in that location in the grammar and it will be called every time the parser hits that location, whether the grammar ends up succeeding later or not.

# want between 7 and 9 letter 'a'. and every time we try, print the letter "X".
my $grammar = match(quantifier('t', series(lit('a'), sub{print"X\n";}) , 7,9));

# this will not match, but it will print out an "X" for every time 
# "quantifier" tried to match before the parser fails.
$grammar->('aaaaa');

The "callback" function takes a single code reference to any user-defined subroutine and calls that coderef only if the parser succeeds. Success is defined as either (1) reaching the end of the grammar and successfully matching the string being parsed or (2) the grammar executes a "commit" function, which is defined later.

Note that "callback" is a scheduled call and only makes the actual call when grammar succeeds.

# look for a series of two literals, "hello" followed by "world", 
my $grammar = match(greedy( series(lit('a'), callback(sub{print"X\n";})), 7,9) );

# this will fail to match and no callback will be called.
$grammar->('aaaaa');

# this will match and all the callbacks will be called at the end.
$grammar->('aaaaaaaaaaaaaaaa');

capture

The "capture" function is a specialized version of "callback". The user passes two parameters into "capture", (1) a grammar component and (2) a callback, a code reference to a user-defined subroutine which will be called when the grammar succeeds.

The callback will receive a copy of the string containing whatever was captured within the given grammar component. This will be passed via the @_ variable.

my $capture_callback = sub{
	my ($string) = @_;
	print "captured string is '$string'\n";
};


# capture the first "a" we find.
my $grammar = match(capture(lit('a'), $capture_callback)));

# this will not match, the capture callback will not be called.
$grammar->('xxxxx');

# this will match and the capture callback will be called upon success.
$grammar->('aaaaa');

Note that "capture" creates a scheduled call and only makes the actual call when the grammar succeeds.

consumable

The "consumable" function is useful for parsing large portions of text. The "consumable" function receives a single grammar component and whenever that component successfully parses, even though the full grammar has not finished, then "consumable" will remove any internal data structures related to the command. If there are any callbacks scheduled inside of that command, they will be executed before being deleted.

If you are parsing or matching a small block of text, then "consumable" might not be useful for your application.

If you have a hierarchical grammar and all the subrules from some level on down are wrapped as "consumable", then theoretically, that grammar could parse an infinite amount of text without running out of memory. As long as the consumable subrules match blocks of text that fit in memory, then the parser will operate on one consumable block, and then delete it from memory when that block matches.

Note that "consumable" acts as an immediately executed code reference rather than a scheduled callback in the sense that it gets called as soon as the single grammar command it contains succeeds.

If the grammar later fails and part of the reason was because that consumable portion did not actually match, then the parser will not be able to retry the grammar from prior to the consumable portion. That portion is gone.

When the single grammar command succeeds, the "consumable" function deletes the text currently matching that single grammar command and it deletes the portion of the call tree that corresponds to everything executed as part of that single grammar component.

my $captured=[];

my $callback = sub {
	my $text = shift(@_);
	push(@$captured, $text);
};

sub fruit {alternation(lit('apple'), lit('pear'), lit('peach'), lit('orange'))}
sub eatfruit {  consumable(capture( fruit , $callback)) }
$grammar = match( greedy(eatfruit, 's'));

ok( 1==$grammar->('peach apple pear orange potato'), "1.1 confirm match");

ok( ($captured->[0]) eq 'peach', "2.2, checking capture");
ok( ($captured->[1]) eq 'apple', "2.3, checking capture");
ok( ($captured->[2]) eq 'pear', "2.4, checking capture");
ok( ($captured->[3]) eq 'orange', "2.5, checking capture");

When this finishes, the text left in memory would be ' potato'. If you are parsing a few lines of text, you may want to avoid "consumable" for its added complexity with no other noticable functionality. The "consumbable" function becomes a neccessity when parsing large blocks of text. It allow the parser to delete blocks of text that match the sub-command portion of the grammar.

Remember, once text is consumed, it cannot be recovered.

The "consumable" function returns a coderef that is used in part of a larger grammar.

Outer Functions

Outer functions are functions that must be the outermost command in the grammar. They return a codereference which contains the entire grammar. When executed, they generally receive a single parameter that is the string being operated on.

match

This is equivalent to the "m" part of a perl regexp of $string\=~m//. The match function takes a grammar and attempts to find the first match within the string. If a match is found, the function returns true (1), else it returns false (0). The match function takes a series() of grammar components such as lit, set, quantifier, etc. The match function returns a coderef to the grammar. The "match" function should have no other grammar components outside of it. When calling the grammar, dereference the coderef returned by "match" and pass it the string you want to apply to the grammar.

# create the grammar
my $grammar = match(lit('hello'));

# apply the grammar to a string
if($grammar->('hello world')) {
	print "match\n";
} else {
	print "no match";
}

YET TO BE DEVELOPED

This is a list of some other functionality that has not yet been fully developed but is on my list.

The Gnaw module currently supports whitespace skipping. By default, skipping is turned on. Skipping can be disabled for the entire grammar. I am currently working on a clean interface to allow skipping to be turned on and off during the execution of a grammar. The problem is that Gnaw does a lot of "eval" calls and exception throwing in teh middle of subroutines. if an upper rule has skipping off, and a lower rule turns it on, but then throws an exception, not sure how to know what the "skip" setting should be set to when the exception is trapped. maybe make it a key/value pair that gets stored in the call tree.

The Gnaw module should be easily convertable to parsing a file rather than a string passed in as a parameter during a subroutine call. By rewriting the functions __gnaw__at_end_of_string and/or __gnaw__move_pointer_forward, the Gnaw module should be able to transparently detect when it has reached the end of its internal buffer and read more text from the input file, and have all this be invisible to the user. If the user grammar can be written such that the "consumable" function prevents text from accumulating in memory, the Gnaw module should be able to parse an unlimited amount of text from a file by operating on an finite piece of it in RAM at any given moment.

Likewise, the Gnaw module should also be fairly easily convertable to supporting a pre-processor. This would also be accomplished by rewriting the __gnaw__at_end_of_string and/or __gnaw__move_pointer_forward subroutines to take text from a file, and preprocess it for comments and for compiler directives, and place the final text into the linked list.

Hopefully by the time I reach rev. 1.0, some basic form of these functions will be implemented so that users can see how it is done and so they can create their own preprocesors and so on.

I need to implement a "replace" function. It would look a lot like "capture", in that it would take a grammar command and a callback. But during a match, it would call the callback, take whatever string it returned, and use that to replace the string it matched in the original text.

I also need to implement some sort of text dumper subroutine. This would be called when the "consumable" function deletes text. Before it gets deleted, the consumable function would pass it to some subroutine, which could either ignore it and let it be deleted permanently from memory, or perhaps it writes the result in some file. This would then allow the grammar to parse an infinite amount of text and modify portions of it.

I also need some variations on the outer function "match", such as "parse" which would only parse from the beginning of the string. A "matchmany" function would go through the string and match as many times as it could, sort of like a "g" operator in perl regular expressions.

Case insensitivity is currently not supported. not sure how important that is. Maybe add it as an option that can be passed in to the "literal" function as a second parameter. not sure how to apply it globally.

AUTHOR

Greg London, <email at greglondon.com>

BUGS

Please note this is a beta release. Do not use this for production code. This is a proof-of-concept piece of code, and the actual interface is still completely up in the air. Do not create any massively long grammars with this parser as the parser rules themselves may change completely.

If you do find bugs, please be kind.

Please report any bugs or feature requests to bug-gnaw at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Gnaw. I will be notified, and then you will automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Gnaw

You can also look for information at:

ACKNOWLEDGEMENTS

COPYRIGHT & LICENSE

Copyright 2008 Greg London, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.