Parrot Grammar Engine (PGE)

This is a regular expression/rules/grammar engine/parser designed to run in Parrot. It's still a work in progress, but has a lot of nice features in it, including support for perl 6 regexes, globs, shift-reduce parsing, and some support for perl 5 regular expressions. It also includes the "pgc.pir" grammar compiler, which can convert an entire grammar specification into the appropriate PIR code for execution.

A nice feature of PGE is that one can easily combine many different parsing styles into a single interface. PGE uses perl 6 rules for its top-down parsing, an operator precedence parser for bottom-up (shift/reduce) parsing, and allows control to pass freely between the two styles as well as to custom parsing subroutines.

Installation

PGE assumes that it is part of the parrot distribution in the compilers/pge directory. Simply type make in this directory to build the various *.pbc files and install them into runtime/parrot/library.

The distribution comes with a small demo.pir program that gives an example of using PGE. To run the demo, simply do parrot demo.pir. The demo understands the following commands:

rule pattern      - compile a Perl 6 rule from "pattern"
save name         - save the current rule as "name"
text              - a text string to match against previously entered rule
pir               - display the PIR code generated for current rule
exp               - display the expression tree for the current rule
trace             - toggle pattern execution tracing
next              - repeat last match on target string

PGE's rule engine (PGE::Perl6Regex)

Once PGE is compiled and installed, you generally load it using the load_bytecode operation, as in

load_bytecode 'PGE.pbc'

This imports the PGE::Perl6Regex compiler, which can be used to compile strings of Perl 6 regexes. A sample compile sequence would be:

.local pmc p6regex_compile
p6regex_compile = compreg 'PGE::Perl6Regex'         # get the compiler

.local string pattern
.local pmc rulesub
pattern = '^(From|Subject)\:'                  # pattern to compile
rulesub = p6regex_compile(pattern)             # compile it to rulesub

Then, to match a target string we simply call the subroutine to get back a PGE::Match object:

.local pmc match
$S0 = 'From: pmichaud@pobox.com'               # target string
match = rulesub($S0)                           # execute rule

The Match object is true if it successfully matched, and contains the strings and subpatterns that were matched as part of the capture. Parrot's "Data::Dumper" can be used to quickly view the results of the match:

  load_bytecode 'dumper.pir'
  load_bytecode 'PGE/Dumper.pir'

match_loop:
  unless match goto match_fail                   # if match fails stop
  print "match succeeded\n"
  _dumper(match)
  match.'next'()                                 # find the next match
  goto match_loop

match_fail:
  print "match failed\n"

One can also get the intermediate PIR code that PGE generates for the rule subroutine -- just use

$S0 = p6regex_compile(pattern, 'target'=>'PIR')

and you can print/inspect the contents of $S0 to see the generated code.

See the STATUS file for a list of implemented and yet-to-be-implemented features.

Known limitations of the rule engine

PGE doesn't (yet) properly handle nested repetitions of zero-length patterns in groups -- that's coming soon.

Many well-known optimizations (e.g., Boyer-Moore) aren't implemented yet, although a variety of optimizations are being added as we generate code.

Lastly, error handling needs to be improved, but this will likely be decided as we discover how PGE integrates with the rest of Parrot.

Implementation notes

Basically, PGE is a compiler just like any other, except that its "language" is the Perl 6 regex syntax and its output is a subroutine that can match strings. So, PGE consists of a series of parsers (for each pattern matching language), an intermediate expression format, and a code generator.

The parsers can be written using PIR subroutines or PGE's built-in operator precedence (shift/reduce) parser; the parser for Perl 6 regexes is built with the operator precedence parser. This parser produces a parse tree (in the form of a Match object) for a given Perl 6 regex. The parse tree then goes through semantic analysis and reduction phases before being sent to code generation to produce a PIR subroutine.

The generated PIR code uses bsr/ret for its internal backtracking (optimized for tailcalls) and uses Parrot calling conventions for all interfaces with external callers/callees such as subrules.

PGE also uses Parrot coroutines for the matching engine, so that after a successful match is found, the next match within the same string can be found by simply returning control to the matching coroutine, which then picks up from where it had previously left off until another match is discovered.

The code still needs a fair amount of commenting. In general, if you have a question about a particular section of code, send Pm an email and he'll write the comments for it.

AUTHOR

Patrick Michaud (pmichaud@pobox.com) is the author and maintainer. Patches and suggestions should be sent to the Perl 6 compiler list (perl6-compiler@perl.org).