The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Mop - Recursive descent parser generator (Alpha 1.06).

SYNOPSIS

            use Rule;
            use Lex;
            @tokens = (
               'addop' => '[-+]', 
               'leftp' => '[(]',
               'rightp' => '[)]',
               'integer' => '[1-9][0-9]*',
              );
            $reader = Lex->new(@tokens);


            $EXPR = And->new(\($FACTOR, Any->new(\$addop, \$FACTOR)),
                    sub { 
                      shift(@_);
                      my $result = shift(@_);
                      my ($op, $integer);
                      while ($#_ >= 0) {
                        ($op, $integer) = (shift(@_), shift(@_));
                        if ($op eq '+')  {
                          $result += $integer;
                        } else {
                          $result -= $integer;
                        }
                      }
                      $result;
                    });
            $FACTOR = Or->new(\$integer, \$PAREXP);
            $PAREXP  = And->new(\$leftp, \$EXPR, \$rightp),
                sub { $_[2] });


            print "result: ", $EXPR->next, "\n";

DESCRIPTION

Creating parsers by hand is tedious even for simple languages. This activity can be automated by parser-generators - yacc is a well-known example. But using such tools is quite demanding and requires a reasonable knowlege of the principles of syntactic analysis.

We offer here a set of Perl5 modules which allow the generation of recursive descent parsers for context-free grammars. Mop has three basic packages: Rule, Lex and Token which are object-based. The use of these packages presupposes that you know how to write a BNF grammar and that you know (just a little) about programming in Perl.

Specifying the parser does not require any extension to Perl syntax. The specification is carried out entirely in standard Perl, be it definition of tokens, syntactic rules or associated semantic actions. Mop allows the easy specification of translation schemes, that is parsers for which the semantic action is given by actions directly associated with each production.

The packages Token and Rule allow respectively the definition of objects corresponding to terminals (tokens) and non-terminals of the grammar. Lex handles the reading and "eating" of tokens in the input stream.

Before using these packages you need to define a BNF grammar without left recursion (an LL(1) grammar). Given this, making the parser consists in:

1. create a lexical analyser by specifying the terminals,

2. create a parser (syntactic analyser) by creating a Rule object (or, more precisely, one of the packages which inherits from Rule) for each non-terminal.

3. define the semantics by associating an anonymous function with each Rule object.

Take as an example arithmetic expressions having only + and - as operators. In the Camel book we find the following grammar:

            EXPR ::= FACTOR { ADDOP FACTOR }
            ADDOP ::= '+' | '-'
            FACTOR ::= NUMBER | '(' EXPR ')'

Creating the parser for this language involves defining a lexical analyser and a syntactic analyser.

The lexical analyser is defined thusly:

            @tokens = (
               'addop' => '[-+]', 
               'leftp' => '[(]',
               'rightp' => '[)]',
               'integer' => '[1-9][0-9]*',
              );
            $reader = Lex->new(@tokens);

The argument of the method new() is a list of pairs: the identity of the terminal and the corresponding regular expression. Each such pair leads to the creation of a terminal of type Token.

The package Rule is the base package of a set : And, Any, Do, Hook, Opt, Or. These packages allow the creation of the different types of rules normally found in context-free grammars. We use a prefix notation with the following equivalences.

      A | B     Or->new(\$A, \$B)    symbol A or symbol B


      A B       And->new(\$A, \$B)   symbol A followed by symbol B


      { A }     Any->new(\$A)        arbitrary number of  A


      [ A ]     Opt->new(\$A)        zero or one occurrence of A

The rules in our example are given by creating the following objects:

            $EXPR = And->new(\($FACTOR, Any->new(\$ADDOP, \$FACTOR));
            $FACTOR = Or->new(\$NUMBER, \$PAREXP);
            $PAREXP  = And->new(\$LEFTP, \$EXPR, \$RIGHTP);

The arguments of the method new() are references to Rule or Token objects. (The written order of the rule has no significance, since scalars can be references before they are assigned a value. These references are resolved when each object is used. As the example shows, references can be obtained to the objects returned by a rule.)

The semantics is defined by putting an anonymous function at the end of the list of object references. The anonymous function uses information associated with the objects. This information is transmitted by positional parameters (the array @_). The nth argument designates the result of the nth parameter of the method new(). Information returned by the function is associated with the object and is transmitted by means of positional parameters wherever the object is used. In our example we will have:

            $EXPR = And->new(\($FACTOR, Any->new(\$addop, \$FACTOR)),
                    sub { 
                      shift;
                      my $result = shift;
                      my ($op, $integer);
                      while ($#_ >= 0) {
                        ($op, $integer) = (shift, shift);
                        if ($op eq '+')  {
                          $result += $integer;
                        } else {
                          $result -= $integer;
                        }
                      }
                      $result;
                    });
            $FACTOR = Or->new(\$integer, \$PAREXP);
            $PAREXP  = And->new(\$leftp, \$EXPR, \$rightp),
                sub { $_[2] });


            print "result: ", $EXPR->next, "\n";

When an integer is recognised it is returned by the anonymous function associated with the object $FACTOR. This returned information (or, more precisely, synthesised, since it comes from a a terminal and is transmitted to non-terminals) is also available in the anonymous function associated with the object $EXPR. The information returned by the following object is used to calculate the value of the arithmetical expression.

The analyser is started up by applying the method next() to the axiom of the grammar:

            $EXPR->next;

By default the input for analysis is read from the standard input. The example parser analyses and interprets individual expressions typed in at the terminal. The example calculator.pl delivered with the Rule package shows how to create an input loop allowing reading and interpretation of arbitrary many expressions.

The parser generator can be used for other purposes than the analysis of a character stream. Given that the packages Lex, Rule and Token, it is perfectly possible to define terminals which are objects i.e. instances of a class other than Token. Each new package defining terminals ought at least to have the methods status() and next() - see vonkoch.pl as an example.

PACKAGE RULE

The objects which represent grammar rules are composite objects. The are groupings of Token objects (terminals) and one of the six following non-terminal types: And, Do, Any, Hook, Opt et Or.

Formally, a context-free grammar can be seen as a graph whose nodes are non-terminals and whose leaves are terminals. The semantics is given by associating functions with these nodes: one of the functions is associated with before the sub-nodes are examined, the other afterwards - assuming that the examination is successful.

A parser uses synthesised and inherited information. The first is passed up the graph from the termials to the non-terminals; the second descends from non-terminals to terminals. The function attached to graph nodes can modify this information.

Henceforth the status of the object refers to whether the associated exploration has succeeded or not. As an example, the status of a node Or is true if at least on of its component subnodes has status true.

Attributes and Anonymous Functions

Information can be transmitted and modified during the graph traversal. This information is of two types: inherited attributes and synthesised attributes. A synthesised or inherited attribute can be any Perl data structure.

These attributes can be modified by the function which is executed when the node is reached. Inherited attributes are available in the argument array @_ as well as @_h (note that $_[0] contains the object created by the rule).

The functions defining the rule semantics access synthesised and inherited attributes. Synthesised attributes are available in @_ as well as @_s.

Synthesised attributes are returned by the method next(). This method returns whatever the semantic function for the graph node returns. In the absence of a semantics the synthesised attributes are returned untouched.

Objects attached to a node of type And or of type Any (brother nodes) transmit synthesised information from left to right.

The object corresponding to a rule is the first argument of the anonymous functiona associated with this rule.

Types of Objects

And

And defines an object composed of a sequence of objects (terminals and/or non-terminals). An object of type And has status true if all its component objects are themselves true.

Any

Any takes as argument a list of objects. This list is traversed so long as the examination of each oject (terinal or non-terminal) succeeds. An object of type Any always has status true.

Do

Define an action anywhere in a parser production. Useful for error handling.

Hook

Hook Attach anonymous functions to an object. The first argument must be an object reference, the second the semantic function executed if the object status is true. The third argument is an anonymous function which will always be executed before the examination of the object.

Opt

Opt takes a list of objects (terminals or non-terminals). The list is inspected onces. If all the list objects are true, the semantic action is carried out. An object of type Opt is always true.

Or

Or allows alternatives. The first arguemnt is a list of objects (terminals and/or non-terminals). An object of type Or is true if at least one of its component objects is true.

Methods

new()

All the objects of the types mentionned are created by the method new(). This method takes a list of object references, possibly followed by one or two anonymous functions. The first defines the semantics, the second handles inherited attributes.

next()

A rule is activated by the method next() (which can be viewed as the graph exploration engine). It returns the result from the rules's semantic function. It passes up information from rule to rule i.e. from the terminals to the axiom passing by the non-terminals. Thus it synthesises attributes.

status()

Indicates whether the last object search has succeeded or failed.

probe()

Allows rudimentary tracing of the function associated with an object. So, for example, one can write:

            $EXPR = And->new(...,
                             sub { 
                                  $self = shift;
                                  $self->probe("EXPR @_");
                             });

PACKAGE LEX

The package Lex allows the definition of lexical analysers. It handles reading and eating of the data.The method from() allows you to specify an input filehandle.

The lexical analyser recognises tokens defined by regular expressions given as a parameter to the method new(). These regexs are examined in the order in which they are given in the parameter. Regex search is anchored to the beginning of the buffer.

Methods

debug()

Activate/disactivate a trace indicating which tokens have been eaten.

chomp()

Active/disactivate the removal of the newline character for each input line.

eof()

Return true if the end of file is encountered.

from()

Indicate the data source, The argument is either a string or a reference to a filehandle. For example:

            $symbol->from(\*DATA);

or

            $symbol->from('les données à analyser');
less()

The argument is an expression whose value is put at the start of the data stream.

new()

Create a new anlayser. The argument is a list of triples consisting of: the symbolique name of the token, the regular expression for its recognition and possibly an anonymous function executed when the token is recognised. new() creates an object of type Token for each triple.

reset()

Empty Lex's internal buffer.

buffer()
buffer(EXPR)

Return the contents of Lex's internal buffer. With an expression as argument, put the result of the expression in the buffer.

singleline()

If active read only a single line.

skip(RE)

Define the lexeme separator (default: [ \t]+).

readline()

Read data from the specified input (see method from()). Return the result of the read.

token()

Return the object corresponding to the last token consumed. In the absence of a such, return a special token whose symbolic name is default token.

PACKAGE TOKEN

The package Token allows the definition of tokens used by the lexical analyser. Objects of this type are created by the method new() of the package Lex.

Methods

debug()

Activate/disactivate a trace showing which tokens have been found.

mean()

Return the anonymous function associate with the object Token.

name()

Return the symbolic name of the object.

next()

Read, consume and return the token defined by the regular expression in the object.

new()

Create an object of type Token. The arguments of new() are ordered: a symbolic name, a regular expression, and (optionally) an anonymous function. The anonymous function is executed when the token is consumed by the lexical analyser. The output of this function defines the string of characters memorised in the object and accessible by the method get().

regexp()

Return the regular expression used for token recognition.

status()

Indicate is the last token search has succeeded or not.

ERROR HANDLING

To handle cases where tokens are not recognised you can define a specific Token object e.g.

            $ERROR = Token->new('.*');

If search for this token succeeds it is then possible to call an error function.

For syntactically incorrect expression you can use a object of type Do (see arithm3.pl).

EXAMPLES

Using grammar-based programs allows the separation of the description of a structure and the description of the function of this structure. This enhances modularity, clarity and evolutivity. The following supplied examples attempt to indicate this claim:

arithm1.pl - Interpreter for arithmetic expression consisting of integer additions and subtractions. arithm1.pl is used from the terminal. It returns the input expression and then halts. There is no error reporting.

arithm2.pl - Interpreter for arithmetic expressions having the 4 standard operators for reals.

arithm3.pl - Improved version of "arithm2.pl". Includes read loop and error messages.

calculator.pl - super-simple calculator for addition and subtraction. If you type a number followed by ENTER, it is printed out with a preceeding equal sign. The numbers which you now type are added or subtracted to this number, depending on whether they are positive or negative. Reinitialisation occurs when you type = number or = alone. This example shows how you can supply the parser with a user interaction loop. (The example is bases on Mason and Brown in Lex & Yacc) tokenizer.pl - Shows tokenisation using the package Lex.

vonkoch.pl - For those having access to Tkperl. Draws out a von Koch curve.

OPIMITISATIONS

No doubt lots are possible.

Productions are allowed in a rule Or having a shared prefix. This possibility introduces an indeterminism which has an associated cost. The generator can be optimised by removing this indeterminism (by suppressing all the methods defined in the package BufferList)

EXTENSIONS

Note that Mop is an alpha version and thus subjet to change.

Many extensions are possible. Use of Mop in different contexts should help us to find interesting extensions.

AUTEURS

Philippe Verdret, 1995

BUGS

REFERENCES

Groc, B., & Bouhier, M. - Programmation par la syntaxe. Dunod 1990.

Mason, T & Brown, D. - Lex & Yacc. O'Reilly & Associates, Inc. 1990.

9 POD Errors

The following errors were encountered while parsing the POD:

Around line 252:

'=item' outside of any '=over'

Around line 290:

You forgot a '=back' before '=head2'

Around line 293:

'=item' outside of any '=over'

Around line 329:

You forgot a '=back' before '=head1'

Around line 345:

'=item' outside of any '=over'

Around line 371:

Non-ASCII character seen before =encoding in 'données'. Assuming CP1252

Around line 425:

You forgot a '=back' before '=head1'

Around line 433:

'=item' outside of any '=over'

Around line 470:

You forgot a '=back' before '=head1'