NAME

MarpaX::G4 - Convert Antlr4 grammars to Marpa::R2 syntax.

Rationale

I observed that Marpa::R2 is quite a bit faster than Parse::RecDescent. The lack of anonymous quantified subrules however makes non-trivial grammars quite verbose since complex rules have to be manually split into the required subrules. For my use case (detailed in section 'Real life example : Translating a PL/SQL grammar') that was not feasible.

I decided to bootstrap an Oracle Marpa grammar from an Antlr4 grammar available on github by using Marpa to translate the Antlr4 grammar to Marpa::R2 syntax.

(I was working in an airgapped environment that provided Perl but not the option to install the antlr4 package in Python)

Synopsis

use File::Basename;
use Getopt::Std;
use MarpaX::G4;

use vars qw($scriptName);

BEGIN { $scriptName = basename($0); }

my $options = {};

die 'Invalid option(s) given'            if !getopts( $MarpaX::G4::optstr, $options );
MarpaX::G4::printHelpScreen($scriptName) if exists $options->{h};

my $grammartext =<<'INPUT';
grammar Exp;

eval        :    additionExp
            ;
additionExp :    multiplyExp ( '+' multiplyExp | '-' multiplyExp )*
            ;
multiplyExp :    atomExp ( '*' atomExp | '/' atomExp )*
            ;
atomExp     :    Number
            |    '(' additionExp ')'
            ;
Number      :    ('0'..'9')+ ('.' ('0'..'9')+)?
            ;
/* We're going to ignore all white space characters */
WS          :   (' ' | '\t' | '\r'| '\n')+ ->HIDDEN
            ;
INPUT

my $translator = MarpaX::G4->new();
$translator->translatestring( $grammartext, $options );

exit 0

with the option -c, -f and -k

lexeme default = latm => 1

:start          ::= eval

# ---
# Discard rule from redirect options :
:discard        ~   <discarded redirects>
<discarded redirects> ~   WS
# ---
eval            ::= additionExp

additionExp     ::= multiplyExp additionExp_002
additionExp_001 ::= '+' multiplyExp
                |   '-' multiplyExp
additionExp_002 ::= additionExp_001*

multiplyExp     ::= atomExp multiplyExp_002
multiplyExp_001 ::= '*' atomExp
                |   '/' atomExp
multiplyExp_002 ::= multiplyExp_001*

atomExp         ::= Number
                |   '(' additionExp ')'

Number          ::= Number_002 opt_Number_006
Number_001      ~   [0-9]
Number_002      ::= Number_001+
Number_003      ~   [0-9]
Number_004      ::= Number_003+
Number_005      ::= '.' Number_004
opt_Number_006  ::=
opt_Number_006  ::= Number_005

WS              ::= WS_002
WS_001          ~   [ \t\r\n]
WS_002          ::= WS_001+

Description

Overview

MarpaX::G4 translates an Antlr4 grammar (possibly spread over multiple files) to Marpa::R2 and writes it to a single output file.

This document provides a short tutorial on the usage of MarpaX::G4 and the potential pitfalls while using it.

Implementation

MarpaX::G4 is split into 3 packages :

  • Parser creates a Marpa::R2 parse tree from the antlr4 grammar file(s)

  • Symboltable manages the rules from the parse tree

  • MarpaGen generates a Marpa::R2 grammar from the rules

Usage

Calling MarpaX::G4::printHelpScreen with the script name returns the help screen :

parseg4.pl - Antlr4 to Marpa::R2 converter

Usage: parseg4.pl [-cdefghikprtuv] [-s <startsymbol>] [-o <outputfile>] <file1>[ <file2> ...]

-c                  strip all comments and actions (except inline actions)
-d                  dump the parse tree
-e                  embed the g4 inline actions into the marpa grammar
                    (default : prefix as comments ahead of the rule)
-f                  convert fragments to classes where applicable
-g                  convert lazy to greedy quantifiers
                    (CAVEAT: this might change the grammar semantics)
-h                  print this help and exit
-i                  ignore redirects (don't discard redirected rules)
-k                  build case-insensitive keywords from single-letter fragments
-o <outputfile>     specify the output file. default is stdout
-p                  strip inline comments and actions
-r                  verify the consistency of the symbol table
-s <startsymbol>    specify the start rule of the grammar
                    (default: 1st rule of the 1st input file)
-t                  trace the grammar generation
-u                  make literals and classes case-insensitive
-v                  dump the symbol table

Detailed Description of options

-c strip all comments and actions (except inline actions)

This input with standalone and embedded comments

/* This is a sample grammar */

A  : B C;
B  : 'Hello' # Symbol C contributes 'world!'
   ;
C  : 'World !'
   ;

gives this result :

lexeme default = latm => 1

:start ::= A

A ::= B C
B ~   'Hello'
C ~   'World !'

-d dumps the parse tree created by the parser to stdout.

MarpaX::G4 dumps the parse tree returned by the parser to stdout.

-e embed the g4 inline actions into the marpa grammar

If you are using MarpaX::G4 to translate a pseudo-antlr4 grammar where you already want to embed actions from a Marpa::R2 action package, you can use option -e to copy over the actions from source to destination. This example

A  : B C;

B  : 'Hello'    { do_action_b }
   ;
C  : 'World !'  { do_action_c }
   ;

gives this result :

lexeme default = latm => 1

:start ::= A

A ::= B C
B ~   'Hello'                            action =>  do_action_b
C ~   'World !'                          action =>  do_action_c

The actions have to be defined in a package that is added to the grammar with an 'action_object => '<action package' clause, where <action package> is the name of the package with the action sub's.

-f convert fragments to classes where applicable

Option -f can simplify certain grammars.

This example

A   : THIRTYTWOBITADDRESS+;

THIRTYTWOBITADDRESS     : HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT;
fragment HEX_DIGIT      : HEX_LETTER | DECIMAL_DIGIT;
fragment HEX_LETTER     : 'A' | 'B' | 'C' | 'E' | 'F';
fragment DECIMAL_DIGIT  : ('0'..'9');

gives this result :

lexeme default = latm => 1

:start              ::= A

A                   ::= A_001
A_001               ::= THIRTYTWOBITADDRESS+

THIRTYTWOBITADDRESS ::= HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
HEX_DIGIT           ~   HEX_DIGIT_001
HEX_DIGIT_001       ~   HEX_LETTER
                    |   DECIMAL_DIGIT

HEX_LETTER          ~   HEX_LETTER_001
HEX_LETTER_001      ~   'A'
                    |   'B'
                    |   'C'
                    |   'E'
                    |   'F'

DECIMAL_DIGIT       ~   DECIMAL_DIGIT_001
DECIMAL_DIGIT_001   ~   [0-9]

With option -f this is simplified to

lexeme default = latm => 1

:start            ::= A

A                 ::= A_001
A_001             ::= SIXTEENBITADDRESS+

SIXTEENBITADDRESS ::= HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
HEX_DIGIT         ~   HEX_DIGIT_001
HEX_DIGIT_001     ~   HEX_LETTER
                  |   DECIMAL_DIGIT

HEX_LETTER        ~   [ABCEF]

DECIMAL_DIGIT     ~   [0-9]

Yes, i'm aware that the 2 trailing classes can be merged into a single class, but i wanted to retain the original rules so i did not consider optimizations that would remove one of them.

-g convert lazy to greedy quantifiers

This input with a lazy quantifier in rule 'C'

A     : C;

B     : 'Hello'
      ;
C     : (B C)+?
      ;
D     : 'World !'
      ;

gives this result :

lexeme default = latm => 1

:start ::= A

A     ::= C
B     ~   'Hello'
C     ::= C_002
C_001 ::= B D
C_002 ::= C_001+

D     ~   'World !'

Without option -g the input would be rejected with the error message

lazy quantifier +? in rule C not supported by Marpa at MarpaX/G4/MarpaGen.pm line 86.

CAVEAT: transforming lazy into greedy quantifiers can make the generated Marpa::R2 grammar incompatible with the original Antlr4 grammar. If you run into problems, you'll have to manually tweak either the original or the generated grammar.

-h print the help screen and exit

-i ignore redirects (don't discard redirected rules)

This input with different redirects

A       : B C;

B       : 'Hello'   -> channel(DIRECTIVE)
        ;
C       : 'World !'
        ;

WS      : [\s]+      -> hidden
        ;
COMMENT : '#' [^\n]* -> skip
        ;

gives this result :

lexeme default = latm => 1

:start      ::= A

# ---
# Discard rule from redirect options :
:discard    ~   <discarded redirects>
<discarded redirects> ~   COMMENT
            |   WS
# ---
A           ::= B C
B           ~   'Hello'
C           ~   'World !'
WS          ::= [\s]+
COMMENT     ::= '#' COMMENT_001
COMMENT_001 ~   [^\n]*

Redirects to 'skip' and 'hidden' are mapped to the ':discard' directive. With option -i other redirects are ignored (i.e. the rules tagged with the redirects are added to the generated grammar). Without option -i they are also mapped to the ':discard' directive.

CAVEAT: Depending on the input grammar, MarpaX::G4 might map the redirected rules to G1 instead of L0 rules. You'll have to manually replace the '::=' operator with '~'. If you miss this, the grammar will not work as expected.

-k build case-insensitive keywords from single-letter fragments

In some grammars keywords are made case-insensitive by splitting them into terminals consisting of lower/upper case versions of a letter. The Marpa::R2 version of such a grammar could use the Marpa::R2 :i literal suffix. This example grammar

A       : B C;

B       : H E L L O
        ;
C       : W O R L D '!'
        ;

fragment D : ('d'|'D');
fragment E : ('e'|'E');
fragment H : ('h'|'H');
fragment L : ('l'|'L');
fragment O : ('o'|'O');
fragment R : ('r'|'R');
fragment W : ('w'|'W');

gives this result :

lexeme default = latm => 1

:start ::= A

A     ::= B C
B     ::= 'hello':i
C     ::= 'world!':i

-p strip inline comments and actions

If you want to discard the original antlr4 actions (and potentially add actions to the generated grammar) use option -p to discard the antlr4 actions.

A  : B C;

B  : 'Hello'    { do_action_b }
   ;
C  : 'World !'  { do_action_c }
   ;

gives this result :

lexeme default = latm => 1

:start ::= A

A ::= B C
B ~   'Hello'
C ~   'World !'

CAVEAT: The MarpaX::G4 rule for embedded actions is quite simplistic. If an embedded action uses features like curly braces embedded in literals or nested blocks, this rule will fail. In that case you have 2 options :

  • Preprocess the input grammar and remove the offending action

  • Tweak the MarpaX::G4::Parser '<Inline Action>' rule to successfully recognize this action.

-r verify the consistency of the symbol table

Option -r generates a report at the end of processing that lists the grammar rules in 2 sections : composite and basic rules.

A  : B C;

B  : 'Hello'    { do_action_b }
   ;
C  : 'World !'  { do_action_c }
   ;

gives this result :

===
=== Composite Rules
===

    +-------------------------------------------------------- rule name
 +--!-------------------------------------------------------- Fragment (F) or regular rule
 !  !                                              +--------- redirected (->) or contributing rule
 !  !                                              !   +----- number of rule references
 !  !                                              !   !   +- list of rule references
 !  !                                              !   !   !
 V  V                                              V   V   V
[ ][A                                            ][  ][ 2] B                C

===
=== Basic Rules
===

    +-------------------------------------------------------- rule name
 +--!-------------------------------------------------------- Fragment (F) or regular rule
 !  !                                              +--------- redirected (->) or contributing rule
 !  !                                              !   +----- n/a
 !  !                                              !   !
 V  V                                              V   V
[ ][B                                            ][  ][  ]
[ ][C                                            ][  ][  ]

lexeme default = latm => 1

:start ::= A

A ::= B C
#  do_action_b
B ~   'Hello'
#  do_action_c
C ~   'World !'

-s [startsymbol] specify the start rule of the grammar

MarpaX::G4::MarpaGen creates the output grammar by computing the convex hull of the start symbol. It does that by collecting the referenced rules of the start symbol, then collecting the referenced rules of these and so on until we only have basic rules left and the process terminates.

By default, the start symbol is the first symbol from the first input file. Option -s overrides this default. You may want to do that for 2 reasons :

  • The start symbol of the grammar is NOT the first symbol in the first input file.

  • You are interested in a subset of the original grammar (e.g. you are interested in a grammar for CREATE TABLE but not all statements supported by Oracle)

Processing the grammar below with option -s B

A  : B C;

B  : 'Hello'
   ;
C  : 'World !'
   ;

gives this result :

WARNING: the rules listed below are orphaned. They can't be reached from the start rule:
=======:

       +----------------------------- rule name
    +--!----------------------------- Lexical (L) or parser rule
 +--!--!----------------------------- Fragment (F) or regular rule
 !  !  !
 V  V  V
[ ][ ] A
[ ][ ] C


lexeme default = latm => 1

:start ::= B

#  do_action_b
B ~   'Hello'

-t trace the grammar generation

Option -t enables a trace mode for the actions embedded in the grammar in MarpaX::G4::Parser. This is quite useful in debugging scenarios. In regular operations you can ignore this option.

-u make literals and classes case-insensitive

-u is useful for generating grammars where lexemes should be recognized irrespective of case. Processing the grammar below

A   : BINDVAR+;

BINDVAR
    : ':Prefix:' SIMPLE_LETTER  (SIMPLE_LETTER | SINGLE_DIGIT | '_')*
    ;

fragment SIMPLE_LETTER  : [A-Z];
fragment SINGLE_DIGIT   : ('0'..'9');

gives this result :

lexeme default = latm => 1

:start           ::= A

A                ::= A_001
A_001            ::= BINDVAR+

BINDVAR          ::= ':prefix:':i SIMPLE_LETTER BINDVAR_002
BINDVAR_001      ::= SIMPLE_LETTER
                 |   SINGLE_DIGIT
                 |   '_'
BINDVAR_002      ::= BINDVAR_001*

SIMPLE_LETTER    ~   [A-Z]:ic
SINGLE_DIGIT     ~   SINGLE_DIGIT_001
SINGLE_DIGIT_001 ~   [0-9]

-v dump the symbol table

Option -v dumps the symbol table created by the rules imported from the parse trees of the input files. This is useful for debugging, not much else.

Real life example : Translating a PL/SQL grammar

I wanted to use PlSqlLexer.g4 and PlSqlParser.g4 from https://github.com/antlr/grammars-v4/tree/master/sql/plsql.

Executing the translator with a number of command line options returned the Marpa grammar in the output file plsql.mx.

parseg4_from_file.pl -u -p -s sql_script -o plsql.mx ./grammars-v4-master/sql/plsql/PlSqlParser.g4 ./grammars-v4-master/sql/plsql/PlSqlLexer.g4

Applying this grammar to an Oracle package definition

testmarpagrammar_from_file.pl -g ./plsql.mx ./data/package_definition.sql

returned annoying errors when Marpa refused to accept the rules

UNSIGNED_INTEGER
DELIMITED_ID

In order to overcome these errors, i had to manually tweak the generated grammar as per the 'diff' output below :

10616c10616,10620
< UNSIGNED_INTEGER               ~   [0-9]+
---
>
> UNSIGNED_INTEGER               ~   MULTIPLE_DIGITS
> MULTIPLE_DIGITS                ~   SINGLE_DIGIT+
> SINGLE_DIGIT                   ~   [0-9]
>
10695,10696c10699,10700
<                                |   ':' DELIMITED_ID
<                                |   ':' UNSIGNED_INTEGER
---
>                                |   ':"' DELIMITED_ID_003 '"'
>                                |   ':' MULTIPLE_DIGITS
10699c10703
<                                |   [0-9]
---
>                                |   SINGLE_DIGIT
10711c10715
< SINGLE_LINE_COMMENT            ~   '--' SINGLE_LINE_COMMENT_002 NEWLINE_EOF
---
> SINGLE_LINE_COMMENT            ~   '--' SINGLE_LINE_COMMENT_002
10717c10721
< REMARK_COMMENT                 ~   'rem':i opt_REMARK_COMMENT_001 opt_REMARK_COMMENT_005 NEWLINE_EOF
---
> REMARK_COMMENT                 ~   'rem':i opt_REMARK_COMMENT_001 opt_REMARK_COMMENT_005
10725c10729
< PROMPT_MESSAGE                 ~   'pro':i opt_PROMPT_MESSAGE_001 opt_PROMPT_MESSAGE_005 NEWLINE_EOF
---
> PROMPT_MESSAGE                 ~   'pro':i opt_PROMPT_MESSAGE_001 opt_PROMPT_MESSAGE_005
10733c10737
< START_CMD                      ~   '@' opt_START_CMD_001 START_CMD_003 NEWLINE_EOF
---
> START_CMD                      ~   '@' opt_START_CMD_001 START_CMD_003
10743c10747
<                                |   [0-9]
---
>                                |   SINGLE_DIGIT
10747c10751
< NEWLINE_EOF                    ~   NEWLINE
---
>
10751c10755
< FLOAT_FRAGMENT_001             ~   UNSIGNED_INTEGER*
---
> FLOAT_FRAGMENT_001             ~   SINGLE_DIGIT*
10754c10758
< FLOAT_FRAGMENT_003             ~   UNSIGNED_INTEGER+
---
> FLOAT_FRAGMENT_003             ~   MULTIPLE_DIGITS

The changes to the comment definitions are apparently due to the newline characters being processed by the :discard rule, so the NEWLINE_EOF parts are redundant. The changes to FLOAT_FRAGMENT are bug fixes to the original grammar since adding quantifiers on top of the original UNSIGNED_INTEGER makes no sense.

Why Marpa would refuse to accept the UNSIGNED_INTEGER and DELIMITED_ID symbols in the original grammar is completely unclear to me. The changes i applied made Marpa accept the grammar but they are not intuitive.

Driver scripts for MarpaX::G4

The scripts

  • parseg4_from_file.pl

  • testmarpagrammar_from_file.pl

are driver scripts for MarpaX::G4. parseg4_from_file.pl reads an antlr4 grammar from an (or multiple) input file(s) and writes a Marpa::R2 grammar to output. testmarpagrammar_from_file.pl reads a Marpa::R2 grammar from an input file, applies it to test data from a 2nd input file and writes the parse tree to output.

The versions

  • parseg4.pl

  • testmarpagrammar.pl

take their input from HERE documents, so they are easier to use in a setting where you want to debug smaller samples with an IDE.

Executing testmarpagrammar.pl for this input

2*((5+2)*3)

gives the output below (after replacing the WS :discard rule with WS ~ [\s]+)

[
  [
    2,
    [
      "*",
      [
        "(",
        [[["(", [[5, []], ["+", [2, []]]], ")"], ["*", 3]], []],
        ")",
      ],
    ],
  ],
  [],
]

Another version of the answer to the ultimate question from THGTTG.

Known Bugs + Features

  • ANTLR4 supports generic regex expressions in its lexemes. If you try to translate a grammar that uses such expressions, translation will fail.

  • Mapping lazy to greedy quantifiers might break your grammar.

  • Antlr4 supplies an internal EOF symbol that does not have to be declared in the grammar. MarpaX::G4 creates an EOF lexeme "EOF ~ [\z]" when it encounters a reference to EOF. This might or might not work for you.

  • Redirected rules that go into the :discard section must be lexical rules. You might have to tweak the generated grammar post-translation to enforce this.

Author

Axel Zuber

Support

MarpaX::G4 comes without warranty. Support is provided on a volunteer basis through the standard mechanisms for CPAN modules.

Copyright and License

Copyright 2022 Axel Zuber
This file is part of MarpaX::G4. MarpaX::G4 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.

MarpaX::G4 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
Lesser General Public License for more details.

You should have received a copy of the GNU Lesser
General Public License along with MarpaX::G4.  If not, see
http://www.gnu.org/licenses/.