NAME
MarpaX::ESLIF::Introduction - MarpaX::ESLIF's Introduction
VERSION
version 5.0.4
DESCRIPTION
MarpaX::ESLIF is a Scanless Interface expressed in a BNF format, that is using marpaWrapper, itself being a thin interface on top of libmarpa parser.
The MarpaX::ESLIF::BNF is inspired from Marpa::R2's DSL, though with some incompatible changes and add-ons:
- Native regular expression support, including regex callbacks into user-space
- Syntactic exception
- Unlimited number of sub-grammars
- Streaming compatible architecture
- Zero-length symbols
- Embedded lua language into the grammar
- Bindings to java, lua and perl languages
- JSON (strict and extended) grammars for decoding/encoding
The following sections are a general presentation of MarpaX::ESLIF architecture, features, things-to-know.
EXAMPLE
Please look at MarpaX::ESLIF::Tutorial::Calculator.
ARCHITECTURE
Grammars
The ESLIF is nothing else but a sparse array of grammars, identified by an indice called level and starting with value 0, or a description:
[------------------------------------------------------------------]
| Indice | Level 0 | N/A | Level 2 | Level 3 | N/A | Level 5 | ... |
| Name | nameof0 | N/A | nameof2 | nameof3 | N/A | nameof5 | ... |
[------------------------------------------------------------------]
There must be a grammar at level indice 0. Then any grammar can access any symbol of any other grammar:
[-------------------------------------------------------------------]
| Indice: | Level 0 | N/A | Level 2 | Level 3 | N/A | Level 5 | ... |
| Name : | nameof0 | N/A | nameof2 | nameof3 | N/A | nameof5 | ... |
[-------------------------------------------------------------------]
| Symbol: | +>X +>X +>Xx |
| Symbol: | | Y | Y | Yy |
| Symbol: | | | | | | | +>Zzz |
| | | | | | | | | | |
| | |____________| |______| |_____________| | | |
| |______________________________________________| | |
[-------------------------------------------------------------------]
If we note a symbol in the form S[i], meaning symbol S of grammar level i, then the schema above say that Y[0] is a reference to X[2], that Y[2] is a reference to Xx[3], that Yy[3] is a reference to Zzz[5], and that Zzz[5] is a reference to Y[0]. Any symbol of any grammar that is accessed via a reference is considered being part of a lexing phase, and the user will have no control until this phase is over, this symbol being recognized or not.
This is why it is required that grammar at level 0 exist: it is considered by the author a common practice that the top level grammar should be at level 0. Though technically this is not absolutely required -; In fact, it is possible to start parsing by specifying another grammar but the one at level 0 as a starting grammar.
Recognizer
Top recognizer
The lifetime of parsing, for a given location in the top-level grammar, consist by assigning a set of symbols (these are called alternatives), commiting them (we say that we complete the set of alternatives), and move on. A Scanless Interface mean that you do not have to write your own analysis of input: grammars give definition of what is expected, and the interface have the possbility to determine all the alternatives for you, commiting them, and move on in the input stream. We say input stream: this is another dimension (we suppose from now on that the top-level grammar is at level 0):
#
# Note
#
# ::= is an alias for grammar level 0
# ~ is an alias for grammar level 1
# :[n]:= is the generic form for grammar level n
#
[---------------------------------------------------] STREAM MANAGEMENT
| Rule is X ::= x y |
[---------------------------------------------------] STEP 0
| Location is start of rule X[0]: |
| X ::= . x y |
| Suppose that expected "terminals" are T1 and T2: |
[---------------------------------------------------] STEP 1
| Try to match T1 |
| Nothing yet in the stream ? |<-----> Stream reader callback
| T1 may match but we are not sure |<-----> Stream reader callback
| Repeat until T1 matches for sure or not |
[---------------------------------------------------] STEP 2
| Try to match T2 |
| T2 may match but we are not sure |<-----> Stream reader callback
| Repeat until T2 matches for sure or not |
[---------------------------------------------------]
| No match ? End of scanning | STEP 3
| Match ? Commit T1 and T2 and continue |
[---------------------------------------------------]
Sub-recognizers
The stream management mentionned above is transversal to any grammar: As soon as "terminal" is in reality a referenced symbol, a sub-recognizer is instanciated and it is sharing the stream with is parent:
TOP RECOGNIZER ON GRAMMAR LEVEL 0
[---------------------------------------------------] STREAM MANAGEMENT
| Rule is X ::= x y |
[---------------------------------------------------] STEP 0.0
| Location is start of rule X[0]: |
| X ::= . x y |
| Suppose that expected "terminals" are T1 and T2: |
[---------------------------------------------------] STEP 0.1
| Try to match T1 |
| Nothing yet in the stream ? |<-----> Stream reader callback
| T1 may match but we are not sure |<-----> Stream reader callback
| Repeat until T1 matches for sure or not |
[---------------------------------------------------] STEP 0.2
| Try to match T2 |
| T2 is a referenced symbol in grammar n |
[---------------------------------------------------]
SUB-RECOGNIZER ON GRAMMAR LEVEL n
[-------------------------------------------------]
| Rule is T2 :[n]:= a b |
[-------------------------------------------------] STEP 1.0
| Location is start of rule T2[n]: |
| T2 :[n]:= . a b |
| Suppose that expected "terminals" are U1 and U2:|
[-------------------------------------------------] STEP 1.1
| Try to match U1 |
| Nothing yet in the stream ? |<-----> Stream reader callback
| U1 may match but we are not sure |<-----> Stream reader callback
| Repeat until U1 matches for sure or not |
[-------------------------------------------------] STEP 1.2
| Try to match U2 |
| U2 may match but we are not sure |<-----> Stream reader callback
| Repeat until U2 matches for sure or not |
[-------------------------------------------------]
| No match ? End of scanning for T2[n] | STEP 1.3
| Match ? Commit U1 and/or U2 and continue |
[-------------------------------------------------]
| Do internal valuation | STEP 1.4
[-------------------------------------------------]
BACK TO TOP RECOGNIZER ON GRAMMAR LEVEL 0
[---------------------------------------------------]
| No match ? End of scanning | STEP 0.3
| Match ? Commit T1 and/or T2 and continue |
| If T2 matches it is a parse tree value |
[---------------------------------------------------]
And this is recursive: there will as many sub-recognizers instanciated as there are sub-grammars involved. For instance if terminal U2
above is a referenced symbol at grammar level l
, a second sub-recognizer will be instanced by the first sub-recognizer. Every child recognizer is sharing all needed transveral information, that is everything about stream management. The main difference between the top recognizer and any child recognizer is that a child recognizer is always doing an internal valuation to retreive the span in the input stream for, and give that back to its parent.
The internal valuation is a forced mode that is concatenating all matched bytes in the input stream.
Discard and sub-recognizers
You might say, why explicitely doing an internal valuation: the match is where sub-recognizer started and where it ended. No, because any grammar can have it own discard mechanism. This mean that what a sub-recognizer matched may be shorter than the number of bytes effectively consumed from the input stream. So, we have just introduced the notion of discard:
discard is yet another symbol in any grammar, but with special semantic, and its name in the BNF is always :discard
. For example:
:discard ::= whitespace
whitespace ~ /[\s]+/
mean that grammar at level 0 always try to match the whitespace
symbol when it failed to match any of the expected terminals.
As soon as there is no match, and if :discard
rule exist, any recognizer is always trying to get a match on it using a sub-recognizer, exactly like when it is executing a sub-recognizer for a terminal referencing a symbol in another grammar. Furthermore nothing distinguishes the special symbol :discard
from the others: it can also reference any symbol in any other sub-grammar. Though there is a major difference between discard sub-recognizers and terminal sub-recognizers: a discard sub-recognizer will never instanciate another discard sub-sub-recognizer. This mean that in the following:
:discard ::= whitespace
:discard ~ somethingelse
whitespace ~ /[\s]+/
somethingelse ~ 'a string'
if a discard tentative is instancuated on grammar at level 0 using the symbol whitespace
of level 1, and if whitespace
of level 1 does not match, there will be no tentative for try to discard in level 1, even it is has a :discard
rule that is defined to be somethingelse
.
STREAMING, CHARACTER AND BINARY MODES
Everytime any recognizer need more data, a callback to userspace is triggered. It is legal to not give the encoding when it is a character stream, then the engine will guess (the user should give enough data so that the guess is correct, though).
Internally, all chunks of characters are converted to UTF-8. This guarantees three things:
- Validation of well-formed characters
- uniform internal processing
- native compatibility with the regular expression engine
A recognizer always owns a stream, the later is shared in two cases:
- Lexeme search
-
A sub-recognizer is started, and it shares the stream with its parent. Nevertheless parent stream is guaranteed to never crunch any data until this sub-recognizer finishes. At most, new data may be appended. When this sub-recognizer finishes, it updates the parent position in the stream if the lexeme it was looking for is found. The end-user never has control on such sub-recognizer.
-
The end-user can create a new top-level recognizer that shares the stream with another top-level recognizer. Then, it is guaranteed that everytime one of them updates its stream, the other's stream changes the same way.
TERMINALS AND REGULAR EXPRESSIONS
As mentionned above, regular expression are totally handled using PCRE2. Therefore the syntax of regular expression is the PCRE2 syntax. It is obvious that a regular expression define an internal "terminal", and there are three ways to define such a terminal, all of them being converted to a regular expression:
Each of these three terminal types support eventual modifiers. The most central modifier is the need or not of having the notion of "valid characters", especially outside of the ASCII range. This is called the PCRE2_UTF flag, and is mentionned thoroughly in the next sections.
- String
-
A string is delimited expression in the grammar, where allowed start/and delimiters are
''
and""
. When a string is recognized in the grammar, escaping is allowed using the backslash\
character, and only the start delimited or backslash itself can be escaped. Absolutely any other character is takenas is
, eventually internally escaped by MarpaX::ESLIF to remove its signification in PCRE2, when there is one. For example:- 'Example'
-
is translated to the UTF-8 pattern
Example
- '{Example}'
-
is translated to the UTF-8 pattern
\{Example
- "{Example}"
-
is translated to the UTF-8 pattern
\{Example
- '{Example[]\}'
-
will trigger an error because only
'
or\
itself can be backslashed. - '{Example[]\\}'
-
is translated to the UTF-8 pattern
\{Example\[]\\}
- 'Black Heart Suite Character: ♥'
-
is translated to the UTF-8 pattern
Black Heart Suite Character: \x{2665}
A string is always scanned character per character by MarpaX::ESLIF, and an ASCII compatible pattern is generated, using \x{...} codepoint notation whenever this is an ASCII special character or a character outside of original ASCII 7-bits character set. So MarpaX::ESLIF know if there is need for unicode support or not in PCRE2 terminology (which is: any code point greater than 255, else byte matching is enough). This is important because PCRE2 yells if a pattern is using a large codepoint and if this internal PCRE2_UTF flag is not set accordingly.
The presence of this flag has an important consequence: if at least one string in the grammar implied the PCRE2_UTF flag, then the whole remaining chunk of data is translated and validated as an UTF-8 sequence of bytes. In such ca case, either the user input reader informed that this is stream of characters, then MarpaX::ESLIF prepared in advance the conversion/validation to UTF-8, either this is done lazily as soon as a match is attempted using a string requiring the PCRE2_UTF flag.
- String modifiers
-
String modifiers must be appended directly after the end delimiter of the string. They are restricted to
:i
, meaning that the match is caseless sensitive:- 'Black Heart Suite Character: ♥':i
-
A dump of it in terms of PCRE2 (c.f. the API specification for dump facility) would show the
PCRE2_CASELESS
flag:# Pattern: Black Heart Suite Character: \x{2665} # Flags: PCRE2_ANCHORED|PCRE2_CASELESS|PCRE2_UTF
You notice the presence of:
PCRE2_ANCHORED
-
Strings are always anchored at the point where match is attempted.
PCRE2_UTF
-
This flag is automatically set when the scanning of the string that is in the grammar, done internally by MarpaX::ESLIF, reveal the need for it.
- 'Example':i
-
would give the following dump:
# Pattern: Example # Flags: PCRE2_ANCHORED|PCRE2_CASELESS
- Character class
-
A character class is very closed to a regular expression (see later), except that it looks like a string, with start/end delimiters being
[]
, and that the pattern is NOT scanned. MarpaX::ESLIF will lets PCRE2 eventually yell if there is a use of codepoints and if the internal PCRE2_UTF flag is not set.MarpaX::ESLIF will try to guess the need for PCRE2_UTF flag by scanning the UTF-8 bytes composing the character class, but will do no modification. For example:
- [a-z]
-
will be dumped as:
# Pattern: # 0x000000: 5b 61 2d 7a 5d [a-z] # Flags: PCRE2_ANCHORED
- [a-z♥]
-
is dumped as:
# Pattern: # 0x000000: 5b 61 2d 7a e2 99 a5 5d [a-z...] # Flags: PCRE2_ANCHORED|PCRE2_UTF
You notice that the sequence
e299a5
that is the UTF-8 representation of the Black Heart Suite Character. MarpaX::ESLIF detected itas an explicit character
, so it was able to put the PCRE2_UTF flag automatically. But this will not work if you are using codepoints: - [a-z\x{2665}]
-
will yield automatically the following error, and this will come from the PCRE2 engine itself:
/[a-z\x{2665}]/: pcre2_compile failure at offset 11: character code point value in \x{} or \o{} is too large.
So there is a need for a modifier. Please see the section on "Character class and Regular expression modifiers". For instance, here, one would say:
- [a-z\x{2665}]:u
-
leaving to the following dump:
# 0x000000: 5b 61 2d 7a 5c 78 7b 32 36 36 35 7d 5d [a-z\x{2665}] # Flags: PCRE2_ANCHORED|PCRE2_UTF
- Regular expression
-
Nothing really distinguished regular expression and character classes in the grammar, except that sequence modifiers can be embedded directly in a regular expression, so that they are managed by PCRE2 instead of MarpaX::ESLIF, i.e:
- /[a-z]/
-
is stricly equivalent to the character
[a-z]
. - /[a-z]+/
-
really mean that the sequence is embedded in the regular expression. The dump of the later would say:
# Pattern: # 0x000000: 5b 61 2d 7a 5d 2b [a-z]+
In conclusion determining the need of the PCRE2_UTF8 will always be exact: either MarpaX::ESLIF will detect it correctly, either PCRE2 will yell, and you will have to explicitely set it using modifiers. Since character class is nothing else but a regular expression limited to a range of character, they both share the same possible set of modifiers.
- Character class and Regular expression modifiers
-
The only difference between the twos is how modifiers are expressed: for a character class they must be preceeded by the
:
character, while for a regular expression they can be set directly after the/
end delimiter (as in the Perl language).The explicit regular expression, being sent directly as-is to PCRE2, support de-facto all of the native PCRE2 pattern language, i.e. one can set regular expression options that have no single option equivalent when using a regular expression, for example:
- /(*LIMIT_MATCH=15)[a-z]+/
-
is setting an internal PCRE2 match limit to 15. The dump does not show that as an explicit flag:
# Pattern: # 0x000000: 28 2a 4c 49 4d 49 54 5f 4d 41 54 43 48 3d 31 35 (*LIMIT_MATCH=15 # 0x000010: 29 5b 61 2d 7a 5d 2b )[a-z]+ # Flags: PCRE2_ANCHORED
It is highly recommended to read the pcre2pattern documentation to know all the possible settings that can be embedded into the regular expression. Explicit modifiers are insipired by the jpcre2 and Perl language (most of the descriptions below are copy/pasted from jpcre2):
- e
-
Set the
PCRE2_MATCH_UNSET_BACKREF
flag.Unset back-references in the pattern will match to empty strings.
- i
-
Set the
PCRE2_CASELESS
flag.Case-insensitive.
- j
-
Set the
PCRE2_ALT_BSUX|PCRE2_MATCH_UNSET_BACKREF
flags.\u
,\U
and\x
and unset back-references will act as JavaScript standard:\U
-
Matches an upper case "U" character (by default it causes a compile error if this option is not set).
\u
-
Matches a lower case "u" character unless it is followed by four hexadecimal digits, in which case the hexadecimal number defines the code point to match (by default it causes a compile error if this option is not set).
\x
-
Matches a lower case "x" character unless it is followed by two hexadecimal digits, in which case the hexadecimal number defines the code point to match (By default, as in Perl, a hexadecimal number is always expected after
\x
, but it may have zero, one, or two digits (so, for example,\xz
matches a binary zero character followed by z) ).
Unset back-references in the pattern will match to empty strings.
- m
-
Set the
PCRE2_MULTILINE
flag.Multi-line regex.
- n
-
Set the
PCRE2_UCP
flag.Enable Unicode properties and extend meaning of meta-characters.
- s
-
Set the
PCRE2_DOTALL
flag.If this modifier is set, a dot meta-character in the pattern matches all characters, including newlines.
- x
-
Set the
PCRE2_EXTENDED
flag.Whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, enables commentary in pattern.
For parsing reasons, it can be the parsing of our grammar fail if your are using the / character into embedded regular expression comments, though -;
- D
-
Set the
PCRE2_DOLLAR_ENDONLY
flag.A dollar meta-character in the pattern matches only at the end of the subject string. Without this modifier, a dollar also matches immediately before the final character if it is a newline (but not before any other newlines). This modifier is ignored if m modifier is set.
- J
-
Set the
PCRE2_DUPNAMES
flag.Allow duplicate names for sub-patterns.
- U
-
Set the
PCRE2_UNGREEDY
flag.This modifier inverts the "greediness" of the quantifiers so that they are not greedy by default, but become greedy if followed by ?.
- a
-
Unset the
PCRE2_UTF
flag.Only byte matching will work.
- N
-
Unset the
PCRE2_UCP
flag.Meta-characters will be limited to their ASCII equivalent.
- u
-
Set the
PCRE2_UTF
flag.Forces support of large codepoints.
- b
-
Unset the
PCRE2_UTF
flag, then set thePCRE2_NEVER_UTF
flag.Could mean "forced binary" mode.
- c
-
Unset the
PCRE2_NEVER_UTF
flag, then set thePCRE2_UTF
flag.Could mean "forced unicode character" mode.
- A
-
Unset the
PCRE2_ANCHORED
flag.Dangerous option, in case you want to do look-behind. Since the user has no control on how MarpaX::ESLIF manages its stream buffers, in theory it is safe to use this option only if the whole data is sent to MarpaX::ESLIF a single buffer.
Modifiers are always executed in order, and before the regular expression is compiled: if a regular expression compilation fail, corresponding message sent by PCRE2 will be echoed to your logger, with the
GENERICLOGGER_LOGLEVEL_ERROR
level. You will notice that all modifiers in common to jpcre2 have the same meaning, except for the n modifier: n in jpcre2 is nu for us.
GRAMMAR
Please refer to MarpaX::ESLIF::BNF for the ESLIF BNF expressed with itself. Fundamentals of the ESLIF grammar are the followings (incompatible change v.s. Marpa::R2's DSL are highlighted):
- Grammar levels
-
The number of levels is only limited by memory for your program -; Any symbol that have an impact on grammar at level
n
must be defined with such level explicitely:As a consequence the definition of a
:discard
symbol is incompatible with Marpa::R2's DSL, in which a discard rule affecting level 0 have the alias~
, for ESLIF it is::=
. - built-in actions and adverb lists
-
Any Marpa::R2's action or adverb list that require the Perl langage has been removed, for example
::array
,::bless
- LATM is true by default
-
LATM (Longest Acceptable Token Match) is preventing the scanner to push alternatives of lower length than the longest of the alternatives.
- pausing is allowed with discard events
:default
statement is unique per level-
... instead of being lexically scoped with Marpa::R2's DSL.
- syntactic exception is supported
-
... and it is managed at valuation phase.
- native regular expression support (via PCRE2)
- comments are extended to
C++
/C
styles
VERSIONING
MarpaX::ESLIF follows the Semantic Versioning 2.0.0, i:e:
- MAJOR for incompatible API changes
- MINOR for added functionality in a backwards-compatible manner
- PATCH for backwards-compatible bug fixes
BUILD
You must use a CMake version 3 at least. Recommended pattern is:
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DALL_IN_ONE=TRUE -DTCONV_USE_ICU=NO .
make
make check
SEE ALSO
MarpaX::ESLIF::BNF, MarpaX::ESLIF::Tutorial::Calculator, PCRE2, jpcre2.
NOTES
The perl interface is using an all-in-one version of the underlying marpaESLIF C library. This mean that character conversion relies on libiconv to do character translation if needed.
AUTHOR
Jean-Damien Durand <jeandamiendurand@free.fr>
COPYRIGHT AND LICENSE
This software is copyright (c) 2017 by Jean-Damien Durand.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.