NAME

Regexp::Parsertron - Parse a Perl regexp into a data structure of type Tree

Warning: Development version. See "Version Numbers" for details.

Synopsis

This is scripts/synopsis.pl:

#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

# ---------------------

my($re)		= qr/Perl|JavaScript/i;
my($parser)	= Regexp::Parsertron -> new(verbose => 1);

# Return 0 for success and 1 for failure.

my($result) = $parser -> parse(re => $re);

say "Calling append(text => '|C++', uid => 6)";

$parser -> append(text => '|C++', uid => 6);
$parser -> print_raw_tree;
$parser -> print_cooked_tree;

my($as_string) = $parser -> as_string;

say "Original:  $re. Result: $result. (0 is success)";
say "as_string: $as_string";
say 'Perl error count:  ', $parser -> perl_error_count;
say 'Marpa error count: ', $parser -> marpa_error_count;

And its output:

Test count: 1. Parsing (in qr/.../ form): '(?^i:Perl|JavaScript)'.
Root. Attributes: {text => "Root", uid => "0"}
    |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
    |    |--- question_mark. Attributes: {text => "?", uid => "2"}
    |    |--- caret. Attributes: {text => "^", uid => "3"}
    |    |--- flag_set. Attributes: {text => "i", uid => "4"}
    |    |--- colon. Attributes: {text => ":", uid => "5"}
    |    |--- character_set. Attributes: {text => "Perl|JavaScript", uid => "6"}
    |--- close_parenthesis. Attributes: {text => ")", uid => "7"}
Calling append(text => '|C++', uid => 6)
Root. Attributes: {text => "Root", uid => "0"}
    |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
    |    |--- question_mark. Attributes: {text => "?", uid => "2"}
    |    |--- caret. Attributes: {text => "^", uid => "3"}
    |    |--- flag_set. Attributes: {text => "i", uid => "4"}
    |    |--- colon. Attributes: {text => ":", uid => "5"}
    |    |--- character_set. Attributes: {text => "Perl|JavaScript|C++", uid => "6"}
    |--- close_parenthesis. Attributes: {text => ")", uid => "7"}
Name                  Uid  Text
----                  ---  ----
open_parenthesis        1  (
question_mark           2  ?
caret                   3  ^
flag_set                4  i
colon                   5  :
character_set           6  Perl|JavaScript|C++
close_parenthesis       7  )
Original:  (?^i:Perl|JavaScript). Result: 0. (0 is success)
as_string: (?^i:Perl|JavaScript|C++)
Perl error count:  0
Marpa error count: 0

Note: The 1st tree is printed due to verbose => 1 in the call to "new([%opts])", while the 2nd is due to the call to "print_raw_tree()". The columnar output is due to the call to "print_cooked_tree()".

The Edit Methods

The term "edit methods" means any one or more of these methods, each of which can change the text of a node:

o "append(%opts)"
o "prepend(%opts)"
o "set(%opts)"

The edit methods are exercised in t/get.set.t, as well as scripts/synopsis.pl (above).
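
The effect of the edit methods can be modelled in a few lines of plain Perl. This is only a sketch: it is not the module's implementation, the nodes are a hand-built copy of the tree shown in the Synopsis, and the names find_node(), model_edit() and model_as_string() are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed tree: one hash per node, in tree order.
my @nodes = (
    {uid => 1, text => '('},
    {uid => 2, text => '?'},
    {uid => 3, text => '^'},
    {uid => 4, text => 'i'},
    {uid => 5, text => ':'},
    {uid => 6, text => 'Perl|JavaScript'},
    {uid => 7, text => ')'},
);

# Find a node by uid. Returns undef when there is no such node.
sub find_node {
    my ($uid)  = @_;
    my ($node) = grep { $$_{uid} == $uid } @nodes;

    return $node;
}

# Model of the three edit operations. Each changes the text, never the uid.
sub model_edit {
    my ($op, %opts) = @_;
    my $node = find_node($opts{uid}) or return;

    if    ($op eq 'append')  { $$node{text} .= $opts{text} }
    elsif ($op eq 'prepend') { $$node{text}  = $opts{text} . $$node{text} }
    elsif ($op eq 'set')     { $$node{text}  = $opts{text} }

    return $$node{text};
}

# Model of as_string(): join the node texts in tree order.
sub model_as_string { return join '', map { $$_{text} } @nodes }

model_edit('append', text => '|C++', uid => 6);
print model_as_string(), "\n";   # (?^i:Perl|JavaScript|C++)

model_edit('prepend', text => 'Ada|', uid => 6);
model_edit('set',     text => 'x',    uid => 4);
print model_as_string(), "\n";   # (?^x:Ada|Perl|JavaScript|C++)
```

Because the uids never change, the second and third edits can keep addressing nodes 6 and 4 after the first edit has been applied.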

Description

Parses a regexp into a tree object managed by the Tree module, and provides various methods for updating and retrieving that tree's contents.

This module uses Marpa::R2 and Moo.

Distributions

This module is available as a Unix-style distro (*.tgz).

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.

Installation

Install Regexp::Parsertron as you would any Perl module:

Run:

cpanm Regexp::Parsertron

or run:

sudo cpan Regexp::Parsertron

or unpack the distro, and then use:

perl Makefile.PL
make (or dmake or nmake)
make test
make install

Constructor and Initialization

new() is called as my($parser) = Regexp::Parsertron -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type Regexp::Parsertron.

Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. "re([$regexp])"]):

o re => $regexp

The does() method of Scalar::Does is called to see what re is. If it's already of the form qr/$re/, then it's processed as is, but if it's not, then it's transformed using qr/$re/.

Warning: Currently, the input is expected to have been pre-processed by Perl via qr/$regexp/.

Default: ''.

o verbose => $integer

Takes values 0, 1 or 2, which print more and more progress reports.

Used for debugging.

Default: 0 (print nothing).
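
The handling of re described above can be sketched with core Perl alone. The module itself uses Scalar::Does; the sketch below substitutes is_regexp() from the core re pragma, and the helper name to_qr() is invented. The stringified forms shown assume Perl 5.14 or later, where qr// stringifies with the caret.

```perl
use strict;
use warnings;
use re qw(is_regexp);   # Core helper: true for compiled regexps.

# Hypothetical helper: return a compiled regexp whatever form the input took.
sub to_qr {
    my ($re) = @_;

    return is_regexp($re) ? $re : qr/$re/;
}

my $from_qr     = to_qr(qr/Perl|JavaScript/i);
my $from_string = to_qr('Perl|JavaScript');

# Stringification shows the text which would be handed to a parser.
print "$from_qr\n";       # (?^i:Perl|JavaScript)
print "$from_string\n";   # (?^:Perl|JavaScript)
```

Note that a plain string carries no flags, so the second form has an empty flag set.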

Methods

append(%opts)

Append some text to the text of a node.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to append.

o uid => $uid

The uid of the node to update.

See scripts/synopsis.pl for sample code.

Note: Calling append() never changes the uids of nodes, so repeated calling of append() with the same uid will apply more and more updates to the same node.

See also "get($uid)", "prepend(%opts)", "set(%opts)" and t/get.set.t.

as_string()

Returns the parsed regexp as a string. The string contains all edits applied with methods such as "append(%opts)".

error_str()

Returns the last error, as a string.

Errors will be in 1 of 2 categories:

o Perl errors

These arise when Perl cannot interpret the string form of the regexp supplied by you, when the code checks it using qr/$re/.

o Marpa errors

These arise when the BNF within the module is such that the string form of the regexp cannot be parsed by Marpa.

If you can use the regexp in Perl code, then you should never get this error. In other words, if Perl accepts the regexp and the module does not, then the BNF in this module is wrong (barring bugs in Perl of course).

See also "marpa_error_count()", "perl_error_count()" and "warning_str()".
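
The first category, Perl errors, can be reproduced without the module. This sketch assumes only that the check is a qr/$re/ wrapped in eval, as described above; the helper name check_with_perl() is invented, and the exact wording of the message depends on the version of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: try to compile $string and return (count, message).
sub check_with_perl {
    my ($string) = @_;
    my $compiled = eval { qr/$string/ };

    return defined $compiled ? (0, '') : (1, $@);
}

for my $candidate ('Perl|JavaScript', '(unclosed') {
    my ($count, $message) = check_with_perl($candidate);

    print "'$candidate': Perl error count $count\n";
    print "  $message" if $count;
}
```

The second candidate has an unmatched parenthesis, so Perl refuses to compile it and the message from $@ is printed.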

get($uid)

Get the text of the node whose uid is $uid.

Returns undef if the given $uid is not found.

See also "append(%opts)", "prepend(%opts)" and "set(%opts)".

marpa_error_count()

Returns an integer count of errors detected by Marpa. This value should always be 0.

See also "error_str()", "perl_error_count()" and "warning_str()".

Used basically for debugging.

new([%opts])

Here, '[]' indicate an optional parameter.

See "Constructor and Initialization" for details on the parameters accepted by "new()".

parse([%opts])

Here, '[]' indicate an optional parameter.

Parses the regexp supplied with the parameter re in the call to "new()" or in the call to "re($regexp)", or in the call to parse(re => $regexp) itself. The last of these takes precedence.

The hash %opts takes the same (key => value) pairs as "new()" does.

See "Constructor and Initialization" for details.

perl_error_count()

Returns an integer count of errors detected by Perl. This value should always be 0.

See also "error_str()", "marpa_error_count()" and "warning_str()".

Used basically for debugging.

prepend(%opts)

Prepend some text to the text of a node.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to prepend.

o uid => $uid

The uid of the node to update.

Note: Calling prepend() never changes the uids of nodes, so repeated calling of prepend() with the same uid will apply more and more updates to the same node.

See also "append(%opts)", "get($uid)", "set(%opts)" and t/get.set.t.

print_cooked_tree()

Prints, in a pretty format, the tree built from parsing.

See the "Synopsis" for sample output.

See also "print_raw_tree()".

print_raw_tree()

Prints, in a simple format, the tree built from parsing.

See the "Synopsis" for sample output.

See also "print_cooked_tree()".

re([$regexp])

Here, '[]' indicate an optional parameter.

Gets or sets the regexp to be processed.

Note: re is a parameter to "new([%opts])".

reset()

Resets various internal things, except test_count.

Used basically for debugging.

set(%opts)

Set the text of a node to $opts{text}.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to use to overwrite the text of the node.

o uid => $uid

The uid of the node to update.

See also "append(%opts)", "prepend(%opts)" and "get($uid)".

tree()

Returns an object of type Tree. Ignore the root node.

Each node's meta method returns a hashref of information about the node. See the "FAQ" for details.

See also the source code for "print_cooked_tree()" and "print_raw_tree()" for ideas on how to use this object.
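
A minimal sketch of walking the tree follows. It makes several assumptions which should be checked against the source of "print_cooked_tree()": that traverse() in list context returns every node, that is_root() identifies the root, and that each node's name is available via value(). The module is loaded at run time so that the script still runs, printing only a notice, where the module is not installed.

```perl
use strict;
use warnings;

my @rows;

# Load the module at run time so this sketch still runs where it is absent.
if (eval { require Regexp::Parsertron; 1 }) {
    my $parser = Regexp::Parsertron -> new;

    # parse() returns 0 for success.
    if ($parser -> parse(re => qr/Perl|JavaScript/i) == 0) {
        for my $node ($parser -> tree -> traverse) {
            next if $node -> is_root;   # Ignore the root node.

            my $meta = $node -> meta;   # Hashref with keys text and uid.

            # Assumption: value() holds the node's name.
            push @rows, [$node -> value, $$meta{uid}, $$meta{text}];
        }
    }
}
else {
    warn "Regexp::Parsertron is not installed\n";
}

# One line per node: name, uid, text.
printf "%-20s %3s  %s\n", @$_ for @rows;
```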

uid()

Returns the last-used uid.

Each node in the tree is given a uid, which allows methods like "append(%opts)" to work.

verbose([$integer])

Here, '[]' indicate an optional parameter.

Gets or sets the verbosity level, within the range 0 .. 2. Higher numbers print more progress reports.

Used basically for debugging.

Note: verbose is a parameter to "new([%opts])".

warning_str()

Returns the last Marpa warning, as a string.

See also "error_str()", "perl_error_count()" and "marpa_error_count()".

FAQ

How do I use this module?

Herewith a brief tutorial.

o Start with a simple program and a simple regexp

This code, scripts/tutorial.pl, is a cut-down version of scripts/synopsis.pl:

#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

# ---------------------

my($re)		= qr/Perl|JavaScript/i;
my($parser)	= Regexp::Parsertron -> new(verbose => 1);

# Return 0 for success and 1 for failure.

my($result) = $parser -> parse(re => $re);

say "Original:  $re. Result: $result. (0 is success)";

Running it outputs:

Test count: 1. Parsing (in qr/.../ form): '(?^i:Perl|JavaScript)'.
Root. Attributes: {text => "Root", uid => "0"}
    |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
    |    |--- question_mark. Attributes: {text => "?", uid => "2"}
    |    |--- caret. Attributes: {text => "^", uid => "3"}
    |    |--- flag_set. Attributes: {text => "i", uid => "4"}
    |    |--- colon. Attributes: {text => ":", uid => "5"}
    |    |--- character_set. Attributes: {text => "Perl|JavaScript", uid => "6"}
    |--- close_parenthesis. Attributes: {text => ")", uid => "7"}
Original:  (?^i:Perl|JavaScript). Result: 0. (0 is success)

o Examine the tree and determine which nodes you wish to edit

The nodes are uniquely identified by their uids.

o Proceed as does scripts/synopsis.pl

Add these lines to the end of the tutorial code, and re-run:

$parser -> append(text => '|C++', uid => 6);
$parser -> print_raw_tree;

The extra output, showing node uid == 6, is:

Root. Attributes: {text => "Root", uid => "0"}
    |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
    |    |--- question_mark. Attributes: {text => "?", uid => "2"}
    |    |--- caret. Attributes: {text => "^", uid => "3"}
    |    |--- flag_set. Attributes: {text => "i", uid => "4"}
    |    |--- colon. Attributes: {text => ":", uid => "5"}
    |    |--- character_set. Attributes: {text => "Perl|JavaScript|C++", uid => "6"}
    |--- close_parenthesis. Attributes: {text => ")", uid => "7"}

o Test also with "prepend(%opts)" and "set(%opts)"

See t/get.set.t for sample code.

o Since everything works, make a cup of tea

What is the purpose of this module?

o To provide a stand-alone parser for regexps
o To help me learn more about regexps
o To become, I hope, a replacement for the horrendously complex Regexp::Assemble

What is the format of the nodes in the tree built by this module?

Each node's name is the name of the Marpa-style event which was triggered by detection of some text within the regexp.

Each node's meta method returns a hashref with these (key => value) pairs:

o text => $string

This is the text within the regexp which triggered the event just mentioned.

o uid => $integer

This is the unique id of the 'current' node.

This uid is often used by you to specify which node to work on.

See also the source code for "print_cooked_tree()" and "print_raw_tree()" for ideas on how to use this object.

See the "Synopsis" for sample code and a report after parsing a tiny regexp.

Does this module interpret regexps in any way?

No. You have to run your own Perl code to do that. This module just parses them into a data structure.

And that really means this module does not match the regexp against anything. If I appear to do that while debugging new code, you can't rely on that appearing in production versions of the module.

Does this module re-write regexps?

No, unless you call one of "The Edit Methods".

Does this module handle both Perl 5 and Perl 6?

No. It will only handle Perl 5 syntax.

Does this module handle regexps for various versions of Perl5?

Not yet. Version-dependent regexp syntax will be supported for recent versions of Perl. This will be done by having tokens within the BNF which are replaced at start-up time with version-dependent details.

There are no such tokens at the moment.

All debugging is done assuming the regexp syntax as documented online. See "References" for the urls in question.

So which version of Perl is supported?

I'm (2018-01-14) using Perl V 5.20.2 and making the BNF match the Perl regexp docs listed in "References" below.

Is this a (Marpa) exhaustion-hating or exhaustion-loving app?

Exhaustion-loving.

In short, Marpa will always report 'Marpa parse exhausted', but this is not an error.

See https://metacpan.org/pod/distribution/Marpa-R2/pod/Exhaustion.pod#Exhaustion

Will this code be modified to run under Marpa::R3 when the latter is stable?

Yes.

Scripts

This diagram indicates the flow of logic from script to script:

xt/author/re_tests
|
V
xt/author/generate.tests.pl
|
V
xt/author/perl-5.21.11.tests
|
V
perl -Ilib t/perl-5.21.11.t > xt/author/perl-5.21.11.log 2>&1

If xt/author/perl-5.21.11.log only contains lines starting with 'ok', then all Perl and Marpa errors have been hidden, so t/perl-5.21.11.t is ready to live in t/. Before that time it lives in xt/author/.
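
The check on the log can be automated along these lines. This is a sketch only: the helper name log_is_clean() is invented, the sample lines are made up rather than taken from a real log, and blank lines are ignored as a convenience.

```perl
use strict;
use warnings;

# Hypothetical helper: true when every non-blank line starts with 'ok'.
sub log_is_clean {
    my (@lines) = @_;

    return ! grep { /\S/ && ! /^ok\b/ } @lines;
}

# Made-up sample data standing in for the contents of the log file.
my @good = ("ok 1 - simple\n", "ok 2 - flags\n");
my @bad  = (@good, "not ok 3 - lookbehind\n");

for my $log ([good => \@good], [bad => \@bad]) {
    my ($name, $lines) = @$log;

    print "$name log: ", (log_is_clean(@$lines) ? 'ready for t/' : 'stays in xt/author/'), "\n";
}
```

To check the real file, read its lines from xt/author/perl-5.21.11.log and pass them to the helper.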

TODO

o How to best define 'code' in the BNF.
o Things to be aware of:
    o Regexps of the form: /.../aa
    o Pragmas of the form: use re '/aa'; ...
o I could traverse the tree and store a pointer to each node in an array

This would mean fast access to nodes in random order.

References

http://www.pcre.org/. PCRE - Perl Compatible Regular Expressions.

http://perldoc.perl.org/perlre.html. This is the definitive document.

http://perldoc.perl.org/perlrecharclass.html#Extended-Bracketed-Character-Classes.

http://perldoc.perl.org/perlretut.html. Samples with commentary.

http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators

http://perldoc.perl.org/perlrequick.html

http://perldoc.perl.org/perlrebackslash.html

http://www.nntp.perl.org/group/perl.perl5.porters/2016/02/msg234642.html

See Also

Graph::Regexp

Regexp::Assemble

Regexp::Debugger

Regexp::ERE

Regexp::Keywords

Regexp::Lexer

Regexp::List

Regexp::Optimizer

Regexp::Parser

Regexp::SAR. This is vaguely a version of Set::FA::Element.

Regexp::Stringify

Regexp::Trie

And many others...

Machine-Readable Change Log

The file Changes was converted into Changelog.ini by Module::Metadata::Changes.

Version Numbers

Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.

Repository

https://github.com/ronsavage/Regexp-Parsertron

References

https://code.activestate.com/lists/perl5-porters/209610/

https://stackoverflow.com/questions/46200305/a-strict-regular-expression-for-matching-chemical-formulae

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=Regexp::Parsertron.

Author

Regexp::Parsertron was written by Ron Savage <ron@savage.net.au> in 2011.

Marpa's homepage: http://savage.net.au/Marpa.html.

My homepage.

Copyright

Australian copyright (c) 2016, Ron Savage.

All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License 2.0, a copy of which is available at:
http://opensource.org/licenses/alphabetical.