NAME

Regexp::Parsertron - Parse a Perl regexp into a data structure of type Tree

Warning: Development version. See "Version Numbers" for details.

Synopsis

This is scripts/synopsis.pl:

#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

# ---------------------

my($re)		= qr/Perl|JavaScript/i;
my($parser)	= Regexp::Parsertron -> new(verbose => 1);

# Return 0 for success and 1 for failure.

my($result) = $parser -> parse(re => $re);

say "Calling append(text => '|C++', uid => 6)";

$parser -> append(text => '|C++', uid => 6);
$parser -> print_raw_tree;
$parser -> print_cooked_tree;

my($as_string) = $parser -> as_string;

say "Original:  $re. Result: $result. (0 is success)";
say "as_string: $as_string";

And its output:

Test count: 1. Parsing (in qr/.../ form): '(?^i:Perl|JavaScript)'.
Root. Attributes: {text => "Root", uid => "0"}
    |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
    |    |--- question_mark. Attributes: {text => "?", uid => "2"}
    |    |--- caret. Attributes: {text => "^", uid => "3"}
    |    |--- flag_set. Attributes: {text => "i", uid => "4"}
    |    |--- colon. Attributes: {text => ":", uid => "5"}
    |    |--- character_set. Attributes: {text => "Perl|JavaScript", uid => "6"}
    |--- close_parenthesis. Attributes: {text => ")", uid => "7"}
Calling append(text => '|C++', uid => 6)
Root. Attributes: {text => "Root", uid => "0"}
    |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
    |    |--- question_mark. Attributes: {text => "?", uid => "2"}
    |    |--- caret. Attributes: {text => "^", uid => "3"}
    |    |--- flag_set. Attributes: {text => "i", uid => "4"}
    |    |--- colon. Attributes: {text => ":", uid => "5"}
    |    |--- character_set. Attributes: {text => "Perl|JavaScript|C++", uid => "6"}
    |--- close_parenthesis. Attributes: {text => ")", uid => "7"}
Name                  Uid  Text
----                  ---  ----
open_parenthesis        1  (
question_mark           2  ?
caret                   3  ^
flag_set                4  i
colon                   5  :
character_set           6  Perl|JavaScript|C++
close_parenthesis       7  )
Original:  (?^i:Perl|JavaScript). Result: 0. (0 is success)
as_string: (?^i:Perl|JavaScript|C++)

Note: The 1st tree is printed due to verbose => 1 in the call to "new([%opts])", while the 2nd is due to the call to "print_raw_tree()". The columnar output is due to the call to "print_cooked_tree()".

The Edit Methods

The term 'edit methods' simply means any one or more of these methods, all of which can change the text of a node:

o "append(%opts)"
o "prepend(%opts)"
o "set(%opts)"

The edit methods are exercised in t/get.set.t, as well as scripts/synopsis.pl (above).

Description

Parses a regexp into a tree object managed by the Tree module, and provides various methods for updating and retrieving that tree's contents.

This module uses Marpa::R2 and Moo.

Distributions

This module is available as a Unix-style distro (*.tgz).

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.

Installation

Install Regexp::Parsertron as you would any Perl module:

Run:

cpanm Regexp::Parsertron

or run:

sudo cpan Regexp::Parsertron

or unpack the distro, and then use:

perl Makefile.PL
make (or dmake or nmake)
make test
make install

Constructor and Initialization

new() is called as my($parser) = Regexp::Parsertron -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type Regexp::Parsertron.

Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. "re([$regexp])"]):

o re => $regexp

The does() method of Scalar::Does is called to see what re is. If it's already of the form qr/$re/, then it's processed as is, but if it's not, then it's transformed using qr/$re/.

Warning: Currently, the input is expected to have been pre-processed by Perl via qr/$regexp/.

Default: ''.

o verbose => $integer

Takes values 0, 1 or 2, which print more and more progress reports.

Used for debugging.

Default: 0 (print nothing).
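
As a rough sketch of the normalization described for re, the code below turns either a plain string or a qr/.../ object into the qr/.../ form. It uses the core ref() function as a stand-in for the does() method of Scalar::Does, and to_qr() is a hypothetical helper, not part of this module. The printed forms assume Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Regexp::Parsertron.
# Return $input as is if it is already a qr/.../ object,
# otherwise compile it with qr/$input/.

sub to_qr
{
	my($input) = @_;

	return ref($input) eq 'Regexp' ? $input : qr/$input/;
}

my($from_string) = to_qr('Perl|JavaScript');
my($from_qr)     = to_qr(qr/Perl|JavaScript/i);

print "From string: $from_string\n"; # (?^:Perl|JavaScript)
print "From qr//:   $from_qr\n";     # (?^i:Perl|JavaScript)
```

The stringified form is what the parser reports in the line 'Parsing (in qr/.../ form)' shown in the Synopsis.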

Methods

append(%opts)

Append some text to the text of a node.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to append.

o uid => $uid

The uid of the node to update.

The code calls die() if %opts does not have these 2 keys, or if either value is undef.

See scripts/synopsis.pl for sample code.

Note: Calling append() never changes the uids of nodes, so repeated calling of append() with the same uid will apply more and more updates to the same node.

See also "prepend(%opts)", "set(%opts)" and t/get.set.t.
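
A minimal sketch of the note above, assuming the tree shown in the Synopsis, where the character_set node has uid 6. The second alternative, '|Rust', is just an illustrative value.

```perl
#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

my($parser) = Regexp::Parsertron -> new;
my($result) = $parser -> parse(re => qr/Perl|JavaScript/i);

die "Parse failed\n" if ($result != 0);

# uid 6 is the character_set node in the Synopsis tree.
# Appending does not renumber nodes, so the same uid is re-used.

$parser -> append(text => '|C++', uid => 6);
$parser -> append(text => '|Rust', uid => 6);

my($as_string) = $parser -> as_string;

say "as_string: $as_string";
```

Both edits should accumulate on the same node, so as_string() is expected to return both additions in the order they were applied.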

as_string()

Returns the parsed regexp as a string. The string contains all edits applied with methods such as "append(%opts)".

find($string)

Returns an arrayref of node uids whose text contains the given string.

The code calls die() if $string is undef.

If the arrayref is empty, there were no matches.

This method uses the Perl index() function to test if $string is a substring of the text of each node. Regexps are not used by this method.

See scripts/play.pl and t/get.set.t for sample usage of find().

See also "get($uid)".
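
To illustrate why the use of index() matters, here is a stand-alone sketch which does not use this module. The hash %text_of is a hand-copied model of the Synopsis tree, and find_uids() is a hypothetical helper which mimics the substring test described above. Because no regexp is involved, characters such as '|' and '^' need no escaping.

```perl
use strict;
use warnings;

# Hand-copied model of the Synopsis tree: uid => text.

my(%text_of) =
(
	1 => '(',
	2 => '?',
	3 => '^',
	4 => 'i',
	5 => ':',
	6 => 'Perl|JavaScript',
	7 => ')',
);

# Hypothetical helper: return an arrayref of the uids whose text
# contains $target as a literal substring.

sub find_uids
{
	my($target, $text_of) = @_;

	die "Target is undef\n" if (! defined $target);

	return [grep{index($$text_of{$_}, $target) >= 0} sort{$a <=> $b} keys %$text_of];
}

my($pipe)  = find_uids('|', \%text_of);
my($caret) = find_uids('^', \%text_of);
my($none)  = find_uids('Ruby', \%text_of);

print "Pipe:  @$pipe\n";               # 6
print "Caret: @$caret\n";              # 3
print 'None:  ', scalar @$none, "\n";  # 0
```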

get($uid)

Get the text of the node with the given $uid.

The code calls die() if $uid is undef, or outside the range 1 .. $self -> uid. The latter value is the highest uid so far assigned to any node.

Returns undef if the given $uid is not found.

See also "find($string)".
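
A short sketch combining find() and get(), again assuming the regexp and the uids from the Synopsis:

```perl
#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

my($parser) = Regexp::Parsertron -> new;

die "Parse failed\n" if ($parser -> parse(re => qr/Perl|JavaScript/i) != 0);

# find() returns an arrayref of uids. get() maps each uid back to its text.

my($uids) = $parser -> find('Perl');

for my $uid (@$uids)
{
	my($text) = $parser -> get($uid);

	say "uid: $uid. text: $text";
}
```

With the Synopsis tree, the only node whose text contains 'Perl' is the character_set node.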

new([%opts])

Here, '[]' indicate an optional parameter.

See "Constructor and Initialization" for details on the parameters accepted by "new()".

parse([%opts])

Here, '[]' indicate an optional parameter.

Parses the regexp supplied with the parameter re in the call to "new()" or in the call to "re($regexp)", or in the call to parse(re => $regexp) itself. The last of these takes precedence.

The hash %opts takes the same (key => value) pairs as "new()" does.

See "Constructor and Initialization" for details.
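
The precedence rule can be sketched as follows. The regexp given to new(), qr/Ruby/, is an arbitrary placeholder which is expected to be overridden by the one given to parse().

```perl
#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

# The regexp passed to new() is a placeholder.

my($parser) = Regexp::Parsertron -> new(re => qr/Ruby/);

# The regexp passed to parse() should take precedence.
# parse() returns 0 for success and 1 for failure.

my($result)    = $parser -> parse(re => qr/Perl|JavaScript/i);
my($as_string) = $result == 0 ? $parser -> as_string : '';

say $result == 0 ? "Parsed: $as_string" : 'Parse failed';
```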

prepend(%opts)

Prepend some text to the text of a node.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to prepend.

o uid => $uid

The uid of the node to update.

The code calls die() if %opts does not have these 2 keys, or if either value is undef.

Note: Calling prepend() never changes the uids of nodes, so repeated calling of prepend() with the same uid will apply more and more updates to the same node.

See also "append(%opts)", "set(%opts)", and t/get.set.t.

print_cooked_tree()

Prints, in a pretty format, the tree built from parsing.

See the "Synopsis" for sample output.

See also "print_raw_tree()".

print_raw_tree()

Prints, in a simple format, the tree built from parsing.

See the "Synopsis" for sample output.

See also "print_cooked_tree()".

re([$regexp])

Here, '[]' indicate an optional parameter.

Gets or sets the regexp to be processed.

Note: re is a parameter to "new([%opts])".

reset()

Resets various internal things, except test_count.

Used basically for debugging.

set(%opts)

Set the text of a node to $opts{text}.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to use to overwrite the text of the node.

o uid => $uid

The uid of the node to update.

The code calls die() if %opts does not have these 2 keys, or if either value is undef.

See also "append(%opts)" and "prepend(%opts)".
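
A small sketch of prepend() followed by set(), assuming the Synopsis tree, where uid 6 is the character_set node. The replacement texts are arbitrary examples.

```perl
#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

my($parser) = Regexp::Parsertron -> new;

die "Parse failed\n" if ($parser -> parse(re => qr/Perl|JavaScript/i) != 0);

# prepend() adds text in front of the node's existing text.

$parser -> prepend(text => 'Python|', uid => 6);

my($after_prepend) = $parser -> as_string;

# set() overwrites the node's text, discarding earlier edits to that node.

$parser -> set(text => 'C|C++', uid => 6);

my($after_set) = $parser -> as_string;

say "After prepend: $after_prepend";
say "After set:     $after_set";
```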

tree()

Returns an object of type Tree. Ignore the root node.

Each node's meta method returns a hashref of information about the node. See the "FAQ" for details.

See also the source code for "print_cooked_tree()" and "print_raw_tree()" for ideas on how to use this object.
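
As a hedged sketch of walking the tree yourself, the code below relies on these methods of the Tree module: traverse(), is_root(), value() and meta(). It assumes that value() holds the node's name and that meta() returns the hashref described in the "FAQ". The column layout is only an approximation of what "print_cooked_tree()" prints.

```perl
#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

my($parser) = Regexp::Parsertron -> new;

die "Parse failed\n" if ($parser -> parse(re => qr/Perl|JavaScript/i) != 0);

my(@names);

for my $node ($parser -> tree -> traverse)
{
	next if ($node -> is_root); # The root node holds nothing useful.

	my($meta) = $node -> meta;

	push @names, $node -> value;

	say sprintf('%-20s %3s  %s', $node -> value, $$meta{uid}, $$meta{text});
}
```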

uid()

Returns the last-used uid.

Each node in the tree is given a uid, which allows methods like "append(%opts)" to work.

verbose([$integer])

Here, '[]' indicate an optional parameter.

Gets or sets the verbosity level, within the range 0 .. 2. Higher numbers print more progress reports.

Used basically for debugging.

Note: verbose is a parameter to "new([%opts])".

warning_str()

Returns the last Marpa warning.

In short, Marpa will always report 'Marpa parse exhausted' in warning_str() if the parse is not ambiguous, but do not worry - this is not an error.

See the "FAQ" entries "After calling parse(), warning_str() contains the string '... Parse ambiguous ...'" and "Is this a (Marpa) exhaustion-hating or exhaustion-loving app?".
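
One way to act on the warning is sketched below. The case-insensitive test for 'ambiguous' is an assumption based on the strings quoted in this document, not a documented contract.

```perl
#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

my($parser)  = Regexp::Parsertron -> new;
my($result)  = $parser -> parse(re => qr/Perl|JavaScript/i);
my($warning) = $parser -> warning_str;

if ($warning =~ /ambiguous/i)
{
	# Probably an error in the BNF. Worth reporting, with the regexp.

	say "Ambiguous parse: $warning";
}
else
{
	# An exhaustion message here is normal and is not an error.

	say "Result: $result. Warning: $warning";
}
```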

FAQ

How do I use this module?

Herewith a brief tutorial.

o Start with a simple program and a simple regexp

This code, scripts/tutorial.pl, is a cut-down version of scripts/synopsis.pl:

#!/usr/bin/env perl

use v5.10;
use strict;
use warnings;

use Regexp::Parsertron;

# ---------------------

my($re)		= qr/Perl|JavaScript/i;
my($parser)	= Regexp::Parsertron -> new(verbose => 1);

# Return 0 for success and 1 for failure.

my($result) = $parser -> parse(re => $re);

say "Original:  $re. Result: $result. (0 is success)";

Running it outputs:

Test count: 1. Parsing (in qr/.../ form): '(?^i:Perl|JavaScript)'.
Root. Attributes: {text => "Root", uid => "0"}
    |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
    |    |--- question_mark. Attributes: {text => "?", uid => "2"}
    |    |--- caret. Attributes: {text => "^", uid => "3"}
    |    |--- flag_set. Attributes: {text => "i", uid => "4"}
    |    |--- colon. Attributes: {text => ":", uid => "5"}
    |    |--- character_set. Attributes: {text => "Perl|JavaScript", uid => "6"}
    |--- close_parenthesis. Attributes: {text => ")", uid => "7"}

Original:  (?^i:Perl|JavaScript). Result: 0. (0 is success)

o Examine the tree and determine which nodes you wish to edit

The nodes are uniquely identified by their uids.

o Proceed as does scripts/synopsis.pl

Add these lines to the end of the tutorial code, and re-run:

$parser -> append(text => '|C++', uid => 6);
$parser -> print_raw_tree;

The extra output, showing node uid == 6, is:

Root. Attributes: {text => "Root", uid => "0"}
    |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
    |    |--- question_mark. Attributes: {text => "?", uid => "2"}
    |    |--- caret. Attributes: {text => "^", uid => "3"}
    |    |--- flag_set. Attributes: {text => "i", uid => "4"}
    |    |--- colon. Attributes: {text => ":", uid => "5"}
    |    |--- character_set. Attributes: {text => "Perl|JavaScript|C++", uid => "6"}
    |--- close_parenthesis. Attributes: {text => ")", uid => "7"}

o Test also with "prepend(%opts)" and "set(%opts)"

See t/get.set.t for sample code.

o Since everything works, make a cup of tea

Does this module ever use \Q...\E to quote regexp metacharacters?

No.

What is the format of the nodes in the tree built by this module?

Each node's name is the name of the Marpa-style event which was triggered by detection of some text within the regexp.

Each node's meta() method returns a hashref with these (key => value) pairs:

o text => $string

This is the text within the regexp which triggered the event just mentioned.

o uid => $integer

This is the unique id of the 'current' node.

This uid is often used by you to specify which node to work on.

See also the source code for "print_cooked_tree()" and "print_raw_tree()" for ideas on how to use the tree.

See the "Synopsis" for sample code and a report after parsing a tiny regexp.

Does the root node in the tree ever hold useful information?

No. Always ignore it.

Does this module interpret regexps in any way?

No. You have to run your own Perl code to do that. This module just parses them into a data structure.

And that really means this module does not match the regexp against anything. If I appear to do that while debugging new code, you can't rely on that appearing in production versions of the module.

Does this module re-write regexps?

No, unless you call one of "The Edit Methods".

Does this module handle both Perl 5 and Perl 6?

No. It will only handle Perl 5 syntax.

Does this module handle regexps for various versions of Perl 5?

Not yet. Version-dependent regexp syntax will be supported for recent versions of Perl. This will be done by having tokens within the BNF which are replaced at start-up time with version-dependent details.

There are no such tokens at the moment.

All debugging is done assuming the regexp syntax as documented online. See "References" for the urls in question.

So which version of Perl is supported?

I'm (2018-01-14) using Perl V 5.20.2 and making the BNF match the Perl regexp docs listed in "References" below.

After calling parse(), warning_str() contains the string '... Parse ambiguous ...'

This is almost certainly an error in the BNF, although of course it may be an error caused by an exceptionally badly formed regexp.

Report it via https://rt.cpan.org/Public/Dist/Display.html?Name=Regexp-Parsertron, and please include the regexp in the report. Thanx!

Is this a (Marpa) exhaustion-hating or exhaustion-loving app?

Exhaustion-loving.

See https://metacpan.org/pod/distribution/Marpa-R2/pod/Exhaustion.pod#Exhaustion

Will this code be modified to run under Marpa::R3 when the latter is stable?

Yes.

What is the purpose of this module?

o To provide a stand-alone parser for regexps
o To help me learn more about regexps
o To become, I hope, a replacement for the horrendously complex Regexp::Assemble

Scripts

This diagram indicates the flow of logic from script to script:

xt/author/re_tests
|
V
xt/author/generate.tests.pl
|
V
xt/author/perl-5.21.11.tests
|
V
perl -Ilib t/perl-5.21.11.t > xt/author/perl-5.21.11.log 2>&1

If xt/author/perl-5.21.11.log only contains lines starting with 'ok', then all Perl and Marpa errors have been hidden, so t/perl-5.21.11.t is ready to live in t/. Before that time it lives in xt/author/.
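
The check just described can be sketched in a few lines of core Perl. Both unexpected_lines() and the sample log lines are hypothetical. A real run would read xt/author/perl-5.21.11.log instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines which do not start with 'ok'.

sub unexpected_lines
{
	my(@lines) = @_;

	return grep{! /^ok\b/} @lines;
}

# Hypothetical sample log contents.

my(@log) = ("ok 1 - first\n", "ok 2 - second\n", "not ok 3 - third\n");
my(@bad) = unexpected_lines(@log);

print @bad ? "Not ready. Offending lines:\n" . join('', @bad) : "Ready for t/\n";
```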

TODO

o How to best define 'code' in the BNF.
o Things to be aware of:
    o Regexps of the form: /.../aa
    o Pragmas of the form: use re '/aa'; ...
o I could traverse the tree and store a pointer to each node in an array

This would mean fast access to nodes in random order.

References

http://www.pcre.org/. PCRE - Perl Compatible Regular Expressions.

http://perldoc.perl.org/perlre.html. This is the definitive document.

http://perldoc.perl.org/perlrecharclass.html#Extended-Bracketed-Character-Classes.

http://perldoc.perl.org/perlretut.html. Samples with commentary.

http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators

http://perldoc.perl.org/perlrequick.html

http://perldoc.perl.org/perlrebackslash.html

http://www.nntp.perl.org/group/perl.perl5.porters/2016/02/msg234642.html

https://code.activestate.com/lists/perl5-porters/209610/

https://stackoverflow.com/questions/46200305/a-strict-regular-expression-for-matching-chemical-formulae

See Also

Graph::Regexp

Regexp::Assemble

Regexp::Debugger

Regexp::ERE

Regexp::Keywords

Regexp::Lexer

Regexp::List

Regexp::Optimizer

Regexp::Parser

Regexp::SAR. This is vaguely a version of Set::FA::Element.

Regexp::Stringify

Regexp::Trie

And many others...

Machine-Readable Change Log

The file Changes was converted into Changelog.ini by Module::Metadata::Changes.

Version Numbers

Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.

Repository

https://github.com/ronsavage/Regexp-Parsertron

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=Regexp-Parsertron

Author

Regexp::Parsertron was written by Ron Savage <ron@savage.net.au> in 2011.

Marpa's homepage: http://savage.net.au/Marpa.html.

My homepage.

Copyright

Australian copyright (c) 2016, Ron Savage.

All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License 2.0, a copy of which is available at:
http://opensource.org/licenses/alphabetical.