NAME

re::engine::GNU - GNU Regular Expression Engine

VERSION

version 0.025

SYNOPSIS

use re::engine::GNU;
'test' =~ /\(tes\)t/ && print "ok 1\n";
'test' =~ [ 0, '\(tes\)t' ] && print "ok 2\n";
'test' =~ { syntax => 0, pattern => '\(tes\)t' } && print "ok 3\n";

DESCRIPTION

The GNU regular expression engine, plugged into Perl. The package can be loaded with "use" and accepts the following options:

-debug => boolean

E.g. use re::engine::GNU -debug => 1; # a true value prints debug information on stderr

-syntax => bitwise value

E.g. use re::engine::GNU -syntax => 0; # Default syntax. Useful for the // form.

Regular expressions can be written in three forms:

classic

e.g. qr/xxx/. The default syntax is then GNU Emacs.

array

e.g. [ syntax, 'xxx' ], where syntax is a bitwise value.

hash

e.g. { syntax => value, pattern => 'xxx' }, where value is a bitwise value, as in the array form.

The following convenience class variables are available for the syntax:

$re::engine::GNU::RE_SYNTAX_ED
$re::engine::GNU::RE_SYNTAX_EGREP
$re::engine::GNU::RE_SYNTAX_EMACS (default)
$re::engine::GNU::RE_SYNTAX_GNU_AWK
$re::engine::GNU::RE_SYNTAX_GREP
$re::engine::GNU::RE_SYNTAX_POSIX_AWK
$re::engine::GNU::RE_SYNTAX_POSIX_BASIC
$re::engine::GNU::RE_SYNTAX_POSIX_EGREP
$re::engine::GNU::RE_SYNTAX_POSIX_EXTENDED
$re::engine::GNU::RE_SYNTAX_POSIX_MINIMAL_BASIC
$re::engine::GNU::RE_SYNTAX_POSIX_MINIMAL_EXTENDED
$re::engine::GNU::RE_SYNTAX_SED
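
As an illustration of how the three forms and the syntax variables fit together, here is a minimal sketch. It assumes that the predefined syntax variables can be passed wherever a syntax value is expected, and that POSIX extended syntax is wanted for the array and hash forms; the variable names $ere, $classic, $array and $hash are only illustrative.

```perl
use strict;
use warnings;
use re::engine::GNU;

# Assumption: a predefined syntax variable can be used directly as the
# syntax value of the array and hash forms.
my $ere = $re::engine::GNU::RE_SYNTAX_POSIX_EXTENDED;

# Classic form: default (GNU Emacs) syntax, so groups are written \( ... \).
my $classic = ( 'test' =~ /\(tes\)t/ ) ? 1 : 0;

# Array form: POSIX extended syntax, so groups are written ( ... ).
my $array = ( 'test' =~ [ $ere, '(tes)t' ] ) ? 1 : 0;

# Hash form: same syntax and pattern as the array form.
my $hash = ( 'test' =~ { syntax => $ere, pattern => '(tes)t' } ) ? 1 : 0;

print "classic=$classic array=$array hash=$hash\n";
```

The same pattern text would have to be written differently under each syntax, which is why the array and hash forms carry the syntax next to the pattern.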

All the convenience class variables listed above are built from the following GNU internal bits, which you can also manipulate yourself to tune the syntax to your needs (the documentation is copied from the file regex.h distributed with this package):

$re::engine::GNU::RE_BACKSLASH_ESCAPE_IN_LISTS

If this bit is not set, then \ inside a bracket expression is literal. If set, then such a \ quotes the following character.

$re::engine::GNU::RE_BK_PLUS_QM

If this bit is not set, then + and ? are operators, and \+ and \? are literals. If set, then \+ and \? are operators and + and ? are literals.

$re::engine::GNU::RE_CHAR_CLASSES

If this bit is set, then character classes are supported. They are: [:alpha:], [:upper:], [:lower:], [:digit:], [:alnum:], [:xdigit:], [:space:], [:print:], [:punct:], [:graph:], and [:cntrl:]. If not set, then character classes are not supported.

$re::engine::GNU::RE_CONTEXT_INDEP_ANCHORS

If this bit is set, then ^ and $ are always anchors (outside bracket expressions, of course). If this bit is not set, then it depends:

^

is an anchor if it is at the beginning of a regular expression or after an open-group or an alternation operator;

$

is an anchor if it is at the end of a regular expression, or before a close-group or an alternation operator.

This bit could be (re)combined with RE_CONTEXT_INDEP_OPS, because POSIX draft 11.2 says that * etc. in leading positions is undefined. We already implemented a previous draft which made those constructs invalid, though, so we haven't changed the code back.

$re::engine::GNU::RE_CONTEXT_INDEP_OPS

If this bit is set, then special characters are always special regardless of where they are in the pattern. If this bit is not set, then special characters are special only in some contexts; otherwise they are ordinary. Specifically, * + ? and intervals are only special when not after the beginning, open-group, or alternation operator.

$re::engine::GNU::RE_CONTEXT_INVALID_OPS

If this bit is set, then *, +, ?, and { cannot be first in an re or immediately after an alternation or begin-group operator.

$re::engine::GNU::RE_DOT_NEWLINE

If this bit is set, then . matches newline. If not set, then it doesn't.

$re::engine::GNU::RE_DOT_NOT_NULL

If this bit is set, then . doesn't match NUL. If not set, then it does.

$re::engine::GNU::RE_HAT_LISTS_NOT_NEWLINE

If this bit is set, nonmatching lists [^...] do not match newline. If not set, they do.

$re::engine::GNU::RE_INTERVALS

If this bit is set, either \{...\} or {...} defines an interval, depending on RE_NO_BK_BRACES. If not set, \{, \}, {, and } are literals.

$re::engine::GNU::RE_LIMITED_OPS

If this bit is set, +, ? and | aren't recognized as operators. If not set, they are.

$re::engine::GNU::RE_NEWLINE_ALT

If this bit is set, newline is an alternation operator. If not set, newline is literal.

$re::engine::GNU::RE_NO_BK_BRACES

If this bit is set, then '{...}' defines an interval, and \{ and \} are literals. If not set, then '\{...\}' defines an interval.

$re::engine::GNU::RE_NO_BK_PARENS

If this bit is set, (...) defines a group, and \( and \) are literals. If not set, \(...\) defines a group, and ( and ) are literals.

$re::engine::GNU::RE_NO_BK_REFS

If this bit is set, then \<digit> matches <digit>. If not set, then \<digit> is a back-reference.

$re::engine::GNU::RE_NO_BK_VBAR

If this bit is set, then | is an alternation operator, and \| is literal. If not set, then \| is an alternation operator, and | is literal.

$re::engine::GNU::RE_NO_EMPTY_RANGES

If this bit is set, then an ending range point collating lower than the starting range point, as in [z-a], is invalid. If not set, then when the ending range point collates lower than the starting range point, the range is ignored.

$re::engine::GNU::RE_UNMATCHED_RIGHT_PAREN_ORD

If this bit is set, then an unmatched ) is ordinary. If not set, then an unmatched ) is invalid.

$re::engine::GNU::RE_NO_POSIX_BACKTRACKING

If this bit is set, succeed as soon as we match the whole pattern, without further backtracking.

$re::engine::GNU::RE_NO_GNU_OPS

If this bit is set, do not process the GNU regex operators. If not set, then the GNU regex operators are recognized.

$re::engine::GNU::RE_DEBUG

If this bit is set, turn on internal regex debugging. If not set, and debugging was on, turn it off. This only works if regex.c is compiled -DDEBUG. We define this bit always, so that all that's needed to turn on debugging is to recompile regex.c; the calling code can always have this bit set, and it won't affect anything in the normal case.

$re::engine::GNU::RE_INVALID_INTERVAL_ORD

If this bit is set, a syntactically invalid interval is treated as a string of ordinary characters. For example, the ERE 'a{1' is treated as 'a\{1'.

$re::engine::GNU::RE_ICASE

If this bit is set, then ignore case when matching. If not set, then case is significant.

$re::engine::GNU::RE_CARET_ANCHORS_HERE

This bit is used internally like RE_CONTEXT_INDEP_ANCHORS but only for ^, because it is difficult to scan the regex backwards to find whether ^ should be special.

$re::engine::GNU::RE_CONTEXT_INVALID_DUP

If this bit is set, then \{ cannot be first in a regex or immediately after an alternation, open-group or \} operator.

$re::engine::GNU::RE_NO_SUB

If this bit is set, then no_sub will be set to 1 during re_compile_pattern.

For details, please refer to the Gnulib documentation on regular expression syntaxes.
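
The bits described above can be combined with the bitwise OR operator to build a custom syntax. The following is a minimal sketch, assuming that an OR-ed value is accepted as the syntax of the hash form; the word list and the %result hash are purely illustrative.

```perl
use strict;
use warnings;
use re::engine::GNU;

# Start from no bits set and switch on only what is needed:
# unescaped ( ) for groups and unescaped | for alternation.
my $syntax = $re::engine::GNU::RE_NO_BK_PARENS
           | $re::engine::GNU::RE_NO_BK_VBAR;

my %result;
for my $word (qw(cats dogs birds)) {
    # Hash form: the custom syntax travels with the pattern.
    $result{$word} =
        ( $word =~ { syntax => $syntax, pattern => '(cat|dog)s' } )
        ? 'match'
        : 'no match';
    print "$word: $result{$word}\n";
}
```

Starting from one of the predefined syntax variables and adding or clearing individual bits is an alternative when only a small deviation from a standard syntax is needed.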

The following perl modifiers are supported and applied to the chosen syntax:

//m

Sets an internal flag so that newline acts as an anchor.

//s

Sets a bit in the syntax value so that "." can also match newline.

//i

Makes the regular expression case insensitive.

//p

Please refer to the perlvar section on the MATCH family of variables (${^PREMATCH}, ${^MATCH} and ${^POSTMATCH}).

The Perl modifier //x is explicitly dropped.
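
The supported modifiers can be tried out with the classic form. This is a minimal sketch under the default (GNU Emacs) syntax, following the descriptions above; the subject strings and variable names are illustrative only.

```perl
use strict;
use warnings;
use re::engine::GNU;

# //i: case is ignored when matching.
my $icase = ( 'TEST' =~ /\(tes\)t/i ) ? 1 : 0;

# //s: "." may also match a newline.
my $dotall = ( "a\nb" =~ /a.b/s ) ? 1 : 0;

# //m: newline acts as an anchor, so ^ can match right after it.
my $multiline = ( "a\nb" =~ /^b/m ) ? 1 : 0;

print "i=$icase s=$dotall m=$multiline\n";
```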

EXPORT

None by default.

NOTES

I18N

This uses the semantics of the perl with which this library is compiled.

Collation

Collating symbols and equivalence classes are not (yet) supported.

Execution and compilation semantics

The //msip Perl semantics are applied at compile time. Perl's localization, if any, always applies. GNU regex semantics are in effect for the rest; for instance, Perl's "last successful match" semantics do not apply here.

SEE ALSO

GNU Gnulib Regular expressions

perlre

AUTHOR

Jean-Damien Durand <jeandamiendurand@free.fr>

CONTRIBUTOR

Yves Orton <demerphq@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2015 by Jean-Damien Durand.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.