NAME
re::engine::GNU - GNU Regular Expression Engine
VERSION
version 0.025
SYNOPSIS
use re::engine::GNU;
'test' =~ /\(tes\)t/ && print "ok 1\n";
'test' =~ [ 0, '\(tes\)t' ] && print "ok 2\n";
'test' =~ { syntax => 0, pattern => '\(tes\)t' } && print "ok 3\n";
DESCRIPTION
The GNU regular expression engine, plugged into perl. The package can be loaded with "use" and accepts the following options:
- -debug => boolean
-
E.g. use re::engine::GNU -debug => 1; # a true value prints debug output on stderr
- -syntax => bitwise value
-
E.g. use re::engine::GNU -syntax => 0; # Default syntax. Useful for the // form.
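As an illustration of the -syntax option, the sketch below selects a non-default syntax for the // form. It assumes the module is installed and that the class variable $re::engine::GNU::RE_SYNTAX_POSIX_EXTENDED (one of the predefined syntaxes listed further down) is already populated when the import list is evaluated; treat it as an example under those assumptions, not as the reference usage.

```perl
use strict;
use warnings;

# Load the engine and request POSIX extended syntax for the // form.
# Assumption: the class variable is set as soon as the module is loaded,
# which happens before the import list is evaluated.
use re::engine::GNU -syntax => $re::engine::GNU::RE_SYNTAX_POSIX_EXTENDED;

# With POSIX extended syntax, unescaped parentheses are expected to
# form a group instead of matching literal "(" and ")".
print "matched\n" if 'test' =~ /(tes)t/;
```

With the default syntax (0), the same match would have to be written with escaped parentheses, as in the SYNOPSIS.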
Regular expressions can be written in three forms:
- classic
-
e.g. qr/xxx/. The default syntax is then GNU Emacs.
- array
-
e.g. [ syntax, 'xxx' ], where syntax is a bitwise value.
- hash
-
e.g. { syntax => value, pattern => 'xxx' }, where value is a bitwise value, as in the array form.
The following convenience class variables are available for the syntax:
- $re::engine::GNU::RE_SYNTAX_ED
- $re::engine::GNU::RE_SYNTAX_EGREP
- $re::engine::GNU::RE_SYNTAX_EMACS (default)
- $re::engine::GNU::RE_SYNTAX_GNU_AWK
- $re::engine::GNU::RE_SYNTAX_GREP
- $re::engine::GNU::RE_SYNTAX_POSIX_AWK
- $re::engine::GNU::RE_SYNTAX_POSIX_BASIC
- $re::engine::GNU::RE_SYNTAX_POSIX_EGREP
- $re::engine::GNU::RE_SYNTAX_POSIX_EXTENDED
- $re::engine::GNU::RE_SYNTAX_POSIX_MINIMAL_BASIC
- $re::engine::GNU::RE_SYNTAX_POSIX_MINIMAL_EXTENDED
- $re::engine::GNU::RE_SYNTAX_SED
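For example, the array and hash forms can take one of these class variables as their syntax value. The following is a minimal sketch, assuming the module is installed; the helper names matches_ere and matches_ere_hash are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use re::engine::GNU;

# Hypothetical helper: match $string against $pattern with the
# POSIX extended syntax, using the array form.
sub matches_ere {
    my ($string, $pattern) = @_;
    return $string =~ [ $re::engine::GNU::RE_SYNTAX_POSIX_EXTENDED, $pattern ] ? 1 : 0;
}

# Hypothetical helper: the same match, using the hash form.
sub matches_ere_hash {
    my ($string, $pattern) = @_;
    return $string =~ { syntax  => $re::engine::GNU::RE_SYNTAX_POSIX_EXTENDED,
                        pattern => $pattern } ? 1 : 0;
}

# In POSIX extended syntax the parentheses are expected to group.
print "array form matched\n" if matches_ere('test', '(tes)t');
print "hash form matched\n"  if matches_ere_hash('test', '(tes)t');
```

Because the syntax travels with each pattern, these two forms do not depend on the -syntax option given to "use".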
All the convenience class variables listed above are combinations of the following GNU internal bits, which you can also manipulate yourself to tune the syntax to your needs (documentation is copied verbatim from the file regex.h distributed with this package):
- $re::engine::GNU::RE_BACKSLASH_ESCAPE_IN_LISTS
-
If this bit is not set, then \ inside a bracket expression is literal. If set, then such a \ quotes the following character.
- $re::engine::GNU::RE_BK_PLUS_QM
-
If this bit is not set, then + and ? are operators, and \+ and \? are literals. If set, then \+ and \? are operators and + and ? are literals.
- $re::engine::GNU::RE_CHAR_CLASSES
-
If this bit is set, then character classes are supported. They are: [:alpha:], [:upper:], [:lower:], [:digit:], [:alnum:], [:xdigit:], [:space:], [:print:], [:punct:], [:graph:], and [:cntrl:]. If not set, then character classes are not supported.
- $re::engine::GNU::RE_CONTEXT_INDEP_ANCHORS
-
If this bit is set, then ^ and $ are always anchors (outside bracket expressions, of course). If this bit is not set, then it depends:
- ^
-
is an anchor if it is at the beginning of a regular expression or after an open-group or an alternation operator;
- $
-
is an anchor if it is at the end of a regular expression, or before a close-group or an alternation operator.
This bit could be (re)combined with RE_CONTEXT_INDEP_OPS, because POSIX draft 11.2 says that * etc. in leading positions is undefined. We already implemented a previous draft which made those constructs invalid, though, so we haven't changed the code back.
- $re::engine::GNU::RE_CONTEXT_INDEP_OPS
-
If this bit is set, then special characters are always special regardless of where they are in the pattern. If this bit is not set, then special characters are special only in some contexts; otherwise they are ordinary. Specifically, * + ? and intervals are only special when not after the beginning, open-group, or alternation operator.
- $re::engine::GNU::RE_CONTEXT_INVALID_OPS
-
If this bit is set, then *, +, ?, and { cannot be first in an re or immediately after an alternation or begin-group operator.
- $re::engine::GNU::RE_DOT_NEWLINE
-
If this bit is set, then . matches newline. If not set, then it doesn't.
- $re::engine::GNU::RE_DOT_NOT_NULL
-
If this bit is set, then . doesn't match NUL. If not set, then it does.
- $re::engine::GNU::RE_HAT_LISTS_NOT_NEWLINE
-
If this bit is set, nonmatching lists [^...] do not match newline. If not set, they do.
- $re::engine::GNU::RE_INTERVALS
-
If this bit is set, either \{...\} or {...} defines an interval, depending on RE_NO_BK_BRACES. If not set, \{, \}, {, and } are literals.
- $re::engine::GNU::RE_LIMITED_OPS
-
If this bit is set, +, ? and | aren't recognized as operators. If not set, they are.
- $re::engine::GNU::RE_NEWLINE_ALT
-
If this bit is set, newline is an alternation operator. If not set, newline is literal.
- $re::engine::GNU::RE_NO_BK_BRACES
-
If this bit is set, then '{...}' defines an interval, and \{ and \} are literals. If not set, then '\{...\}' defines an interval.
- $re::engine::GNU::RE_NO_BK_PARENS
-
If this bit is set, (...) defines a group, and \( and \) are literals. If not set, \(...\) defines a group, and ( and ) are literals.
- $re::engine::GNU::RE_NO_BK_REFS
-
If this bit is set, then \<digit> matches <digit>. If not set, then \<digit> is a back-reference.
- $re::engine::GNU::RE_NO_BK_VBAR
-
If this bit is set, then | is an alternation operator, and \| is literal. If not set, then \| is an alternation operator, and | is literal.
- $re::engine::GNU::RE_NO_EMPTY_RANGES
-
If this bit is set, then an ending range point collating higher than the starting range point, as in [z-a], is invalid. If not set, then when ending range point collates higher than the starting range point, the range is ignored.
- $re::engine::GNU::RE_UNMATCHED_RIGHT_PAREN_ORD
-
If this bit is set, then an unmatched ) is ordinary. If not set, then an unmatched ) is invalid.
- $re::engine::GNU::RE_NO_POSIX_BACKTRACKING
-
If this bit is set, succeed as soon as we match the whole pattern, without further backtracking.
- $re::engine::GNU::RE_NO_GNU_OPS
-
If this bit is set, do not process the GNU regex operators. If not set, then the GNU regex operators are recognized.
- $re::engine::GNU::RE_DEBUG
-
If this bit is set, turn on internal regex debugging. If not set, and debugging was on, turn it off. This only works if regex.c is compiled -DDEBUG. We define this bit always, so that all that's needed to turn on debugging is to recompile regex.c; the calling code can always have this bit set, and it won't affect anything in the normal case.
- $re::engine::GNU::RE_INVALID_INTERVAL_ORD
-
If this bit is set, a syntactically invalid interval is treated as a string of ordinary characters. For example, the ERE 'a{1' is treated as 'a\{1'.
- $re::engine::GNU::RE_ICASE
-
If this bit is set, then ignore case when matching. If not set, then case is significant.
- $re::engine::GNU::RE_CARET_ANCHORS_HERE
-
This bit is used internally like RE_CONTEXT_INDEP_ANCHORS but only for ^, because it is difficult to scan the regex backwards to find whether ^ should be special.
- $re::engine::GNU::RE_CONTEXT_INVALID_DUP
-
If this bit is set, then \{ cannot be first in a regex or immediately after an alternation, open-group or \} operator.
- $re::engine::GNU::RE_NO_SUB
-
If this bit is set, then no_sub will be set to 1 during re_compile_pattern.
Please refer to the Gnulib "Regular expression syntaxes" documentation.
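Since the syntax is a bit mask, individual bits can be combined with a predefined syntax using bitwise OR. The sketch below assumes the module is installed and that the class variables hold plain integers; the variable names $ere_icase and $emacs_parens are made up for the example.

```perl
use strict;
use warnings;
use re::engine::GNU;

# Start from a predefined syntax and OR in one extra bit:
# POSIX extended syntax, ignoring case.
my $ere_icase = $re::engine::GNU::RE_SYNTAX_POSIX_EXTENDED
              | $re::engine::GNU::RE_ICASE;

# Start from the default (Emacs) syntax and enable only
# unescaped grouping parentheses.
my $emacs_parens = $re::engine::GNU::RE_SYNTAX_EMACS
                 | $re::engine::GNU::RE_NO_BK_PARENS;

# Both matches are expected to succeed according to the bit
# descriptions above.
print "case-insensitive ERE matched\n" if 'TEST' =~ [ $ere_icase,    '(tes)t' ];
print "unescaped group matched\n"      if 'test' =~ [ $emacs_parens, '(tes)t' ];
```

Starting from a predefined syntax and adding bits is usually less error-prone than assembling a mask from scratch, because the predefined values already carry the combinations that GNU regex.h defines for each dialect.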
The following perl modifiers are supported and applied to the chosen syntax:
- //m
-
This sets an internal flag that makes newline act as an anchor.
- //s
-
This sets a bit in the syntax value so that "." can also match newline.
- //i
-
This makes the regular expression case-insensitive.
- //p
-
Please refer to the perlvar section about the MATCH family of variables.
The perl modifier //x is explicitly dropped.
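A short sketch of how these modifiers combine with the default syntax, assuming the module is installed. The patterns use the Emacs-style escaped parentheses from the SYNOPSIS.

```perl
use strict;
use warnings;
use re::engine::GNU;

# //i: the pattern is expected to match regardless of case.
print "//i matched\n" if 'TEST' =~ /\(tes\)t/i;

# //s: "." is expected to also match the newline between "a" and "b".
print "//s matched\n" if "a\nb" =~ /a.b/s;
```

Both modifiers are applied on top of whichever syntax is in effect, so they can also be used after selecting another syntax with the -syntax option.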
EXPORT
None by default.
NOTES
- I18N
-
This uses the perl semantics with which this library is compiled.
- Collation
-
Collating symbols and equivalence classes are not (yet) supported.
- Execution and compilation semantics
-
The //msip perl semantics are applied at compile time. Perl's localization, if any, always applies. GNU regex semantics are in effect for the rest; for instance, there is no "last successful match" perl semantic here.
SEE ALSO
GNU Gnulib Regular expressions
AUTHOR
Jean-Damien Durand <jeandamiendurand@free.fr>
CONTRIBUTOR
Yves Orton <demerphq@gmail.com>
COPYRIGHT AND LICENSE
This software is copyright (c) 2015 by Jean-Damien Durand.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.