NAME

Regexp::ERE - extended regular expressions and finite automata

SYNOPSIS

use Regexp::ERE qw(
    &ere_to_nfa
    &nfa_inter
    &nfa_to_regex
    &nfa_to_input_constraints
    &nfa_to_dfa
    &dfa_to_min_dfa
);

# condition 1: begins with abc or def
my $nfa1 = ere_to_nfa('^(abc|def)');

# condition 2: ends with 123 or 456
my $nfa2 = ere_to_nfa('(123|456)$');

# condition 1 and condition 2
my $inter_nfa = nfa_inter($nfa1, $nfa2);

# compute extended regular expression (string)
my $ere = nfa_to_regex($inter_nfa);

# compute perl regular expression
my $perlre = nfa_to_regex($inter_nfa, 1);

# compute weaker input constraints suitable for widgets
my ($input_constraints, $split_perlre)
  = nfa_to_input_constraints($inter_nfa);

# minimal dfa (simpler regular expression happens to result)
my $nfa3 = ere_to_nfa('^(a|ab|b)*$');
my $dfa3 = nfa_to_dfa($nfa3);
my $min_dfa3 = dfa_to_min_dfa($dfa3);
my $ere3 = nfa_to_regex($min_dfa3);

DESCRIPTION

Pure-perl module for:

  • Parsing POSIX Extended Regular Expressions ($ere) into Non-Deterministic Finite Automata ($nfa)

  • Manipulating $nfas (concatenating, or-ing, and-ing)

  • Computing Deterministic Finite Automata ($dfas) from $nfas (powerset construction)

  • Computing minimal $dfas from $dfas (Hopcroft's algorithm)

  • Computing $eres or Perl Regular Expressions from $nfa or $dfa (Warshall algorithm)

  • Heuristically deriving (possibly weaker) constraints from a $nfa or $dfa suitable for display in a graphical user interface, i.e. a sequence of widgets of type 'free text' and 'drop down';

    Example: '^(abc|def)' => $nfa => [['abc', 'def'], 'free text']

GLOSSARY AND CONVERSIONS OVERVIEW

Conversions overview

$ere -> $nfa -> $tree -> $regex  ($ere or $perlre)
                      -> $input_constraints

The second argument of -> $regex conversions is an optional boolean,
    true : conversion to a compiled perl regular expression
    false: conversion to an ere string

The -> $input_constraints conversions return a pair (
    $input_constraints: aref as described at tree_to_input_constraints()
    $split_perlre     : a compiled perl regular expression
)

Glossary

$char_class

A set of unicode characters.

$ere

Extended regular expression (string). See ere_to_nfa($ere) for the exact syntax.

$perlre

Perl regular expression

$nfa

Non-deterministic finite automaton

$dfa

Deterministic finite automaton (special case of $nfa)

$tree

Intermediate hierarchical representation of a regular expression (which still can be manipulated before stringification), similar to a parse tree (but used for generating, not for parsing).

$input_constraints

Ad-hoc data structure representing a list of gui-widgets (free text fields and drop-down lists), a helper for entering inputs conforming to a given $nfa.

DATA STRUCTURES AND SUBROUTINES

Each of the documented subroutines can be imported, for instance use ERE qw(&ere_to_nfa &nfa_match);.

Character class

WARNING: $char_classes must be created exclusively by char_to_cc() or interval_list_to_cc() for equivalent character classes to be always the same array reference. For the same reason, $char_classes must never be mutated.

In this implementation, the state transitions of a $nfa are based upon character classes (not single characters). A character class is an ordered list of disjunct, non-mergeable intervals (over unicode code points, i.e. positive integers).

$char_class = [
    [ $low_0, $high_0 ] # $interval_0
  , [ $low_1, $high_1 ] # $interval_1
  , ...
]

Constraints:

1:  0 <= $$char_class[$i][0]                          (0 <= low)
2:  $$char_class[$i][1] <= MAX_CHAR                   (high <= MAX_CHAR)
3:  $$char_class[$i][0] <= $$char_class[$i][1]        (low <= high)
4:  $$char_class[$i][1] + 1 <  $$char_class[$i+1][0]  (non mergeable)

Exceptions (anchors used only in the parsing phase only):

begin         : [ -2, -1 ]
end           : [ -3, -2 ]
begin or end  : [ -3, -1 ]

Immediately after parsing, such pseudo-character classes are removed by nfa_resolve_anchors().

char_to_cc($c)

Returns the unique $char_class equivalent to [[ord($c), ord($c)]].

interval_list_to_cc($interval_list)

$interval_list is an arbitrary list of intervals. Returns the unique $char_class whose reunion of intervals is the same set as the reunion of the intervals of $interval_list.

Example:

interval_list_to_cc([[102, 112], [65, 90], [97, 102], [113, 122]])
returns [[65, 90], [97, 122]]
(i.e [f-p]|[A-Z]|[a-f]|[q-z] => [A-Z]|[a-z])

Note that both $interval_list and $char_class are lists of intervals, but only $char_class obeys the constraints above, while $interval_list does not.

Remark also that interval_list_to_cc($char_class) is the identity (returns the same reference as given) on $char_classes returned by either interval_list_to_cc() or char_to_cc().

cc_union(@char_classes)

Returns the unique $char_class containing all characters of all given @char_classes.

Nfa

WARNING: nfa_xxx() routines are destructive, the $nfa references given as arguments will not be valid $nfa any more. Furthermore, the same $nfa reference must be used only once as argument. For instance, for concatenating a $nfa with itself, nfa_concat(nfa, nfa) does not work; instead, nfa_concat($nfa, nfa_clone($nfa)) must be used; or even nfa_concat(nfa_clone($nfa), nfa_clone($nfa) if the original $nfa is to be used further.

$nfa = [ $state_0, $state_1, ... ]

$state = [
    $accepting
  , $transitions
]

$transitions = [
    [ $char_class_0 => $state_ind_0 ]
  , [ $char_class_1 => $state_ind_1 ]
  , ...
]

In the same $transition, $state_ind_i are pairwise different and are valid indexes of @$nfa. There is exactly one initial state at index 0.

nfa_clone(@nfas)

Maps each of the given @nfas to a clone.

nfa_quant($in_nfa, $min, $max)

Precondition: 0 <= $min && ( $max eq '' || $min <= $max)

Returns $out_nfa, a $nfa computed from $in_nfa.

Let L be the language accepted by $in_nfa and M the language accepted by $out_nfa. Then a word m belongs to M if and only if and ordered list (l_1, ..., l_r) of words belonging to L exists such that:

$min <= r
and ($max eq '' or r <= $max)
and m is the concatenation of (l_1, ..., l_r)

Examples with $in_nfa being a $nfa accepting '^a$':

nfa_quant($in_nfa, 2, 4 ) accepts '^a{2,4}$'
nfa_quant($in_nfa, 0, '') accepts '^a{0,}$' (i.e. '^a*$')
nfa_concat(@in_nfas)

Returns $out_nfa, a $nfa computed from @in_nfas.

Let r be the number of given @in_nfas, L_i the language accepted by $in_nfas[$i] and M the language accepted by $out_nfa. Then a word m belongs to M if and only if an ordered list (l_1, ..., l_r) of words exists, l_i belonging to L_i, such that m is the concatenation of (l_1, ..., l_r).

nfa_union(@in_nfas)

Returns $out_nfa, a $nfa computed from @in_nfas.

$out_nfa accepts a word w if and only if at least one of @in_nfas accepts w.

nfa_inter(@in_nfas)

Returns $out_nfa, a $$nfa computed from @in_nfas.

$out_nfa accepts a word w if and only if each of @in_nfas accepts w.

nfa_match($in_nfa, $str)

Returns true if and only if $in_nfa accepts $str.

nfa_isomorph($nfa1, $nfa2)

Returns true if and only if the labeled graphs represented by $nfa1 and $nfa2 are isomorph. While isomorph $nfas accept the same language, the converse is not true.

nfa_to_dfa($in_nfa)

Compute a deterministic finite automaton from $in_nfa (powerset construction).

The data structure of a deterministic finite automaton (dfa) is the same as that of a non-deterministic one, but it is further constrained: For each state and each unicode character there exist exactly one transition (i.e. a pair (char_class, $state_index)) matching this character.

Note that the following constraint hold for both a $dfa and a $nfa: For each pair of state p1 and p2, there exists at most one transition from p1 to p2 (artefact of this implementation).

dfa_to_min_dfa($in_dfa)

Computes a minimal deterministic $dfa from the given $in_dfa (Hopcroft's algorithm).

Note that the given $in_dfa must be a $dfa, as returned from nfa_to_dfa(), and not a mere $nfa.

Myhill-Nerode theorem: two minimal dfa accepting the same language are isomorph (i.e. nfa_isomorph() returns true).

Tree

$tree = [ $star, [ $alt_0, $alt_1, ... ] ]
     or $char_class # ref($char_class) eq CHAR_CLASS
     or undef # accepting nothing
$alt = [ $tree_0, $tree_1, ... ]

A $tree is a hierarchical data structure used as intermediate form for regular expression generation routines.

Similar to a parse tree, except that the $trees described here are not the direct result of the parsing routines ere_to_xxx(); indeed, the parsing routines generate a $nfa, which then can be converted to a $tree.

A string is spanned by $tree = [$star, [ $alt_0, $alt_1, ... ] ] if it is spanned by one of the $alt_i (if $star is false) of a repetition thereof (if $star is true).

A string is spanned by $alt = [ $tree_0, $tree_1, ...] if it is the concatenation of @substrings, each $substrings[$i] being spanned by $$alt[$i].

nfa_to_tree($nfa)

Converts a $nfa to a $tree. Returns undef if the $nfa accepts nothing (not even the empty string).

tree_to_regex($tree, $to_perlre)

Converts a $tree to an $ere (if $to_perlre is false) or to a $perlre (if $to_perlre is true).

Input constraints

$input_constraints = [ $input_constraint_0, $input_constraint_1, ... ]
$input_constraint  = [ 'word_0', 'word_1', ... ]  (drop down)
                  or 'free_text'                  (free text)
tree_to_input_constraints($tree)

Converts a $tree to a pair ($input_constraints, $split_str).

$split_perlre is a compiled perl regular expression splitting a string according to $input_constraints. This $perlre matches if and only if each drop down can be assigned a value; then $str =~ $perlre in list context returns as many values as @$input_constraints.

Ere

An $ere is a perl string.

The syntax an $ere is assumed to follow is based on POSIX ERE (else the ere_to_xxx() routines will die()).

Unsupported POSIX features: back-references, equivalence classes [[=a=]], character class [[:digit:]], collating symbols [[.ch.]].

) is always a special character. POSIX says that ) is a normal character if there is no matching (.

There is no escape sequences such as \t for tab or \n for line feed. POSIX does not specify such escape sequences neither.

\ before a non-special character is ignored (except in bracket expressions). POSIX does not allow it.

The empty string is legal in alternations ((|a) is equivalent to (a?)). POSIX does not allow it. The (|a) form is generated by the xxx_to_ere() routines (avoiding quantifiers other than *).

[a-l-z] is interpreted as ([a-l] | - | z) (but it is discouraged to rely upon this implementation artefact). POSIX says that the interpretation of this construct is undefined.

In bracket expressions, \ is a normal character, thus ] as character must occur first, or second after a ^ (POSIX compliant, but possibly surprising for perl programmers).

All unicode characters supported by perl are allowed as litteral characters.

ere_to_nfa($ere)

Parses an $ere to a $nfa.

WARNING: the parsing routines, in particular ere_to_xxx(), die() on syntax errors; thus the caller may want to eval-trap such errors.

Shorthands

ere_to_tree($ere) := nfa_to_tree(ere_to_nfa($ere))
ere_to_regex($ere, $to_perlre) := tree_to_regex(ere_to_tree($ere), $to_perlre)
nfa_to_regex($nfa, $to_perlre) := tree_to_regex(nfa_to_tree($nfa), $to_perlre)
ere_to_input_constraints($ere) := tree_to_input_constraints(ere_to_tree($ere))
nfa_to_input_constraints($nfa) := tree_to_input_constraints(nfa_to_tree($nfa))
nfa_to_min_dfa($nfa) := dfa_to_min_dfa(nfa_to_dfa($nfa))

AUTHOR

Loïc Jonas Etienne <loic.etienne@tech.swisssign.com>

COPYRIGHT and LICENSE

Artistic License 2.0 http://www.perlfoundation.org/artistic_license_2_0