NAME
RE - A regular expression base class
SYNOPSIS
use RE;
use DFA;
my $re = RE->new();
$re->set_re('a|b|(hi)*');
my $dfa = $re->to_dfa();
print $dfa->info(); # see stuff on DFA
DESCRIPTION
This module implements a regular expression parser, and supports the conversion of a RE to a deterministic finite automata. A homegrown recursive descent parser is used to build the parse tree, and the method used to conver the regular expression to a DFA uses no intermediate NFA.
Recursive Descent-safe Regex Grammar:
R -> O
O -> CO'
O' -> '|' CO' | epsilon
C -> SC'
C' -> .SC' | epsilon
S -> LS'
S' -> *S' | epsilon
L -> a | b | c |..| 0 | 1 | 2 |..| (R) | epsilon
Terminal symbols: a,b,c,..,z,0,1,2,..,9,|,*,(,)
NOTE: Concatenation operator, '.', is not a terminal symbol
and should not be included in the regex
FAQ:
Q: Does this support Perl regular expressions?
A: No, just the regular expression using the terminal symbols
listed above.
Valid terminal characters include:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9 + - = , ? & [ ] { } . ~ ^ @ % $
: ; < >
Methods
new
-
Create a brand spaking new RE object; does not accept regex here
set_epsilon
-
Not really for public consumption, but could be in the future. Defines how epsilon (i.e., the null string) is represented in the parse tree.
get_epsilon_symbol
-
Returns the string representation of the null string
set_re
-
Defines a regular expression for RE to parse through
get_re
-
Returns the regular expression set by
set_re
as a string set_current
-
Meant for private consumption. Initializes the stack used to store what part of the regex has yet to be parsed
reset_current
-
Meant for private consumption. Initializes the stack used to store what part of the regex has yet to be parsed
get_current
-
Returns what is in the _CURRENT_STR stack
minimize
-
Minimizes the regular expression optimally
shrink
-
Shrinks RE using hueristics to create a 'small enough' equivalent expression
to_nfa
-
Not implemented, but will create an NFA using Thompson's Method; from there one could do a NFA->to_dfa, and compare the resulting dfa to the one from RE->to_dfa.
thompson
-
Guts of RE->to_nfa; uses depth first parse tree traversal
to_dfa
-
Currently BREAKS on a*(b|cd)m!!!!!!!!!!!!!!
Main driver which is used to convert the regular expression to a DFA; calls RE->parse() internally if _PARSE_TREE is !defined, so no need to call before this function.
get_transitions_on
-
Provides a way to get the transitions using the contents of the _FOLLOW_POS table
move
-
Called by RE->to_re to get the sub set of new states for each sub set of states during the sub state construction process of building the DFA
firstpos
-
Determines firt positions for all nodes in the a parse tree
lastpost
-
Determines the last postition for all nodes in the parse tree
followpos
-
Determines the first postition for all nodes in the parse tree
followpos
-
Returns hash containing follow position table
parse
-
Parses regular expressin set by RE->set_re, and stores the parse tree in _PARSE_TREE
cat_endmarker
-
Adds '#', or the end of regex marker to the parse tree; not for public consumption
match
-
Matches current terminal symbol with terminal character, and loads up the next lookahead character
lookahead
-
Returns value of current lookahead
nexttoken
-
Sets next token as lookahead
R
-
R -> O
O
-
O -> CO'
O_prime
-
O' -> '|'CO' | epsilon
C
-
C -> SC'
C_prime
-
C' -> .SC' | epsilon
S
-
S -> LS'
S_prime
-
S' -> *S' | epsilon
L
-
L -> a | b | c |..| 0 | 1 | 2 |..| (R)
get_next_pos
-
Returns the next position, used in creating leaf nodes for terminal symbols (minus null string)
get_curr_pos
-
Returns the current count of terminal symbols (minus null string)
set_parse_tree
-
Set parse tree
get_parse_tree
-
Return parse tree
get_terminals
-
Returns array of terminal symbols
is_terminal
-
Checks to see if given character is a terminal symbol
is_member
-
General subroutine used to test if an element is already in an array
get_symbols
-
Returns array of all symbols used in current regex
trace_on
-
Turns on tracing - allows to trace the recursive descent parsing
trace_off
-
Turns tracing off
trace
-
Returns value of _TRACE
toggle_cat_state
-
Toggles cat state instead of cat'ing a '.' to everything
get_cat_state
-
Returns $self->{_CAT_STATE} (1|0)
set_error
-
Increments error count for regex parsing
get_error
-
Returns error count
set_done
-
Sets done flag
done
-
Returns if done or not
DESTROY
-
Called automatically when object is destroyed either explicitly or automatically when references go to 0 or when the main program is finished
AUTHOR
Brett D. Estrade - <estrabd AT mailcan DOT com>
CAVEATS
Currently, all states are stored as labels. There is also no integrity checking for consistency among the start, final, and set of all states.
BUGS
Not saying it is bug free, just saying I haven't hit any yet :)
AVAILABILITY
Anonymous CVS Checkout at http://www.brettsbsd.net/cgi-bin/viewcvs.cgi/
ACKNOWLEDGEMENTS
This suite of modules started off as a homework assignment for a compiler class I took for my MS in computer science at the University of Southern Mississippi.
COPYRIGHT
This code is released under the same terms as Perl.