NAME
Parse::Lex - Generator of lexical analyzers (Beta 2.01).
SYNOPSIS
  require 5.004;
  use Parse::Lex;

  @token = (
    qw(
       ADDOP    [-+]
       LEFTP    [\(]
       RIGHTP   [\)]
       INTEGER  [1-9][0-9]*
       NEWLINE  \n
      ),
    qw(STRING), [qw(" (?:[^"]+|"")* ")],
    qw(ERROR  .*), sub {
      die qq!can't analyze: "$_[1]"!;
    }
  );

  Parse::Lex->trace;  # Class method
  $lexer = Parse::Lex->new(@token);
  $lexer->from(\*DATA);
  print "Tokenization of DATA:\n";

  TOKEN: while (1) {
    $token = $lexer->next;
    if (not $lexer->eoi) {
      print "Line $.\t";
      print "Type: ", $token->name, "\t";
      print "Content:->", $token->text, "<-\n";
    } else {
      last TOKEN;
    }
  }

  __END__
  1+2-5
  "a multiline
  string with an embedded "" in it"
  an invalid string with a "" in it"
DESCRIPTION
The Parse::Lex class creates lexical analyzers. A lexical analyzer is specified by means of a list of tokens passed as arguments to the new() method.
Parse::Lex works only with Perl 5.004 or higher. If you have an earlier version, use the Parse::CLex subclass. The analyzers generated by these two classes use different analysis techniques:

1. Parse::Lex uses pos() together with \G,

2. Parse::CLex uses s/// and thus consumes the stream of characters to be analyzed.

Analyzers of the Parse::CLex class do not allow the use of anchoring in regular expressions.
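The two techniques can be sketched in plain Perl (no Parse::Lex required; the token names are purely illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Technique 1 (Parse::Lex): anchor each match with \G at pos();
# the buffer itself is never modified, only the position advances.
my $buf = "1+2-5";
my @t1;
while ($buf =~ /\G(?:(\d+)|([-+]))/gc) {
    push @t1, defined $1 ? "INT:$1" : "OP:$2";
}

# Technique 2 (Parse::CLex): s/// deletes each recognized token,
# consuming the stream of characters as it goes.
my $copy = "1+2-5";
my @t2;
while ($copy =~ s/\A(?:(\d+)|([-+]))//) {
    push @t2, defined $1 ? "INT:$1" : "OP:$2";
}

print "@t1\n";   # INT:1 OP:+ INT:2 OP:- INT:5
print "@t2\n";   # same tokens, but $copy has been consumed (now empty)
```

Because the s/// variant rebinds the start of the string at every step, user patterns must not carry their own anchors, which is why the Parse::CLex class disallows anchoring.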
Tokens are objects of the Parse::Token class, which comes with Parse::Lex. The definition of a token usually comprises two arguments: a symbolic name (like INTEGER), followed by a regular expression. If a sub ref (anonymous subroutine) is given as third argument, it is called when the token is recognized. Its arguments are the Parse::Token object and the string recognized by the regular expression. The anonymous subroutine's return value is used as the new string contents of the Parse::Token object.

The order in which the lexical analyzer examines the regular expressions is determined by the order in which these expressions are passed as arguments to the new() method. The token returned by the lexical analyzer corresponds to the first regular expression which matches the beginning of the stream of characters to be analyzed (this strategy is different from that adopted by the lexical analyzer Lex, which returns the longest match possible out of all that can be recognized). The token is an object of the Parse::Token class.
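This first-match (rather than longest-match) strategy can be seen with plain Perl regexps; the rule names below are illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The first rule listed wins, even if a later rule would match more text.
my @rules = (
    [ 'DIGIT',  qr/\d/  ],   # tried first: matches a single digit
    [ 'NUMBER', qr/\d+/ ],   # would match "123", but is tried second
);

my $input = "123";
my ($matched, $text);
for my $rule (@rules) {
    my ($name, $re) = @$rule;
    if ($input =~ /\A($re)/) {
        ($matched, $text) = ($name, $1);
        last;   # first match wins: no attempt to find a longer one
    }
}
print "$matched matched '$text'\n";   # DIGIT matched '1'
```

A longest-match analyzer such as lex would return NUMBER with the text "123"; with Parse::Lex, declaring NUMBER before DIGIT gives that result.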
The lexical analyzer can recognize tokens which span multiple records. If the definition of the token comprises more than one regular expression (placed within a reference to an anonymous array), the analyzer reads as many records as required to recognize the token (see the documentation for the Parse::Token class). When the start pattern is found, the analyzer looks for the end, and if necessary, reads more records. No backtracking is done in case of failure.

The analyzer can be used to analyze an isolated character string or a stream of data coming from a filehandle. At the end of the input data the analyzer returns a Token object equal to Token::EOI (End Of Input).
Start Conditions
You can associate start conditions with the rules for recognizing the tokens that comprise your lexical analyzer (this is similar to what Flex provides). When start conditions are used, the rule which succeeds is no longer necessarily the first rule matching the input stream.
A token symbol may be preceded by a start condition specifier for the associated recognition rule. For example:
  qw(C1:TERMINAL_1 REGEXP), sub { },  # associated action
  qw(TERMINAL_2 REGEXP),    sub { },  # associated action
Symbol TERMINAL_1 will be recognized only if start condition C1 is active. Start conditions are activated/deactivated using the start(CONDITION_NAME) and end(CONDITION_NAME) methods. start('INITIAL') resets the analysis automaton.
Start conditions can be combined using AND/OR operators as follows:

  C1:SYMBOL       condition C1
  C1:C2:SYMBOL    condition C1 AND condition C2
  C1,C2:SYMBOL    condition C1 OR condition C2
There are two types of start conditions: inclusive and exclusive, which are declared by the class methods inclusive() and exclusive() respectively. With an inclusive start condition, all rules are active regardless of whether or not they are qualified with the start condition. With an exclusive start condition, only the rules qualified with the start condition are active; all other rules are deactivated.
Example (borrowed from the documentation of Flex):
  use Parse::Lex;

  @token = (
    'EXPECT', 'expect-floats', sub {
      $lexer->start('expect');
      $_[1];
    },
    'expect:FLOAT', '\d+\.\d+', sub {
      print "found a float: $_[1]\n";
      $_[1];
    },
    'expect:NEWLINE', '\n', sub {
      $lexer->end('expect');
      $_[1];
    },
    'NEWLINE2', '\n',
    'INT', '\d+', sub {
      print "found an integer: $_[1]\n";
      $_[1];
    },
    'DOT', '\.', sub {
      print "found a dot\n";
      $_[1];
    },
  );

  Parse::Lex->exclusive('expect');
  $lexer = Parse::Lex->new(@token);
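How an exclusive condition gates rule selection can be sketched in plain Perl (this mimics the behavior only; it is not how Parse::Lex is implemented):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %active;    # currently active exclusive conditions, e.g. after start('expect')

# Each rule optionally names the condition under which it applies.
my @rules = (
    { cond => 'expect', name => 'FLOAT', re => qr/\d+\.\d+/ },
    { cond => undef,    name => 'INT',   re => qr/\d+/      },
);

sub match_first {
    my ($input) = @_;
    for my $r (@rules) {
        if (%active) {
            # exclusive condition active: only rules qualified with it apply
            next unless $r->{cond} && $active{ $r->{cond} };
        } else {
            # no condition active: only unqualified rules apply
            next if $r->{cond};
        }
        return $r->{name} if $input =~ /\A$r->{re}/;
    }
    return 'NONE';
}

print match_first("3.14"), "\n";   # INT  (only the leading "3" matches)
$active{expect} = 1;               # like $lexer->start('expect')
print match_first("3.14"), "\n";   # FLOAT
```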
Methods
- analyze EXPR

Analyzes EXPR and returns a list of pairs consisting of a token name followed by recognized text. EXPR can be a character string or a reference to a filehandle.

Examples:

  @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze("3+3+3");
  @tokens = Parse::Lex->new(qw(PLUS [+] NUMBER \d+))->analyze(\*STREAM);
- buffer EXPR
- buffer

Returns the contents of the internal buffer of the lexical analyzer. With an expression as argument, places the result of the expression in the buffer.

It is not advisable to directly change the contents of the buffer without also adjusting the position of the analysis pointer (pos()) and the length of the buffer (length()).

- end EXPR

Deactivates condition EXPR.
- eoi

Returns TRUE when there is no more data to analyze.
- every SUB

Avoids having to write a reading loop in order to analyze a stream of data. SUB is an anonymous subroutine executed after the recognition of each token. For example, to lex the string "1+2" you can write:

  use Parse::Lex;

  $lexer = Parse::Lex->new(
    qw(
       ADDOP   [-+]
       INTEGER \d+
      ));

  $lexer->from("1+2");
  $lexer->every(sub {
    print $_[0]->name, "\t";
    print $_[0]->text, "\n";
  });

The first argument of the anonymous subroutine is the Token object.
- exclusive LIST

Class method declaring the conditions present in LIST to be exclusive.
- flush

If saving of the consumed strings is activated, flush() returns and clears the buffer containing the character strings recognized up to now. This is only useful if hold() has been called to activate saving of consumed strings.

- from EXPR
- from

from(EXPR) allows specifying the source of the data to be analyzed. The argument of this method can be a string (or list of strings), or a reference to a filehandle. If no argument is given, from() returns the filehandle if defined, or undef if input is a string. When an argument EXPR is used, the return value is the calling lexer object itself.

By default it is assumed that data are read from STDIN.

Examples:

  $lexer->from(\*DATA);
  $lexer->from('the data to be analyzed');
- getSub

getSub returns an anonymous subroutine that performs the lexical analysis [the equivalent of yylex].

Example:

  my $token = '';
  my $sub = $lexer->getSub;
  while (($token = &$sub()) ne $Token::EOI) {
    print $token->name, "\t";
    print $token->text, "\n";
  }

  # or

  my $token = '';
  local *tokenizer = $lexer->getSub;
  while (($token = tokenizer()) ne $Token::EOI) {
    print $token->name, "\t";
    print $token->text, "\n";
  }
- getToken

Same as the token() method.

- hold EXPR
- hold

Activates/deactivates saving of the consumed strings. The return value is the current setting (TRUE or FALSE). Can be used as a class method.

You can obtain the contents of the buffer using the flush method, which also empties the buffer.

- inclusive LIST

Class method declaring the conditions present in LIST to be inclusive.
- length EXPR
- length

length() returns the length of the current record. length EXPR sets the length of the current record.

- line EXPR
- line

line() returns the line number of the current record. line EXPR sets the value of the line number. Always returns 1 if a character string is being analyzed. The readline() method increments the line number.

- name EXPR
- name

name EXPR lets you give a name to the lexical analyzer. name() returns the value of this name.

- next
Causes searching for the next token. Returns the recognized Token object. Returns the Token::EOI object at the end of the data.

Examples:

  $lexer = Parse::Lex->new(@token);
  print $lexer->next->name;  # print the token type
  print $lexer->next->text;  # print the token content
- new LIST

Creates and returns a new lexical analyzer. The argument of the method is a list of Parse::Token objects, or a list of triplets permitting their creation. The triplets consist of: the symbolic name of the token, the regular expression necessary for its recognition, and possibly an anonymous subroutine that is called when the token is recognized. For each triplet, an object of type Parse::Token is created in the calling package.

- offset

Returns the number of characters already consumed since the beginning of the analyzed data stream.
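The difference between offset() (stream-wide) and pos() (within the current record) can be sketched in plain Perl; the records and token patterns here are illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# offset counts characters consumed since the beginning of the stream;
# pos() counts only within the current record.
my @records = ("1+2\n", "3+4\n");
my $offset  = 0;    # characters consumed before the current record
my @seen;           # [ token text, stream offset of its start, pos in record ]

for my $rec (@records) {
    while ($rec =~ /\G(\d+|[-+]|\n)/gc) {
        push @seen, [ $1, $offset + pos($rec) - length($1), pos($rec) ];
    }
    $offset += length $rec;
}

# The token "3" starts at stream offset 4, yet pos() within its record is 1.
my ($three) = grep { $_->[0] eq '3' } @seen;
printf "token %s: stream offset %d, pos in record %d\n", @$three;
```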
- pos EXPR
- pos

pos EXPR sets the position of the beginning of the next token to be recognized in the current line (this doesn't work with analyzers of the Parse::CLex class). pos() returns the number of characters already consumed in the current line.

- readline

Reads data from the input specified by the from() method. Returns the result of the reading.

Example:

  use Parse::Lex;

  print STDERR "read and print one line\n";
  $lexer = Parse::Lex->new();
  while (not $lexer->eoi) {
    print $lexer->readline();
  }
- reset

Clears the internal buffer of the lexical analyzer and erases all tokens already recognized.

- restart

Reinitializes the analysis automaton. The only active condition becomes the condition INITIAL.

- setToken TOKEN

Sets the token to TOKEN. Useful to requalify a token inside the anonymous subroutine associated with this token.

- skip EXPR
- skip

EXPR is a regular expression defining the token separator pattern (by default [ \t]+). skip('') sets this to no pattern. With no argument, skip() returns the value of the pattern. skip() can be used as a class method.

Changing the skip pattern causes recompilation of the lexical analyzer.
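The effect of the separator pattern can be sketched with \G scanning in plain Perl (the default pattern [ \t]+ is assumed; the token patterns are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Before each token, optionally consume the separator pattern [ \t]+.
my $skip = qr/[ \t]+/;
my $buf  = "1 +\t2";
my @tokens;
while ($buf =~ /\G(?:$skip)?(\d+|[-+])/gc) {
    push @tokens, $1;
}
print "@tokens\n";   # 1 + 2
```

With skip('') no separator would be consumed, so the scan above would stop at the first blank.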
- start EXPR

Activates condition EXPR.

- state EXPR

Returns the state of the condition represented by EXPR.

- token

Returns the object corresponding to the last recognized token. In case no token was recognized, returns the special token named DEFAULT.

- tokenClass EXPR
- tokenClass

Indicates the class of the tokens to be created from the list passed as argument to the new() method. If no argument is given, returns the name of the class. By default the class is Parse::Token.

- trace OUTPUT
- trace

Class method which activates trace mode. The activation of trace mode must take place before the creation of the lexical analyzer. The mode can then be deactivated by another call of this method. OUTPUT can be a file name or a reference to a filehandle where the trace will be directed.
EXAMPLES
ctokenizer.pl - Scan a stream of data using the Parse::CLex class.

tokenizer.pl - Scan a stream of data using the Parse::Lex class.

every.pl - Use of the every method.
BUGS
Parse::Lex works only with Perl 5.004 or higher. If your version of Perl is earlier, use the Parse::CLex class.

Analyzers of the Parse::CLex class do not allow the use of regular expressions with anchoring.
SEE ALSO
The Parse::YYLex class, which interfaces Parse::Lex with the byacc parser.
AUTHOR
Philippe Verdret. Documentation translated to English by Vladimir Alexiev and Ocrat.
ACKNOWLEDGMENTS
Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has significantly contributed to improving this documentation.
REFERENCES
Friedl, J.E.F. - Mastering Regular Expressions. O'Reilly & Associates, 1996.

Mason, T. & Brown, D. - Lex & Yacc. O'Reilly & Associates, 1990.

FLEX - A scanner generator (available at ftp://ftp.ee.lbl.gov/ and elsewhere).
COPYRIGHT
Copyright (c) 1995-1998 Philippe Verdret. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.