NAME
Parse::Token - Definition of tokens used by Parse::Lex
SYNOPSIS
    require 5.005;

    use Parse::Lex;
    @token = qw(
        ADDOP    [-+]
        INTEGER  [1-9][0-9]*
    );

    $lexer = Parse::Lex->new(@token);
    $lexer->from(\*DATA);

    $content = $INTEGER->next;
    if ($INTEGER->status) {
        print "$content\n";
    }
    $content = $ADDOP->next;
    if ($ADDOP->status) {
        print "$content\n";
    }
    if ($INTEGER->isnext(\$content)) {
        print "$content\n";
    }
    __END__
    1+2
DESCRIPTION
The Token package defines the lexemes used by Parse::Lex or Parse::CLex. The Lex::new() method of the Parse::Lex package indirectly creates a Parse::Token instance for each recognized lexeme. The next and isnext methods of the Token package make it easy to interface the lexical analyzer with a recursive-descent syntactic analyzer (parser). For interfacing with byacc, see the Parse::YYLex package.
This package is included indirectly via use Parse::Lex.
Methods
- action
-
Returns the anonymous subroutine defined within the Parse::Token object.
- factory LIST
-
Creates a list of Parse::Token objects from a list of token specifications. The list can also include objects of class Parse::Token or of a class derived from it. Can be used as a class method or instance method.
The factory(LIST) method can be used to create a set of tokens which are not within the analysis automaton. This method carries out two operations: 1) it creates the objects based on the specifications given in LIST (see the new() method), and 2) it imports the created objects into the calling package.
You could for example write:
    %keywords = qw(
        PROC   undef
        FUNC   undef
        RETURN undef
        IF     undef
        ELSE   undef
        WHILE  undef
        PRINT  undef
        READ   undef
    );
    Parse::Token->factory(%keywords);
and install these tokens in a symbol table in the following manner:
    foreach $name (keys %keywords) {
        $symbol{"\L$name"} = [${$name}, ''];
    }
${$name} is the Parse::Token object.
During the lexical analysis phase, you can use the tokens in the following manner:
    qw(IDENT [a-zA-Z][a-zA-Z0-9]*), sub {
        $symbol{$_[1]} = [] unless defined $symbol{$_[1]};
        my $type = $symbol{$_[1]}[0];
        $lexer->setToken((not defined $type) ? $VAR : $type);
        $_[1];  # THE TOKEN TEXT
    }
This treats any symbol of unknown type as a variable.
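The lookup logic in the IDENT action can be tried in isolation. The following is a minimal sketch that uses only core Perl: plain strings stand in for the Parse::Token objects that factory() would import, and classify() is a hypothetical helper that mirrors the body of the anonymous subroutine without calling setToken. None of these names belong to Parse::Lex.

```perl
use strict;
use warnings;

# Hypothetical stand-ins: plain strings play the role of the Parse::Token
# objects that factory() would import ($PROC, $IF, ...) and of $VAR.
my %keywords = map { $_ => "TOKEN($_)" } qw(PROC FUNC RETURN IF ELSE WHILE PRINT READ);
my $VAR = 'TOKEN(VAR)';

# Symbol table keyed by the lowercased keyword name, as in the foreach loop.
my %symbol;
$symbol{lc $_} = [ $keywords{$_}, '' ] for keys %keywords;

# classify() is a hypothetical helper mirroring the IDENT action:
# unknown symbols get an empty entry and are reported as variables.
sub classify {
    my ($text) = @_;
    $symbol{$text} = [] unless defined $symbol{$text};
    my $type = $symbol{$text}[0];
    return defined $type ? $type : $VAR;
}

print classify('while'), "\n";   # a keyword: TOKEN(WHILE)
print classify('count'), "\n";   # not a keyword: TOKEN(VAR)
```

Note that looking up an unknown symbol also records it in the table, so later occurrences of the same identifier find an existing (still untyped) entry.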
In the IDENT definition above, $_[1] corresponds to the text recognized by the regular expression. This text is what is returned by the anonymous subroutine.
- get EXPR
-
get obtains the value of the attribute named by the result of evaluating EXPR. You can also use the name of the attribute as a method name.
- getText
-
Returns the character string that was recognized by means of this Parse::Token object. Same as the text() method.
- isnext EXPR
- isnext
-
Returns the status of the token. The consumed string is put into EXPR if it is a reference to a scalar.
- name
-
Returns the symbolic name of the Parse::Token object.
- next
-
Activates the search for the lexeme defined by the regular expression contained in the object. If this lexeme is recognized in the character stream being analyzed, next returns the string found and sets the status of the object to true.
- new SYMBOL_NAME, REGEXP, SUB
-
Creates an object of the Parse::Token type. The arguments of the new method are: a symbolic name, a regular expression, and an anonymous subroutine.
REGEXP is either a simple regular expression, or a reference to an array containing from one to three regular expressions. In the latter case the lexeme can span several lines. For example, it can be a character string delimited by quotation marks, comments in a C program, etc. The regular expressions are used to recognize:
1. The beginning of the lexeme,
2. The "body" of the lexeme; if this second expression is missing, Parse::Lex uses "(?:.*?)",
3. The end of the lexeme; if this last expression is missing then the first one is used. (Note! The end of the lexeme cannot span several lines).
Example:
    qw(STRING), [qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ")],
These regular expressions can recognize multi-line strings delimited by quotation marks, where the backslash is used to quote the quotation marks appearing within the string. Notice the quadrupling of the backslash.
Here is a variation of the previous example which uses the s option to include newline in the characters recognized by ".":
    qw(STRING), [qw(" (?s:[^"\\\\]+|\\\\.)* ")],
(Note: it is possible to write regular expressions which are more efficient in terms of execution time, but this is not our objective with this example.)
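To see what the quadrupled backslash turns into, the three patterns of the first STRING example can be inspected with core Perl alone. The sketch below glues them into a single regular expression purely for illustration; this composition is an assumption made for the demonstration, and Parse::Lex may combine the parts differently. The variable names are hypothetical.

```perl
use strict;
use warnings;

# The three patterns, written with qw() exactly as in the STRING example.
my ($begin, $body, $end) = qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ");

# Inside qw(), each pair of backslashes collapses to one, so the regex
# engine receives two backslashes: the pattern for one literal backslash.
print "body pattern: $body\n";

# Hypothetical composition, for illustration only.
my $string_re = qr/$begin($body)$end/;

# A quoted string that spans two lines and contains escaped quotation marks.
my $input = qq{say "first line\nsecond \\"quoted\\" word" done};
if ($input =~ $string_re) {
    print "matched body:\n$1\n";
}
```

The body pattern alternates between runs of characters that are neither a quotation mark nor a backslash and a backslash followed by any character, which is why an escaped quotation mark does not end the match.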
The anonymous subroutine is called when the lexeme is recognized by the lexical analyzer. This subroutine takes two arguments: $_[0] contains the Parse::Token object, and $_[1] contains the string recognized by the regular expression. The scalar returned by the anonymous subroutine defines the character string memorized in the Parse::Token object.
In the anonymous subroutine you can use the positional variables $1, $2, etc. which correspond to the groups of parentheses in the regular expression.
- regexp
-
Returns the regular expression of the Token object.
- set LIST
-
Allows marking a Token object with a list of attribute-value pairs.
An attribute name can be used as a method name.
- setText EXPR
-
The value of EXPR defines the character string associated with the lexeme. Same as the text(EXPR) method.
- status EXPR
- status
-
Indicates whether the last search for the lexeme succeeded or failed. status EXPR overrides the existing value and sets it to the value of EXPR.
- text EXPR
- text
-
text() returns the character string recognized by means of the Token object. The value of EXPR sets the character string associated with the lexeme.
- trace OUTPUT
- trace
-
Class method which activates/deactivates a trace of the lexical analysis. OUTPUT can be a file name or a reference to a filehandle to which the trace will be directed.
ERROR HANDLING
To handle lexemes that are not recognized, you can define a special Token object at the end of the list of tokens which defines the lexical analyzer. If the search for this token succeeds, it is then possible to call a subroutine reserved for error handling.
FUTURE CHANGES
Subclasses of the Parse::Token class are being defined. They will permit recognizing specific structures such as, for example, strings within double-quotes, C comments, etc. Here are the subclasses which I plan to create:
Parse::Token::Simple: for defining 'ordinary' tokens.
Parse::Token::Multiline: for defining tokens which may necessitate reading additional data.
Parse::Token::Nested: for recognizing nested structures such as parenthesized expressions.
Parse::Token::Delimited: for recognizing, for example, strings within double-quotes.
The names of these classes as proposed above may be changed if you wish to suggest alternatives.
AUTHOR
Philippe Verdret. Documentation translated to English by Vladimir Alexiev and Ocrat.
ACKNOWLEDGMENTS
Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has significantly contributed to improving this documentation.
REFERENCES
Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates 1996.
Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, Inc. 1990.
COPYRIGHT
Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.