NAME

Parse::Token - Definition of tokens used by Parse::Lex

SYNOPSIS

require 5.005;

use Parse::Lex;
@token = qw(
    ADDOP    [-+]
    INTEGER  [1-9][0-9]*
   );

$lexer = Parse::Lex->new(@token);
$lexer->from(\*DATA);

$content = $INTEGER->next;
if ($INTEGER->status) {
  print "$content\n";
}
$content = $ADDOP->next;
if ($ADDOP->status) {
  print "$content\n";
}
if ($INTEGER->isnext(\$content)) {
  print "$content\n";
}
__END__
1+2

DESCRIPTION

The Parse::Token package defines the lexemes used by Parse::Lex or Parse::CLex. The new() method of the Parse::Lex package indirectly creates a Parse::Token instance for each recognized lexeme. The next and isnext methods of Parse::Token make it easy to interface the lexical analyzer with a recursive-descent syntactic analyzer (parser). For interfacing with byacc, see the Parse::YYLex package.

This package is included indirectly via use Parse::Lex.

Methods

action

Returns the anonymous subroutine defined within the Parse::Token object.

factory LIST

Creates a list of Parse::Token objects from a list of token specifications. The list can also include objects of class Parse::Token or of a class derived from it. Can be used as a class method or instance method.

The factory(LIST) method can be used to create a set of tokens which are not within the analysis automaton. This method carries out two operations: 1) it creates the objects based on the specifications given in LIST (see the new() method), and 2) it imports the created objects into the calling package.

You could for example write:

%keywords = 
  qw (
      PROC  undef
      FUNC  undef
      RETURN undef
      IF    undef
      ELSE  undef
      WHILE undef
      PRINT undef
      READ  undef
     );
Parse::Token->factory(%keywords);

and install these tokens in a symbol table in the following manner:

foreach $name (keys %keywords) {
  $symbol{"\L$name"} = [${$name}, ''];
}

Here ${$name} is the Parse::Token object that factory() imported into the calling package under the name held in $name.

During the lexical analysis phase, you can use the tokens in the following manner:

qw(IDENT [a-zA-Z][a-zA-Z0-9]*),  sub {
   $symbol{$_[1]} = [] unless defined $symbol{$_[1]};
   my $type = $symbol{$_[1]}[0];
   $lexer->setToken((not defined $type) ? $VAR : $type);
   $_[1];  # THE TOKEN TEXT
 }

This marks any symbol of unknown type as a variable.

In this example we have used $_[1] which corresponds to the text recognized by the regular expression. This text is what is returned by the anonymous subroutine.
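The lookup performed by the IDENT action can be tried in isolation. The following is a minimal sketch, assuming plain strings as stand-ins for the Parse::Token objects so that it runs without Parse::Lex; the classify() helper and the 'VAR' string are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Stand-ins for the Parse::Token objects: plain strings keep this sketch
# independent of Parse::Lex (an assumption made for illustration only).
my @keywords = qw(PROC FUNC RETURN IF ELSE WHILE PRINT READ);

# Same layout as the symbol table above:
# lower-cased keyword text => [token type, value].
my %symbol;
foreach my $name (@keywords) {
    $symbol{"\L$name"} = [$name, ''];
}

# Hypothetical helper mirroring the body of the IDENT action:
# an identifier with no recorded type is classified as a variable.
sub classify {
    my ($text) = @_;
    $symbol{$text} = [] unless defined $symbol{$text};
    my $type = $symbol{$text}[0];
    return defined $type ? $type : 'VAR';
}

print join(' ', map { classify($_) } qw(while counter print)), "\n";
```

As in the original action, looking up an unknown identifier also creates an empty entry for it in the symbol table.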

get EXPR

get obtains the value of the attribute named by the result of evaluating EXPR. You can also use the name of the attribute as a method name.

getText

Returns the character string that was recognized by means of this Parse::Token object.

Same as the text() method.

isnext EXPR
isnext

Returns the status of the token. If EXPR is a reference to a scalar, the consumed string is stored in that scalar.

name

Returns the symbolic name of the Parse::Token object.

next

Activates the search for the lexeme defined by the regular expression contained in the object. If this lexeme is recognized in the character stream being analyzed, next returns the string found and sets the status of the object to true.

new SYMBOL_NAME, REGEXP, SUB

Creates an object of the Parse::Token type. The arguments of the new method are: a symbolic name, a regular expression, and an anonymous subroutine.

REGEXP is either a simple regular expression, or a reference to an array containing from one to three regular expressions. In the latter case the lexeme can span several lines. For example, it can be a character string delimited by quotation marks, comments in a C program, etc. The regular expressions are used to recognize:

1. The beginning of the lexeme.

2. The "body" of the lexeme; if this second expression is missing, Parse::Lex uses "(?:.*?)".

3. The end of the lexeme; if this last expression is missing, the first one is used. (Note: the end of the lexeme cannot span several lines.)

Example:

qw(STRING), [qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ")],

These regular expressions recognize multi-line strings delimited by quotation marks, where a backslash escapes quotation marks appearing within the string. Notice the quadrupling of the backslash: qw() reduces each \\\\ to \\, which the regular expression engine then reads as a single literal backslash.

Here is a variation of the previous example which uses the s option to include newline in the characters recognized by ".":

qw(STRING), [qw(" (?s:[^"\\\\]+|\\\\.)* ")],

(Note: it is possible to write regular expressions which are more efficient in terms of execution time, but this is not our objective with this example.)
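To see what the quadrupled backslashes turn into, the three expressions of the first STRING example can be inspected and tried in plain Perl. This is only a sketch: joining the three parts into one pattern is a simplification made here, and is not a description of how Parse::Lex reads a multi-line lexeme from a stream. The sample input is made up for the illustration.

```perl
use strict;
use warnings;

# The three expressions from the first STRING example, exactly as written.
my ($begin, $body, $end) = qw(" (?:[^"\\\\]+|\\\\(?:.|\n))* ");

# qw() reduces each pair of backslashes to one, so the regex engine
# receives \\ (one literal backslash) where the source shows \\\\.
print "body as seen by the regex engine: $body\n";

# Assumption for this sketch only: glue the parts into a single pattern.
my $string = qr/$begin$body$end/;

# A two-line string containing escaped quotation marks.
my $input = qq{say "line one\nwith an \\"escaped\\" quote" end};
my ($lexeme) = $input =~ /($string)/;
print "$lexeme\n";
```

The match starts at the first quotation mark, crosses the newline, skips both escaped quotation marks, and stops at the closing one.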

The anonymous subroutine is called when the lexeme is recognized by the lexical analyzer. This subroutine takes two arguments: $_[0] contains the Parse::Token object, and $_[1] contains the string recognized by the regular expression. The scalar returned by the anonymous subroutine defines the character string stored in the Parse::Token object.

In the anonymous subroutine you can use the positional variables $1, $2, etc. which correspond to the groups of parentheses in the regular expression.
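The availability of $1, $2, etc. inside the subroutine follows from Perl's dynamic scoping of capture variables. The sketch below shows this in plain Perl, without Parse::Lex: the PAIR name, the key=value pattern, and the way the subroutine is invoked are assumptions made for illustration, with a plain string standing in for the Parse::Token object.

```perl
use strict;
use warnings;

# Hypothetical action for a token that matches key=value pairs.
# $_[0] would be the Parse::Token object (a plain string stands in here);
# $_[1] is the text that was recognized.
my $action = sub {
    my ($token, $matched) = @_;
    # $1 and $2 still refer to the groups of the match that preceded
    # the call, because capture variables are dynamically scoped.
    return "$matched: key=$1 value=$2";
};

my $regexp = qr/(\w+)=(\d+)/;
my $text   = 'width=80';
my $stored;

if ($text =~ /^$regexp/) {
    # Extract the whole matched text from the match offsets.
    my $recognized = substr($text, $-[0], $+[0] - $-[0]);
    $stored = $action->('PAIR', $recognized);
}
print "$stored\n";
```

The returned scalar plays the role of the string stored in the token object.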

regexp

Returns the regular expression of the Token object.

set LIST

Marks a Token object with a list of attribute-value pairs.

An attribute name can be used as a method name.

setText EXPR

The value of EXPR defines the character string associated with the lexeme.

Same as the text(EXPR) method.

status EXPR
status

Indicates whether the last search for the lexeme succeeded or failed. status EXPR overrides the existing value with the value of EXPR.

text EXPR
text

text() returns the character string recognized by means of the Token object. text(EXPR) sets the character string associated with the lexeme to the value of EXPR.

trace OUTPUT
trace

Class method which activates/deactivates a trace of the lexical analysis.

OUTPUT can be a file name or a reference to a filehandle to which the trace will be directed.

ERROR HANDLING

To handle unrecognized lexemes, you can define a special Token object at the end of the list of tokens that defines the lexical analyzer. If the search for this token succeeds, you can then call a subroutine reserved for error handling.
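The idea of a final catch-all token can be sketched with a small stand-alone scanner in plain Perl. This is not Parse::Lex itself: the scan() function, the ERROR name, and the ".+" pattern are assumptions chosen for the illustration. Tokens are tried in list order, so the last entry only matches when nothing else does.

```perl
use strict;
use warnings;

# Ordered token table; the last entry is the catch-all error token.
# The ERROR name and its pattern are assumptions made for this sketch.
my @token = (
    [ ADDOP   => qr/[-+]/        ],
    [ INTEGER => qr/[1-9][0-9]*/ ],
    [ ERROR   => qr/.+/s         ],
);

# Hypothetical stand-alone scanner: at each position it tries the tokens
# in order and records the first one that matches.
sub scan {
    my ($input) = @_;
    my @found;
    pos($input) = 0;
    SCAN: while (pos($input) < length($input)) {
        next SCAN if $input =~ /\G\s+/gc;      # skip whitespace
        for my $t (@token) {
            my ($name, $re) = @$t;
            if ($input =~ /\G($re)/gc) {
                push @found, [$name, $1];
                next SCAN;
            }
        }
        last;                                  # safeguard; ERROR matches any rest
    }
    return @found;
}

for my $t (scan("1+x2")) {
    my ($name, $text) = @$t;
    if ($name eq 'ERROR') {
        warn qq{cannot analyze: "$text"\n};    # error-handling hook
        last;
    }
    print "$name\t$text\n";
}
```

Here the ERROR entry swallows the rest of the input, which gives the error-handling code the exact text that could not be analyzed.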

FUTURE CHANGES

Subclasses of the Parse::Token class are being defined. They will permit recognizing specific structures such as, for example, strings within double-quotes, C comments, etc. Here are the subclasses which I plan to create:

Parse::Token::Simple : for defining 'ordinary' tokens.

Parse::Token::Multiline : for defining tokens which may necessitate reading additional data.

Parse::Token::Nested : for recognizing nested structures such as parenthesized expressions.

Parse::Token::Delimited : for recognizing, for example, strings within double-quotes.

The class names proposed above may still change; suggestions for alternatives are welcome.

AUTHOR

Philippe Verdret. Documentation translated to English by Vladimir Alexiev and Ocrat.

ACKNOWLEDGMENTS

Version 2.0 owes much to suggestions made by Vladimir Alexiev. Ocrat has significantly contributed to improving this documentation.

REFERENCES

Friedl, J.E.F. Mastering Regular Expressions. O'Reilly & Associates, 1996.

Mason, T. & Brown, D. Lex & Yacc. O'Reilly & Associates, Inc., 1990.

COPYRIGHT

Copyright (c) 1995-1999 Philippe Verdret. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.