NAME
Text::TokenStream::Lexer - reusable lexer for token-stream scanning
SYNOPSIS
    my $lexer = Text::TokenStream::Lexer->new(
        whitespace => [qr/\s+/, qr/\# [^\n]* (?:\n|\z)/x],
        rules => [
            word => qr/\w+/,
            sym => qr/[^\w\s\#]+/,
        ],
    );
    my $token = $lexer->next_token(\$input_text);
DESCRIPTION
A lexer instance is constructed by specifying regexes that match individual parts of the input text. Each regex is associated with a token type that will be used to distinguish the tokens found. The regexes are tried in the order they're given in the "rules" attribute; this means, for example, that you can have a keyword rule that matches any of a list of specified keywords, followed by an identifier rule that matches arbitrary identifiers, even if keywords have the same syntax as identifiers.
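The effect of rule ordering can be illustrated in plain Perl, without the module. This is only a sketch of the "first matching rule wins" idea; first_matching_type and the keyword list are hypothetical names chosen for the example, not part of this class.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::TokenStream::Lexer: try each
# (type, regex) pair in order and return the type of the first rule
# that matches at the start of $text.
sub first_matching_type {
    my ($rules, $text) = @_;
    for (my $i = 0; $i < @$rules; $i += 2) {
        my ($type, $regex) = @$rules[$i, $i + 1];
        return $type if $text =~ /\A$regex/;
    }
    return;
}

# The keyword rule comes first, so it takes priority over the identifier
# rule even though every keyword also looks like an identifier.
my @ordered_rules = (
    keyword    => qr/(?:if|else|while)\b/,
    identifier => qr/[A-Za-z_]\w*/,
);

print first_matching_type(\@ordered_rules, 'while (1)'), "\n";   # prints "keyword"
print first_matching_type(\@ordered_rules, 'whilst'), "\n";      # prints "identifier"
```

The \b in the keyword rule is what stops "whilst" or "iffy" from being split into a keyword plus a trailing identifier.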
(In actual fact, the regexes are preprocessed into a form that the regex engine can handle more easily, and only one regex match operation is performed to extract each token. This should be completely transparent to the caller.)
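One way such preprocessing could work is to fold the ordered rules into a single alternation and tag each branch with Perl's (*MARK:NAME) verb, so that one match reports both the text and the winning rule. The sketch below shows that technique only; it is not taken from this module's source, whose internals may differ, and combine_rules and classify are hypothetical names.

```perl
use strict;
use warnings;

# Package variable set by the regex engine to the name of the most recent
# (*MARK:NAME) involved in a successful match.
our $REGMARK;

# Hypothetical helper: fold ordered (type, regex) pairs into one anchored
# alternation, tagging each branch with its type.
sub combine_rules {
    my @rules = @_;
    my @branches;
    while (my ($type, $regex) = splice @rules, 0, 2) {
        push @branches, "(?:$regex)(*MARK:$type)";
    }
    my $alternation = join '|', @branches;
    return qr/\A(?:$alternation)/;
}

my $combined = combine_rules(
    keyword    => qr/(?:if|else|while)\b/,
    identifier => qr/[A-Za-z_]\w*/,
    number     => qr/[0-9]+/,
);

# Hypothetical helper: a single match operation yields (type, text),
# or an empty list if no branch matches.
sub classify {
    my ($text) = @_;
    return unless $text =~ $combined;
    return ($REGMARK, substr($text, 0, $+[0]));
}

my ($type, $matched) = classify('while (1)');
print "$type: $matched\n";
```

Because Perl tries alternation branches left to right, the combined pattern keeps the same priority order as the original rule list.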
A lexer will attempt to skip whitespace before scanning each token; to do that, it uses a separate set of regexes, in the "whitespace" attribute.
CONSTRUCTOR
This class uses Moo, and inherits the standard new constructor.
ATTRIBUTES
rules
Required; read-only. Array ref of (identifier, rule) pairs. Each rule is a regex (or a literal string) that will be matched at the current position in the input; if the rule matches, the preceding identifier is used as the type of the token.
If a rule regex has any named captures, the contents of those captures will be preserved in the value returned by "next_token".
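In plain Perl, named captures from the most recent successful match are available in the %+ hash, and copying that hash is one way to preserve them. The sketch below assumes nothing about how this class stores them; the decimal-number rule and match_with_captures are hypothetical, chosen only to show what a "captures" hash might contain.

```perl
use strict;
use warnings;

# Hypothetical rule with two named captures; the fractional part is optional.
my $decimal_rule = qr/(?<int>[0-9]+)(?:\.(?<frac>[0-9]+))?/;

# Hypothetical helper: match the rule at the start of $text and keep a copy
# of the named captures alongside the matched text.
sub match_with_captures {
    my ($text) = @_;
    return unless $text =~ /\A$decimal_rule/;
    return +{
        text     => substr($text, 0, $+[0]),
        captures => { %+ },   # only captures that took part in the match appear
    };
}

my $result = match_with_captures('3.14 rest');
print join(', ', map { "$_=$result->{captures}{$_}" } sort keys %{ $result->{captures} }), "\n";
```

A named group that did not take part in the match (here, "frac" for an input such as "42") is simply absent from the copied hash.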
The regexes will be implicitly anchored to the next match position in the string being examined, so you should not add any initial anchor.
It is the caller's responsibility to ensure that the rules match every possible input.
whitespace
Read-only; defaults to empty array ref. Array ref of rules, where each rule is a regex (or literal string) that will be treated as whitespace. It will typically be a good idea to include comments (if needed in your language) in this attribute.
The regexes will be implicitly anchored to the next match position in the string being examined, so you should not add any initial anchor.
OTHER METHODS
next_token
Takes one argument, which is a reference to a string. First attempts to "skip_whitespace" on the referenced string, and returns undef if the string is empty after any whitespace. Then attempts to match each of the "rules" against the remaining part of the string. If no rule matches, throws an exception. Otherwise, returns a hashref containing the following elements:
type - The identifier corresponding to the rule that matched
text - The text matched by the regex
cuddled - A boolean value, true iff the token was not preceded by whitespace
captures - A hashref of any named captures matched by the regex
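The behaviour described above can be emulated in a few lines of plain Perl, which may help in seeing what each element of the returned hashref means. This is a stand-in written only from the description, not the module's code: sketch_next_token is a hypothetical name, it takes a single whitespace regex for brevity, and it assumes the matched token is removed from the referenced string so that repeated calls walk through the input.

```perl
use strict;
use warnings;

# Hypothetical stand-in for next_token. $rules is an array ref of
# (type, regex) pairs; $ws is a single whitespace regex.
sub sketch_next_token {
    my ($rules, $ws, $str_ref) = @_;

    # Strip leading whitespace and remember whether any was found.
    my $saw_ws = 0;
    $saw_ws = 1 while $$str_ref =~ s/\A$ws//;

    return undef if !length $$str_ref;

    for (my $i = 0; $i < @$rules; $i += 2) {
        my ($type, $regex) = @$rules[$i, $i + 1];
        next unless $$str_ref =~ /\A$regex/;
        my $captures = { %+ };
        # Assumption: the token is consumed from the referenced string.
        my $text = substr($$str_ref, 0, $+[0], '');
        return {
            type     => $type,
            text     => $text,
            cuddled  => $saw_ws ? 0 : 1,
            captures => $captures,
        };
    }
    die "No rule matches at: " . substr($$str_ref, 0, 30) . "\n";
}

# Rules borrowed from the SYNOPSIS.
my @synopsis_rules = (
    word => qr/\w+/,
    sym  => qr/[^\w\s\#]+/,
);

my $input = "foo += bar";
while (my $token = sketch_next_token(\@synopsis_rules, qr/\s+/, \$input)) {
    printf "%-4s %-3s cuddled=%d\n", @$token{qw(type text cuddled)};
}
```

In this sketch the first token of an input with no leading whitespace counts as cuddled, and an input such as "#" with no comment pattern in the whitespace regex reaches the die, which is why the rules need to cover every possible input.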
skip_whitespace
Takes one argument, which is a reference to a string. If none of the "whitespace" patterns match at the start of the referenced string, returns false. Otherwise, removes as many leading whitespace sequences as it can from the beginning of the referenced string, and returns true.
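As an illustration of this contract, the following plain-Perl sketch strips leading whitespace and comments using the two patterns from the SYNOPSIS. It is a hypothetical re-implementation (sketch_skip_whitespace is not a method of this class) and makes no claim about how the module itself performs the loop.

```perl
use strict;
use warnings;

# Whitespace patterns taken from the SYNOPSIS: runs of whitespace
# characters, and "#" comments running to end of line or end of input.
my @whitespace = (qr/\s+/, qr/\# [^\n]* (?:\n|\z)/x);

# Hypothetical re-implementation of the documented contract: keep removing
# leading matches until no pattern applies, and report whether anything
# was removed.
sub sketch_skip_whitespace {
    my ($str_ref) = @_;
    my $skipped  = 0;
    my $progress = 1;
    while ($progress) {
        $progress = 0;
        for my $pattern (@whitespace) {
            if ($$str_ref =~ s/\A$pattern//) {
                $skipped = $progress = 1;
            }
        }
    }
    return $skipped;
}

my $source = "  # a comment\n\t# another\n  word rest";
my $skipped = sketch_skip_whitespace(\$source);
print "skipped=$skipped, remaining='$source'\n";
```

The outer loop matters because whitespace and comments can alternate: a single pass over the patterns would stop after the first comment.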
AUTHOR
Aaron Crane, <arc@cpan.org>
COPYRIGHT
Copyright 2021 Aaron Crane.
LICENCE
This library is free software and may be distributed under the same terms as perl itself. See http://dev.perl.org/licenses/.