NAME

Language::FormulaEngine::Parser - Create parse tree from an input string

VERSION

version 0.07

SYNOPSIS

my $parse_tree= Language::FormulaEngine::Parser->new->parse($string);

DESCRIPTION

This class scans tokens from an input string and builds a parse tree. In compiler terminology, it is both a Scanner and Parser. It performs a top-down recursive descent parse, because this is easy and gives good error messages. It only parses strings, but leaves room for subclasses to implement streaming. By default, the parser simply applies a Grammar to the input, without checking whether the functions or variables exist, but can be subclassed to do more detailed analysis during the parse.

The generated parse tree is made up of Function nodes (each infix operator is converted to a named function) and each Function node may contain Symbols, Strings, Numbers, and other Function nodes. The parse tree can be passed to the Evaluator for instant execution, or passed to the Compiler to generate an optimized perl coderef. The parse tree is lightweight, and does not include token/context information; this could also be added by a subclass.

PUBLIC API

parse

Parse a new input text, updating all derived attributes with the result of the operation. It returns the value of "parse_tree" (which is undef if the parse failed). On failure, the exception is stored in "error" and other attributes like "token_pos" may contain useful diagnostic information.

parse_tree

This holds the generated parse tree, or undef if the parse failed. See "Parse Nodes".

error

This is undef if the parse succeeded, else an error message describing the syntax that ended the parse.

functions

A set (hashref) of all function names encountered during the parse.

symbols

A set (hashref) of all non-function symbols encountered. (variables, constnts, etc.)

reset

Clear the results of the previous parse, to re-use the object. Returns $self for chaining.

deparse

my $formula_text= $parser->deparse($tree);

Return a canonical formula text for the parse tree, or a parse tree that you supply.

EXTENSIBLE API

These methods and attributes are documented for purposes of subclassing the parser.

input

The input string being scanned. Code within the parser should access this as $self->{input} for efficiency.

input_pos

Shortcut for pos($self->{input}).

token_type

Type of current token scanned from input. Code within the parser should access this as $self->{token_type} for efficiency.

token_value

Value of current token scanned from input, with escape sequences and etc resolved to a sensible perl value. Code within the parser should access this as $self->{token_value} for efficiency.

token_pos

An offset within input where this token started. Code within the parser should access this as $self->{token_pos} for efficiency.

next_token

Advance to the next token, replacing the values of token_ variables and updating input_pos. Returns the token_type, of which all are true except EOF which has a type of 0, so this also means the function returns true if it parsed a token and false if it reached EOF. It dies if no token could be parsed. If you call next_token again after the eof token, it throws an exception.

This method is a wrapper around "scan_token". Override that method to add new token types.

scan_token

Pattern-match the next token, and either return $type => $value or an empty list if the syntax is invalid. This is intended to be overridden by subclasses.

consume_token

return $self->consume_token if $self->{token_type} eq $desired_type;

This is a shorthand for returning the current token_value while also calling next_token.

token_context

my $text= $self->token_context(%options);

Default behavior generates a string like:

"'blah blah' on line 15, char 12"

Passing token_context(multiline => 1) generates a string like

"Expected something else at line 15, char 16\n" .
"blah blah blah token blah blah\n" .
"               ^^^^^\n"

Multiline additionally takes arguments as described in "format_context_multiline" in Language::FormulaEngine::Parser::ContextUtil.

GRAMMAR

Parse Rules

The default grammar implements the following rules:

expr      ::= or_expr
or_expr   ::= and_expr ( 'or' and_expr )*
and_expr  ::= not_expr ( 'and' not_expr )*
not_expr  ::= ( 'not' | '!' ) cmp_expr | cmp_expr
cmp_expr  ::= sum_expr ( ( '=' | '==' | '<>' | '\u2260' | '<' | '<=' | '>' | '>=' ) sum_expr )*
sum_expr  ::= prod_expr ( ('+' | '-') prod_expr )*
prod_expr ::= ( unit_expr ('*' | '/') )* unit_expr
unit_expr ::= '-' unit_expr | Identifier '(' list ')' | '(' (expr|list) ')' | Identifier | Number | String
list      ::= expr ( ',' expr )* ','?

ident, num, str, and all the punctuation symbols are tokens.

The parser uses a Recursive Descent algorithm implemented as the following method calls. Each method consumes tokens from $self and return a "PARSE NODES":

parse_expr
parse_or_expr
parse_and_expr
parse_not_expr
parse_cmp_expr
parse_sum_expr
parse_prod_expr
parse_unit_expr
parse_list

Token Types

'Number'

All the common decimal representations of integers and floating point numbers which perl can parse. Optional decimals and decimal point followed by decimals and optional exponent, ending at either the end of the input or a non-alphanumeric.

'String'

A single-quoted or double-quoted string, treating a double occurrence of the quote character to mean a literal quote character. ("Pascal style")

'apostrophes are''nt hard'

There are no escape sequences though, so to get control characters or awkward unicode into a string you need something like:

concat("smile ",char(0x263A))

which depends on those functions being available in the namespace.

Keywords...

Keywords include the "word" tokens like 'OR', but also every text literal seen in a parse rule such as operators and punctuation. The token_type of the keyword is the canonical version of the keyword, and the token_value is the actual text that was captured. The pattern matches the longest keyword possible.

'Identifier'

Any alpha (or underscore) followed by any run of alphanumerics, (including underscore and period).

Customizing the Token Scanner

The tokens are parsed using a series of regex tests. The regexes and the code that handles a match of that regex are found in package attribute "scanner_rules". These regexes and code fragments get lazily compiled into a package method on the first use (per package). Meanwhile, several of those regex are built from other package attributes.

scanner_rules

This package method returns a list (not arrayref) of ordered elements of the form [ $name, $regex, $code_fragment, \%vars ]. You can subclass this method to inspect the rules (probably based on $name) and replace the regexes, or alter the handler code, or add/remove your own rules. The regexes are attempted in the order they appear in this list. You do not need to use "\G" or "/gc" on these regexes because those are added automatically during compilation.

keyword_map

This package method returns a hashref of all known keywords, mapped to their canonical form. So for instance, a key of '<>' with a value of '!='. These tokens automatically become the scanner rule named Keywords. In turn, the contents of this hashref include the "cmp_operators", "math_operators", "logic_operators", and "list_operators" which can be overridden separately.

This method is called once during the compilation of "scan_token", and the result is then made into a constant and referenced by the compiled method, so dynamic changes to the output of this method will be ignored.

cmp_operators

Package method that returns a list of comparison operators, like '<', '>=', etc.

math_operators

Package method that returns a list of math operators, like '*', '+', etc.

logic_operators

Package method that returns a list of keywords like 'and', 'or', etc.

list_operators

Package method that returns a list of '(', ')', ','

Parse Nodes

The parse tree takes a minimalist approach to node classification. In this default implementation, number values, string values, and symbolic references have just a simple wrapper around the value, and function calls are just a pair of function name and list of arguments. All language operators are represented as function calls.

A blessed node only needs to support one method: ->evaluate($namespace).

The class name of the blessed nodes should be ignored. A function is anything which can("function_name"), a string is anything which can("string_value"), a number is anything which can("number_value") and a symbolic reference is anything which can("symbolic_name").

Subclasses of Parser should implemnt new node types as needed. You probable also need to update "deparse".

The parser rules (parse_X_expr methods) create nodes by the following methods on the Parser class, so that you can easily subclass Parser and override which class of node is getting created.

new_call
$node= $parser->new_call( $function_name, $parameters );

Generate a node for a function call. The returned node has attributes function_name and parameters

new_symbol
$node= $parser->new_symbol($symbol_name);

A reference to a symbolic value (i.e. variable or constant). It has one attribute symbol_name.

new_string
$node= $parser->new_string($string_value);

A string literal. It has an attribute string_value holding the raw value.

new_number
$plain_scalar= $parser->new_number($value);

A numeric constant. It has an attribute number_value holding the raw value.

get_negative
$negative_node= $parser->get_negative( $node );

Utility method to get the "opposite of" a parse node. By default, this wraps it with the function 'negative', unless it already was that function then it unwraps the parameter. It performs simple negation on numbers.

AUTHOR

Michael Conrad <mconrad@intellitree.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2023 by Michael Conrad, IntelliTree Solutions llc.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.