NAME

Text::Parser::Manual::ExtendedAWKSyntax - The ExAWK (extended AWK) syntax itself

VERSION

version 1.000

THE EXTENDED AWK LANGUAGE

So you want to get started with writing your own parser based on Text::Parser? The best place to start is to learn how to write parsing rules.

This chapter describes only the ExAWK language syntax and features, but this is the main part; the rest is extremely simple. And when you see how intuitive the rules are, you'll wonder why this was not part of native Perl.

Parsing rules may be specified using the add_rule, BEGIN_rule, and END_rule methods. Alternatively, you may sub-class Text::Parser, and use the applies_rule syntax sugar from Text::Parser::RuleSpec. In either case, the syntax and form of the rules is what determines what your text parser does.

BASIC SYNTAX

The basic form of the ExAWK rule is like this:

if => 'condition', do => 'action'
        ## options: 
        ##   dont_record => 0|1
        ##   continue_to_next => 0|1

AWK programmers would recognize that it is similar to the basic syntax of AWK:

condition { action; }

Pay attention to the single quotes in the value of the if and do keys. This is important as you'll see below.

Similarities with AWK

Optional parts of a rule

Just as in AWK the condition can be as simple as a regular expression, or a complex boolean expression. In AWK you could do:

$ awk '/EMAIL:/ {print $2}' file.txt

In ExAWK you could write something like this to get something equivalent (Note the need for "\n"):

$parser->add_rule(
    if => 'm/EMAIL:/', do => 'print $2, "\n"'
);

As in AWK, if the condition is specified, then the action is optional, and if the action is specified, then the condition is optional. The default condition in ExAWK is just like in AWK: true for each input line. So the following will simply print every line in a file:

$parser->add_rule(
    do => 'print'       # The default 'if' is true for all lines
);

Or else:

$parser->add_rule(
    if => 'm/^\d+/' # The default 'do' stores the whole line.
);

Field identifiers, and line identifier

AWK is very popular for its intuitive field identifiers $1, $2, $3 etc. ExAWK provides the same and much more.

In AWK, $0 identifies the whole line. The same is true in ExAWK. In addition, the Perl built-in variable $_ also contains the line, so any built-in function that defaults to $_ when its argument is omitted will behave accordingly. This is how a terse rule like do => 'print' happens to work. The important difference between $0 and $_ is that $0 is just an identifier (i.e., you cannot change its value), whereas $_ is an actual variable which you can change. Changing $_, however, does not change the current line or the value of $0.

Similarly, the other positional field identifiers like $1, $2, etc., are not variables in ExAWK. They are just positional field identifiers: they represent an Rvalue and cannot be modified. AWK does accept an assignment to a field, but that only alters its in-memory copy of the record. So for example

$ awk '// {$1 = "something";}' file.txt

will not change anything in file.txt. In the same way, the following rule cannot modify the first field of the line:

$parser->add_rule(do => '$1 = "something";');

In particular, they are not the same as the native Perl regular expression capture variables $1, $2, etc., which hold the groups captured by a regexp match or substitution.

The positional field identifiers $1, $2 etc. mean something else inside the string expressions of ExAWK. Like AWK, $1 represents the first field, $2 represents the second field, and so on.

Note: In the UNIX implementation of AWK, the positional identifiers are limited to $9. In POSIX implementations this limitation has already been removed. In ExAWK also there is no limit to the number of positional identifiers.

Important note about quotes

You should always use single quotes ('') for your rule strings, and not double quotes (""). This is because Perl interpolates variables like $0, $1, etc. inside double quotes, and they have no meaningful value in your main code.
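The effect of the quoting style can be seen in plain Perl, without the parser at all. The following is a minimal sketch; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Single quotes keep the rule text intact, so the rule compiler sees '$1'.
my $single_quoted = '$1 eq "NAME:"';

# Double quotes interpolate $1 immediately. No regexp has matched here, so
# Perl's own $1 is undefined and the field identifier silently vanishes.
my $double_quoted = do {
    no warnings 'uninitialized';    # silence the warning for this demo
    "$1 eq \"NAME:\"";
};

print "single: $single_quoted\n";
print "double: $double_quoted\n";
```

The second string has already lost its field identifier before it could ever be passed to add_rule.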

Differences with AWK

Language

The first difference with AWK is the language. ExAWK condition and action strings are Perl. So, for example, to compare strings you should use eq and not == as you would in AWK. The condition and action strings are transformed into regular Perl and compiled, so if they fail to compile, the add_rule method will throw an exception.
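Why the choice of operator matters can be shown in plain Perl. This is only an illustration of Perl's comparison semantics, using made-up values; the same semantics apply to expressions written inside a rule string.

```perl
use strict;
use warnings;

my $field = 'abc';

# Numeric comparison: both strings are non-numeric, so both count as 0
# and the test is true even though the strings differ.
my $numeric_equal = do {
    no warnings 'numeric';    # Perl would otherwise warn about this
    ( $field == 'xyz' ) ? 1 : 0;
};

# String comparison: the test a condition like '$1 eq "xyz"' should use.
my $string_equal = ( $field eq 'xyz' ) ? 1 : 0;

print "==: $numeric_equal, eq: $string_equal\n";
```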

Execution loop

In AWK, each rule is run for each line, even if the condition for a previous rule may be true. If the condition of a rule is true, the action is performed.

But in ExAWK, rules are executed until the condition of one rule is true. By default, the execution of further rules stops at that point. For example:

$parser->add_rule(if => '$1 =~ /^[#]/', dont_record => 1);

In this case, the moment a line starting with # is encountered, it is ignored, and no other rules are executed. But see the following code:

$parser->add_rule(if => '$1 =~ /^[#][!]/', continue_to_next => 1);
$parser->add_rule(
    if => '$1 =~ /perl$/',
    do => '$this->abort_reading; print "This is a perl script.\n";'
);
$parser->add_rule(
    if => '$1 =~ /bash$/',
    do => '$this->abort_reading; print "This is a bash script.\n";'
);
$parser->add_rule(
    if          => '$1 =~ /^[#]/ or $this->lines_parsed > 0',
    do          => '$this->abort_reading; print "Neither perl nor bash.\n";',
    dont_record => 1
);

Now if a file starts with #!, the condition for the first rule is met, and because of continue_to_next the parser immediately tests the next rule. If that condition is met, it aborts reading at that point and prints the message. If the condition for the second rule is not met, it tests the third rule. If that condition is also not met, it tests the fourth rule. The line will surely meet the condition for the fourth rule (we could have skipped the condition altogether), so that rule is executed. At this point execution stops because the fourth rule has no continue_to_next option.

So in this way we can control the execution sequence.
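The default stop-at-first-match behaviour can be tried with a small self-contained script. This is a sketch under assumptions: the sample input and variable names are made up, and it assumes Text::Parser is installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Hypothetical sample input, written to a temporary file.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} "# settings\nname alice\n# more settings\nname bob\n";
close $fh;

my $parser = Text::Parser->new();

# Rule 1: comment lines match here, are not recorded, and no later rule runs.
$parser->add_rule( if => '$1 =~ /^[#]/', dont_record => 1 );

# Rule 2: reached only by lines that did not match rule 1.
$parser->add_rule( do => 'return $2;' );

$parser->read($filename);
my @names = $parser->get_records;
print "$_\n" for @names;
```

If rule 2 also ran for the comment lines, their second fields would show up among the records; with the default behaviour only the names should be recorded.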

Default action

We saw that only one of condition or action is required; the other may be omitted. The default condition is the same as in AWK, but the default action is different. The default action in AWK is print. Thus:

$ awk '/li/' file.txt

will print all the lines with 'li' somewhere in them. But ExAWK integrates with the Text::Parser class, so its default action is to return the whole line.

if => 'm/li/'
    # returns each line containing 'li', to the parser
    # The parser then saves it as a record,
    # unless dont_record is true

If you want to print instead and not record anything, you need to specify that:

if => 'm/li/', do => 'print', dont_record => 1
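A complete script makes the default action easier to see. The sketch below assumes Text::Parser is installed; the input lines are invented, and auto_chomp is switched on only so that the saved records carry no trailing newline.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Hypothetical sample input, written to a temporary file.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} "alice likes lime\nbob eats pears\ncharlie lives here\n";
close $fh;

my $parser = Text::Parser->new( auto_chomp => 1 );

# No 'do' given: each matching line is returned to the parser and recorded.
$parser->add_rule( if => 'm/li/' );

$parser->read($filename);
my @matches = $parser->get_records;
print "$_\n" for @matches;
```

Nothing is printed by the rule itself; the matching lines are retrieved afterwards with get_records.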

ENHANCED FEATURES OF EXTENDED AWK LANGUAGE

Reverse field identifiers

To access fields from the end of the line, use identifiers ${-1}, ${-2}, etc. ${-1} is the last field, ${-2} is the penultimate field, and so forth.

Field range shortcuts

Sometimes we want to access all the fields starting from the 2nd or 3rd, skipping the earlier ones. There is a set of shortcuts for that. Below are some examples:

SHORTCUT     CODE EQUIVALENT                    MEANING
========     ===============                    =======
${2+}        $this->join_range(1, -1)           Everything from second field as a string. Spaces will be collapsed to one space.
@{3+}        $this->field_range(2, -1)          Everything from third field as an array.
\@{2+}       [ $this->field_range(1, -1) ]      Arrayref containing everything from second field.
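The reverse identifiers and range shortcuts can be combined in ordinary rules. Here is a short sketch, assuming Text::Parser is installed; the input lines and labels are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Hypothetical sample input; note the extra spaces in the first line.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} <<'END_INPUT';
NAME: Jane   Q Public
PHONE: office 555 0100
END_INPUT
close $fh;

my $parser = Text::Parser->new();

# ${2+} : everything from the second field onward, joined into one string.
$parser->add_rule( if => '$1 eq "NAME:"', do => 'return ${2+};' );

# ${-1} : the last field on the line.
$parser->add_rule( if => '$1 eq "PHONE:"', do => 'return ${-1};' );

$parser->read($filename);
my @records = $parser->get_records;
print "$_\n" for @records;
```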

Automatic checks for NF

In AWK if you write:

$ awk '{print $4;}' text.txt

then all lines with 3 or fewer fields will print a blank line to the screen, because $4 evaluates to an empty string when there are fewer than 4 fields on a line. To ensure you take only lines with at least 4 fields, you need to do:

$ awk 'NF>=4 {print $4;}' text.txt

But in ExAWK this is unnecessary. So the rule:

do => 'return $4;'

automatically sets up a pre-condition for the number of fields NF and ensures that each line being read has at least 4 fields. This works even for negative positional indicators. So the rule:

do => 'return ${-4};'

likewise runs only on lines that have at least 4 fields. This ensures you don't run into undef records being saved in the parser.
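The automatic field-count check can be observed with input whose lines have different numbers of fields. This is a sketch with invented input, assuming Text::Parser is installed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Hypothetical sample input: lines with five, two and four fields.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} "a b c d e\nf g\nh i j k\n";
close $fh;

my $parser = Text::Parser->new();

# No explicit test of the field count is written in the rule.
$parser->add_rule( do => 'return $4;' );

$parser->read($filename);
my @fourth_fields = $parser->get_records;
print "$_\n" for @fourth_fields;
```

The two-field line should contribute no record at all, rather than an undef or empty one.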

Local variables

You can use any Perl lexical (my) variables you want. For example:

do => 'my (@numbers) = @{3+};'

Note that @numbers above is accessible only within that rule action. It is not accessible outside of that do string.

You may use any variable name other than $this, which refers to the parser object.
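A lexical variable is handy for holding a range of fields before passing it to one of the utility functions. The sketch below assumes Text::Parser is installed and that sum from List::Util is available inside rules, as listed in the next section; the input format is invented.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use Text::Parser;

# Hypothetical sample input: a label, a name, then some numbers.
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} "SCORES: alice 10 20 30\nNOTE: not a score line\nSCORES: bob 5 5\n";
close $fh;

my $parser = Text::Parser->new();

# @numbers exists only while this action runs for the current line.
$parser->add_rule(
    if => '$1 eq "SCORES:"',
    do => 'my (@numbers) = @{3+}; return sum(@numbers);',
);

$parser->read($filename);
my @totals = $parser->get_records;
print "$_\n" for @totals;
```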

Suite of string and array utility functions

Perl already has many built-in functions that are very useful and better than their AWK counterparts. In addition, CPAN has a lot of great modules with utility functions. ExAWK makes a few good utility functions available inside rules, and also makes it very easy to add any others. The following are available by default:

  • Scalar::Util : blessed, looks_like_number

  • String::Util : All functions here

  • List::Util : The following functions: reduce, any, all, none, notall, first, max, maxstr, min, minstr, product, sum, sum0, pairs, unpairs, pairkeys, pairvalues, pairfirst, pairgrep, pairmap, shuffle, uniq, uniqnum, and uniqstr

I have kept this list small to minimize Text::Parser dependencies. The user can import whatever functions they want from the package of their choice.

How to add other utility functions

Suppose you know of a very useful package (fictitiously named) Useful::Package. And let's say it has functions foo and bar that are very useful and operate on strings. And you wish to use these in your rules. Then do the following in your code:

use Import::Into;
Useful::Package->import::into('Text::Parser::Rule', qw(foo bar));
use Text::Parser;

my $parser = Text::Parser->new();
$parser->add_rule(if => 'bar($2)', do => 'return foo($1).foo($2);');

This means that the power of any new package on CPAN can be harnessed very easily.

THE FUN FEATURES

If you could use the parser object itself to store data, this would open up many possibilities. And that is precisely what this section is about.

The $this variable

To access the parser object, you can use $this inside the rule strings. Remember again, that the rule strings should be in single quotes (''). Here is an example rule using the $this variable to refer to the parser:

$parser->add_rule(
    if          => '$this->lines_parsed > 10',
    do          => '$this->abort_reading;',
    dont_record => 1,
);

Important Note on $this

$this is a real variable. If you modify its value, it will change. So be careful what you do with $this. If you save $this to another variable in the hope that you can restore it later, remember that all positional field identifiers and range shortcuts are entirely dependent on the $this variable. If this variable is tampered with, you could get garbage results. You have been forewarned.

Stashed variables

The idea of storing data in the $this variable is the obvious next step. You get what are called "stashed variables". Internally these variables are just stored in a hash, but it is as if you are stashing away useful data. You can store any scalar, hashref, or arrayref. Note that you can't store an array or hash itself, only an arrayref or hashref.

Stashed variables begin with the tilde (~) character, followed by a letter or underscore (_). So for example:

if => '$1 eq "MARKER:"', do => '~info = $2; ~_secret = $3;'

In the above rule, ~info and ~_secret are stashed variables, which are accessible in other rules. You can set stashed variables in a BEGIN_rule, and access them in later rules. Here is an obvious example:

my $parser = Text::Parser->new();
$parser->BEGIN_rule( do => '~count = 0;' );
$parser->add_rule( if => '$1 eq "ERROR:"', do => '~count++;' );
$parser->read('/path/to/logfile.log');
print "Found ", $parser->stashed('count'), " errors in your logfile\n";

You may forget a stashed variable, and it will be lost forever. Or you can simply clear the whole stash of variables by using the clear_stash method.

All stashed variables are forgotten right before read starts reading the input. So you have a clean stash each time you call read.

$parser->read('/another/logfile.log');
print "Whereas, the other one had ", $parser->stashed('count'), " errors\n";

You can also have pre-stashed variables that persist across multiple read method calls. Read more about that in the Text::Parser documentation.

Adding and using new class attributes

When you sub-class Text::Parser you can get some very powerful features. You can do something like this:

package MyClass::Parser;

use Moose;
extends 'Text::Parser';
use Text::Parser::RuleSpec;

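has section => (
    is      => 'rw',    # rules call $this->section($2) to set it
    isa     => 'Str',
    default => '',      # a lazy attribute needs a default (or builder)
    lazy    => 1,
);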

has ids => (
    is      => 'ro',
    isa     => 'HashRef[ArrayRef[Str]]',
    traits  => ['Hash'],    # native trait that provides get/set/exists
    default => sub {return {};},
    lazy    => 1,
    handles => {
        get_section => 'get',
        set_section => 'set',
        has_section => 'exists',
    },
);

sub add_id {
    my $self = shift;
    my $ids = $self->get_section($self->section);
    push @{$ids}, shift;
}

applies_rule find_section => (
    if => '$1 eq "SECTION:"',
    do => '$this->section($2); $this->set_section($2 => []);',
);

applies_rule name_in_section => (
    if => '$1 eq "ID"',
    do => '$this->add_id($2);'
);

You can see a lot more examples in Text::Parser::RuleSpec.

This allows you to write your own parser. In fact, because you can now use inheritance to create a sub-class, you can sub-class that sub-class also, thereby making a variant of a given parser.

SUMMARY

  • ExAWK rules are written in Perl, and include a whole arsenal of utility functions, and also the power to use any functions from any desired CPAN package.

  • Execution of rules can be controlled and not all rules need to be run.

  • In ExAWK, you may use identifiers like ${-1}, ${2+}, @{3+} etc. Using any positional identifier automatically adds a condition that tests if the line has the minimum number of fields required.

  • You can use regular Perl variables inside rules, or you may use stashed variables.

  • You can sub-class Text::Parser to make your own parser class. And then you can sub-class that further to re-use your code and create multiple variants.

BUGS

Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

Balaji Ramasubramanian <balajiram@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.