NAME
Text::Parser::Manual::ExtendedAWKSyntax - The ExAWK (extended AWK) syntax itself
VERSION
version 1.000
THE EXTENDED AWK LANGUAGE
So want to get started with writing your own parser based on Text::Parser. And the best place to start is to learn how to write parsing rules.
In this chapter, we only describe the ExAWK language syntax and features. But this is the main part. In fact, the remaining things are extremely simple. And when you see how intuitive the rules are, you'll wonder why this was not part of native Perl.
Parsing rules may be specified using the add_rule
, BEGIN_rule
, and END_rule
methods. Alternatively, you may sub-class Text::Parser, and use the applies_rule
syntax sugar from Text::Parser::RuleSpec. In either case, the syntax and form of the rules is what determines what your text parser does.
BASIC SYNTAX
The basic form of the ExAWK rule is like this:
if => 'condition', do => 'action'
## options:
## dont_record => 0|1
## continue_to_next => 0|1
AWK programmers would recognize that it is similar to the basic syntax of AWK:
condition { action; }
Pay attention to the single quotes in the value of the if
and do
keys. This is important as you'll see below.
Similarities with AWK
Optional parts of a rule
Just as in AWK the condition
can be as simple as a regular expression, or a complex boolean expression. In AWK you could do:
$ awk '/EMAIL:/ {print $2}' file.txt
In ExAWK you could write something like this to get something equivalent (Note the need for "\n"
):
$parser->add_rule(
if => 'm/EMAIL:/', do => 'print $2, "\n"'
);
Similar to AWK, in the Extended AWK language too, if the condition
is specified, then the action
is optional, and if the action
is specified, then the condition
is optional. The default condition
in ExAWK is just like in AWK: true for each input line. So the following will simply print every line in a file:
$parser->add_rule(
do => 'print' # The default 'if' is true for all lines
);
Or else:
$parser->add_rule(
if => 'm/^\d+/' # The default 'do' stores the whole line.
);
Field identifiers, and line identifier
AWK is very popular for its intuitive field identifiers $1
, $2
, $3
etc. ExAWK provides the same and much more.
In AWK, $0
identifies the whole line. The same is true in ExAWK. But in addition, the Perl in-built variable $_
also contains the line. So any in-built functions that take a missing parameter to be $_
will behave accordingly. This is how a terse rule like do => 'print'
happens to work. The important difference between $0
and $_
is that $0
is just an identifier (i.e., you cannot change its value), whereas $_
is an actual variable which you can change. By changing $_
, you cannot change the value of the current line, or the value of $0
.
Similarly, other positional field identifiers like $1
, $2
, etc., are not variables. Even in AWK, they are not variables. They are just positional field identifiers. They represent an Rvalue and cannot be modified. So for example
$ awk '// {$1 = "something";}' file.txt
will not change anything. In the same way:
$parser->add_rule(do => '$1 = "something";');
In particular they are not the same as the native Perl regular expression field identifiers $1
, $2
etc., which are used in regexp substitutions.
The positional field identifiers $1
, $2
etc. mean something else inside the string expressions of ExAWK. Like AWK, $1
represents the first field, $2
represents the second field, and so on.
Note: In the UNIX implementation of AWK, the positional identifiers are limited to $9
. In POSIX implementations this limitation has already been removed. In ExAWK also there is no limit to the number of positional identifiers.
Important note about quotes
You should always use single quotes (''
) for your rule strings, and not double quotes (""
). This is because sigils like $
get dereferenced inside double quotes, and $0
, $1
, etc., have no value in your main code.
Differences with AWK
Language
The first difference with AWK is the language. ExAWK condition
and action
strings are Perl. So for example, to compare strings you should use eq
and not ==
like you would in AWK. The condition
and action
strings are transformed into regular Perl and compiled. So if they fail to compile, add_rule method will throw an exception.
Execution loop
In AWK, each rule is run for each line, even if the condition
for a previous rule may be true. If the condition
of a rule is true, the action
is performed.
But in ExAWK, rules are executed until the condition
of one rule is true. By default, the execution of further rules stops at that point. For example:
$parser->add_rule(if => '$1 =~ /^[#]/', dont_record => 1);
In this case, the moment a line leading with #
is encountered, it is ignored, and no other rules are executed. But see the following code:
$parser->add_rule(if => '$1 =~ /^[#][!]/', continue_to_next => 1);
$parser->add_rule(
if => '$1 =~ /perl$/',
do => '$this->abort_reading; print "This is a perl script.\n";'
);
$parser->add_rule(
if => '$1 =~ /bash$/',
do => '$this->abort_reading; print "This is a bash script.\n";'
);
$parser->add_rule(
if => '$1 =~ /^[#]/ or $this->lines_parsed > 0',
do => '$this->abort_reading; print "Neither perl nor bash.\n";',
dont_record => 1
);
Now if a file starts with #!
, the condition for the first rule is met, and it immediately tests the next rule. If that condition is met, it will abort reading at that point and print the message. But if the condition for the second rule is not met, it will test the next rule. If that condition is also not met, then it will test the fourth rule. In it will surely meet the condition for the fourth rule (we could have skipped it), and will execute that rule. At this point it will stop because there is no continue_to_next
option.
So in this way we can control the execution sequence.
Default action
We saw that only one of condition
or action
is required, the other may be omitted. The default condition
is same as AWK. But the default action
is different. The default action
in AWK is print
. Thus:
$ awk '/li/' file.txt
will print all the lines with 'li'
somewhere in it. But in ExAWK, since it integrates with the Text::Parser class, the default action is to return
the whole line.
if => 'm/li/'
# returns each line containing 'li', to the parser
# The parser then saves it as a record,
# unless dont_record is true
If you want to print
instead and not record anything, you need to specify that:
if => 'm/li/', do => 'print', dont_record => 1
ENHANCED FEATURES OF EXTENDED AWK LANGUAGE
Reverse field identifiers
To access fields from the end of the line, use identifiers ${-1}
, ${-2}
, etc. ${-1}
is the last field, ${-2}
is the penultimate field, and so forth.
Field range shortcuts
Sometimes, we want to access all the fields starting from the 2nd, or 3rd, leaving all the earlier ones. So we have a set of shortcuts to do all that. Below are some shortcut examples:
SHORTCUT CODE EQUIVALENT MEANING
======== ================ ========
${2+} $this->join_range(1, -1) Everything from second field as a string. Spaces will be collapsed to one space.
@{3+} $this->field_range(2, -1) Everything from third field as an array.
\@{2+} [ $this->field_range(2, -1) ] Arrayref containing everything from second field.
Automatic checks for NF
In AWK if you write:
$ awk '{print $4;}' text.txt
then all lines with 3 or less fields will print a blank line to the screen because $4
evaluates to empty string when there are less than 4 fields on a line. So you would get empty lines. To ensure you take only lines with 4 fields, you need to do:
$ awk 'NF>=4 {print $4;}' text.txt
But in ExAWK this is unnecessary. So the rule:
do => 'return $4;'
automatically sets up a pre-condition for the number of fields NF
and ensures that each line being read has at least 4 fields. This works even for negative positional indicators. So the rule:
do => 'return ${-4};'
This ensures you don't run into undef
records being saved in the parser.
Local variables
You can use any Perl local variables you want. For example:
do => 'my (@numbers) = @{3+};'
Note that @numbers
above is accessible only within that rule action. It is not accessible outside of that do
string.
Use any variable other than $this
.
Suite of string and array utility functions
Perl anyway has more built-in functions that are very useful and better than their AWK counterparts. But in addition, CPAN has a lot of great modules with utility functions. ExAWK gives the programmer adds a few good utility functions, but also makes it very easy to add any other functions:
Scalar::Util :
blessed
,looks_like_number
String::Util : All functions here
List::Util : The following functions:
reduce
,any
,all
,none
,notall
,first
,max
,maxstr
,min
,minstr
,product
,sum
,sum0
,pairs
,unpairs
,pairkeys
,pairvalues
,pairfirst
,pairgrep
,pairmap
,shuffle
,uniq
,uniqnum
, anduniqstr
I have kept this list small to minimize Text::Parser
dependencies. The user can import whatever functions they want from the package of their choice.
How to add other utility functions
Suppose you know of a very useful package (fictitiously named) Useful::Package
. And let's say it has functions foo
and bar
that are very useful and operate on strings. And you wish to use these in your rules. Then do the following in your code:
use Import::Into;
Useful::Package->import::into('Text::Parser::Rule', qw(foo bar));
use Text::Parser;
my $parser = Text::Parser->new();
$parser->add_rule(if => 'bar($2)', do => 'return foo($1).foo($2);');
This means that the power of any new package on CPAN can be harnessed very easily.
THE FUN FEATURES
If you could use the parser object itself to store data, this would open up many possibilities. And that is precisely what this section is about.
The $this
variable
To access the parser object, you can use $this
inside the rule strings. Remember again, that the rule strings should be in single quotes (''
). Here is an example rule using the $this
variable to refer to the parser:
$parser->add_rule(
if => '$this->lines_parsed > 10',
do => '$this->abort_reading;',
dont_record => 1,
);
Important Note on $this
$this
is a real variable. If you modify its value, it will change. So be careful what you do with $this
. If you save the $this
to another variable in the hope that you can retrieve it later, remember that all positional field indicators and range shortcuts are entirely dependent on the this
variable. If this variable is tampered with, you could get garbage results. You have been forewarned.
Stashed variables
The idea of storing data in the $this
variable is the obvious next step. You get what are called "stashed variable"s. Internally these variables are just stored in a hash, but it is as if you are stashing away useful data. You can store any scalar, hashref or arrayref. Note that you can't store an array or hash itself, only arrayref or hashref.
Stashed variables begin with the tilde (~
) character, followed by an alphabet or underscore (_
. So for example:
if => '$1 eq "MARKER:"', do => '~info = $2; ~_secret = $3;'
In the above rule, ~info
and ~_secret
are stashed variables, which are accessible in other rules. You can set stashed variables in a BEGIN_rule
, and access it in later rules. Here is an obvious example:
my $parser = Text::Parser->new();
$parser->BEGIN_rule( do => '~count = 0;' );
$parser->add_rule( if => '$1 eq "ERROR:"', do => '~count++;' );
$parser->read('/path/to/logfile.log');
print "Found ", $parser->stashed('count'), " errors in your logfile\n";
You may forget
a stashed variable, and it will be lost for ever. Or you can simply clear the whole stash of variables by using clear_stash
method.
All stashed variables are forgotten right before read
starts reading the input. So you have a clean stash each time you call read
.
$parser->read('/another/logfile.log');
print "Whereas, the other one had ", $parser->stashed('count'), " errors\n";
You can also have pre-stashed variables that persist across multiple read
method calls. Read more about that here.
Adding and using new class attributes
When you sub-class Text::Parser you can get some very powerful features. You can do something like this:
package MyClass::Parser;
use Moose;
extends 'Text::Parser';
use Text::Parser::RuleSpec;
has section => (
is => 'ro',
isa => 'Str',
lazy => 1,
);
has ids => (
is => 'ro',
isa => 'HashRef[ArrayRef[Str]]',
default => sub {return {};},
lazy => 1,
handles => {
get_section => 'get',
set_section => 'set',
has_section => 'exists',
},
);
sub add_id {
my $self = shift;
my $ids = $self->get_section($self->section);
push @{$ids}, shift;
}
applies_rule find_section => (
if => '$1 eq "SECTION:"',
do => '$this->section($2); $this->set_section($2 => []);',
);
applies_rule name_in_section => (
if => '$1 eq "ID"',
do => '$this->add_id($2);'
);
You can see a lot more examples in Text::Parser::RuleSpec;
This allows you to write your own parser. In fact, because you can now use inheritance to create a sub-class, you can sub-class that sub-class also, thereby making a variant of a given parser.
SUMMARY
ExAWK is in Perl language, and includes a whole arsenal of utility functions, and also the power to use any functions from any desired CPAN package.
Execution of rules can be controlled and not all rules need to be run.
In ExAWK, you may use identifiers like
${-1}
,${2+}
,@{3+}
etc. Using any positional identifier automatically adds a condition that tests if the line has the minimum number of fields required.You can use regular Perl variables inside rules, or you may use stashed variables.
You can sub-class
Text::Parser
to make your own parser class. And then you can sub-class that further to re-use your code and create multiple variants.
BUGS
Please report any bugs or feature requests on the bugtracker website http://github.com/balajirama/Text-Parser/issues
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.
AUTHOR
Balaji Ramasubramanian <balajiram@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2018-2019 by Balaji Ramasubramanian.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.