NAME
Devel::Tokenizer::C - Generate C source for fast keyword tokenizer
SYNOPSIS
    use Devel::Tokenizer::C;
    $t = Devel::Tokenizer::C->new(TokenFunc => sub { "return \U$_[0];\n" });
    $t->add_tokens(qw( bar baz ))->add_tokens(['for']);
    $t->add_tokens([qw( foo )], 'defined DIRECTIVE');
    print $t->generate;
DESCRIPTION
The Devel::Tokenizer::C module provides a small class for creating the essential ANSI C source code for a fast keyword tokenizer.
The generated code is optimized for speed. On the ANSI-C keyword set, it's 2-3 times faster than equivalent code generated with the gperf utility.
The above example would print the following C source code:
    switch (tokstr[0])
    {
      case 'b':
        switch (tokstr[1])
        {
          case 'a':
            switch (tokstr[2])
            {
              case 'r':
                if (tokstr[3] == '\0')
                { /* bar */
                  return BAR;
                }
                goto unknown;
              case 'z':
                if (tokstr[3] == '\0')
                { /* baz */
                  return BAZ;
                }
                goto unknown;
              default:
                goto unknown;
            }
          default:
            goto unknown;
        }
      case 'f':
        switch (tokstr[1])
        {
          case 'o':
            switch (tokstr[2])
            {
    #if defined DIRECTIVE
              case 'o':
                if (tokstr[3] == '\0')
                { /* foo */
                  return FOO;
                }
                goto unknown;
    #endif /* defined DIRECTIVE */
              case 'r':
                if (tokstr[3] == '\0')
                { /* for */
                  return FOR;
                }
                goto unknown;
              default:
                goto unknown;
            }
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
So the generated code only includes the main switch statement for the tokenizer. You can configure most of the generated code to fit your application.
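Because only the switch statement is generated, the enclosing function, the token constants and the target of goto unknown have to come from the application. The following is a minimal sketch of such a wrapper around the listing above. The enum token, its TOK_UNKNOWN member and the function name lookup_keyword are assumptions made for illustration; the module does not emit them.

```c
/* Token codes used by the generated "return BAR;" statements.
 * TOK_UNKNOWN is a hypothetical value for non-keywords. */
enum token { TOK_UNKNOWN, BAR, BAZ, FOO, FOR };

/* Hypothetical wrapper. tokstr must be NUL-terminated, which matches
 * the default TokenEnd of '\0'. */
enum token lookup_keyword(const char *tokstr)
{
  /* ---- begin generated code (pasted from the listing) ---- */
  switch (tokstr[0])
  {
    case 'b':
      switch (tokstr[1])
      {
        case 'a':
          switch (tokstr[2])
          {
            case 'r':
              if (tokstr[3] == '\0')
              { /* bar */
                return BAR;
              }
              goto unknown;
            case 'z':
              if (tokstr[3] == '\0')
              { /* baz */
                return BAZ;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    case 'f':
      switch (tokstr[1])
      {
        case 'o':
          switch (tokstr[2])
          {
#if defined DIRECTIVE
            case 'o':
              if (tokstr[3] == '\0')
              { /* foo */
                return FOO;
              }
              goto unknown;
#endif /* defined DIRECTIVE */
            case 'r':
              if (tokstr[3] == '\0')
              { /* for */
                return FOR;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }
  /* ---- end generated code ---- */

unknown: /* the default UnknownLabel */
  return TOK_UNKNOWN;
}
```

With this wrapper, "foo" is only recognized when the translation unit is compiled with DIRECTIVE defined; otherwise it ends up at the unknown label like any other non-keyword.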
METHODS
new
The following configuration options can be passed to the constructor.
CaseSensitive => 0 | 1
Boolean defining whether the generated tokenizer should be case sensitive or not. This will only affect the letters A-Z. The default is 1, so the generated tokenizer is case sensitive.
Comments => 0 | 1
Boolean defining whether the generated code should contain comments or not. The default is 1, so comments will be generated.
Indent => STRING
String to be used for one level of indentation. The default is two space characters.
MergeSwitches => 0 | 1
Boolean defining whether nested switch statements containing only a single case should be merged into a single if statement. This is usually only done at the end of a branch. With MergeSwitches, merging will also be done in the middle of a branch. E.g. the code
    $t = Devel::Tokenizer::C->new(
      TokenFunc => sub { "return \U$_[0];\n" },
      MergeSwitches => 1,
    );
    $t->add_tokens(qw( carport carpet muppet ));
    print $t->generate;
would output this switch statement:
    switch (tokstr[0])
    {
      case 'c':
        if (tokstr[1] == 'a' &&
            tokstr[2] == 'r' &&
            tokstr[3] == 'p')
        {
          switch (tokstr[4])
          {
            case 'e':
              if (tokstr[5] == 't' &&
                  tokstr[6] == '\0')
              { /* carpet */
                return CARPET;
              }
              goto unknown;
            case 'o':
              if (tokstr[5] == 'r' &&
                  tokstr[6] == 't' &&
                  tokstr[7] == '\0')
              { /* carport */
                return CARPORT;
              }
              goto unknown;
            default:
              goto unknown;
          }
        }
        goto unknown;
      case 'm':
        if (tokstr[1] == 'u' &&
            tokstr[2] == 'p' &&
            tokstr[3] == 'p' &&
            tokstr[4] == 'e' &&
            tokstr[5] == 't' &&
            tokstr[6] == '\0')
        { /* muppet */
          return MUPPET;
        }
        goto unknown;
      default:
        goto unknown;
    }
Strategy => 'ordered' | 'narrow' | 'wide'
The strategy to be used for sorting character positions. ordered will leave the characters in their normal order. narrow will sort the character positions so that the positions with the least character variation are checked first. wide will do exactly the opposite. (If you're confused now, just try it. ;-)
The default is ordered. You can only use narrow and wide together with StringLength.
The code
    $t = Devel::Tokenizer::C->new(
      TokenFunc => sub { "return \U$_[0];\n" },
      StringLength => 'len',
      Strategy => 'ordered',
    );
    $t->add_tokens(qw( mhj xho mhx ));
    print $t->generate;
would output this switch statement:
    switch (len)
    {
      case 3: /* 3 tokens of length 3 */
        switch (tokstr[0])
        {
          case 'm':
            switch (tokstr[1])
            {
              case 'h':
                switch (tokstr[2])
                {
                  case 'j':
                    { /* mhj */
                      return MHJ;
                    }
                  case 'x':
                    { /* mhx */
                      return MHX;
                    }
                  default:
                    goto unknown;
                }
              default:
                goto unknown;
            }
          case 'x':
            if (tokstr[1] == 'h' &&
                tokstr[2] == 'o')
            { /* xho */
              return XHO;
            }
            goto unknown;
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
Using the narrow strategy, the switch statement would be:
    switch (len)
    {
      case 3: /* 3 tokens of length 3 */
        switch (tokstr[1])
        {
          case 'h':
            switch (tokstr[0])
            {
              case 'm':
                switch (tokstr[2])
                {
                  case 'j':
                    { /* mhj */
                      return MHJ;
                    }
                  case 'x':
                    { /* mhx */
                      return MHX;
                    }
                  default:
                    goto unknown;
                }
              case 'x':
                if (tokstr[2] == 'o')
                { /* xho */
                  return XHO;
                }
                goto unknown;
              default:
                goto unknown;
            }
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
Using the wide strategy, the switch statement would be:
    switch (len)
    {
      case 3: /* 3 tokens of length 3 */
        switch (tokstr[2])
        {
          case 'j':
            if (tokstr[0] == 'm' &&
                tokstr[1] == 'h')
            { /* mhj */
              return MHJ;
            }
            goto unknown;
          case 'o':
            if (tokstr[0] == 'x' &&
                tokstr[1] == 'h')
            { /* xho */
              return XHO;
            }
            goto unknown;
          case 'x':
            if (tokstr[0] == 'm' &&
                tokstr[1] == 'h')
            { /* mhx */
              return MHX;
            }
            goto unknown;
          default:
            goto unknown;
        }
      default:
        goto unknown;
    }
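The order in which the three listings above test tokstr[0], tokstr[1] and tokstr[2] can be reproduced by counting how many distinct characters occur at each position of mhj, xho and mhx. The following is a sketch of that criterion only, not the module's actual implementation; the names variation and position_order are made up for this example.

```c
#include <stddef.h>

#define NTOK   3  /* number of tokens in the example */
#define TOKLEN 3  /* all example tokens have this length */

static const char *const tokens[NTOK] = { "mhj", "xho", "mhx" };

/* Number of distinct characters found at position pos across all tokens. */
static int variation(size_t pos)
{
  int seen[256] = { 0 };
  int count = 0;

  for (size_t i = 0; i < NTOK; i++)
  {
    unsigned char c = (unsigned char) tokens[i][pos];
    if (!seen[c])
    {
      seen[c] = 1;
      count++;
    }
  }
  return count;
}

/* Fill order[] with the character positions sorted by variation:
 * least variation first when wide == 0 ("narrow"), most variation
 * first otherwise ("wide"). Uses a stable insertion sort. */
void position_order(int wide, size_t order[TOKLEN])
{
  for (size_t i = 0; i < TOKLEN; i++)
  {
    int v = variation(i);
    size_t j = i;

    /* shift every position that has to come after position i */
    while (j > 0 && (wide ? variation(order[j - 1]) < v
                          : variation(order[j - 1]) > v))
    {
      order[j] = order[j - 1];
      j--;
    }
    order[j] = i;
  }
}
```

Position 0 holds two distinct characters (m, x), position 1 holds one (h) and position 2 holds three (j, o, x). The sketch therefore yields the order 1, 0, 2 for narrow and 2, 0, 1 for wide, which is the order in which the corresponding listings examine tokstr.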
StringLength => STRING
Identifier of the C variable that contains the length of the string, when available. If the string length is known, switching can be done more efficiently. That doesn't mean that it is more efficient to compute the string length first. If you don't know the string length, just don't use this option. Leaving it unset is also the default.
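A typical case where the length is already known is a scanner that holds a start pointer and a token length into a larger buffer. The sketch below wraps the switch statement printed for StringLength => 'len' with the ordered strategy (shown under Strategy). The function name classify, the enum token and TOK_UNKNOWN are assumptions for illustration.

```c
#include <stddef.h>

/* Token codes used by the generated code; TOK_UNKNOWN is hypothetical. */
enum token { TOK_UNKNOWN, MHJ, MHX, XHO };

/* Hypothetical wrapper: len is the identifier passed as StringLength. */
enum token classify(const char *tokstr, size_t len)
{
  /* ---- begin generated code (ordered strategy listing) ---- */
  switch (len)
  {
    case 3: /* 3 tokens of length 3 */
      switch (tokstr[0])
      {
        case 'm':
          switch (tokstr[1])
          {
            case 'h':
              switch (tokstr[2])
              {
                case 'j':
                  { /* mhj */
                    return MHJ;
                  }
                case 'x':
                  { /* mhx */
                    return MHX;
                  }
                default:
                  goto unknown;
              }
            default:
              goto unknown;
          }
        case 'x':
          if (tokstr[1] == 'h' &&
              tokstr[2] == 'o')
          { /* xho */
            return XHO;
          }
          goto unknown;
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }
  /* ---- end generated code ---- */

unknown:
  return TOK_UNKNOWN;
}
```

As that listing shows, no character beyond tokstr[len - 1] is examined, so a call such as classify(buf + 4, 3) on the buffer "mhx+xho" can classify a slice without copying or terminating it first.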
TokenEnd => STRING
Character that defines the end of each token. The default is the null character '\0'. Can also be undef if tokens don't end with a special character.
TokenFunc => SUBROUTINE
A reference to the subroutine that returns the code for each token match. The only parameter to the subroutine is the token string.
This is the default subroutine:
    TokenFunc => sub { "return $_[0];\n" }
It is the responsibility of the supplier of this routine to make the code exit out of the generated code once a token is matched, otherwise the behaviour of the generated code is undefined.
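Returning is not the only way to leave the generated code; jumping to a label behind the switch statement works as well. The sketch below is hand-edited from the SYNOPSIS output, reduced to bar and baz, and assumes a TokenFunc that emits an assignment followed by goto done instead of a return. The names classify_with_goto, tok, matches, done and TOK_UNKNOWN are hypothetical.

```c
/* Token codes; TOK_UNKNOWN is a hypothetical value for non-keywords. */
enum token { TOK_UNKNOWN, BAR, BAZ };

/* Hypothetical caller that runs shared code after every match, so the
 * per-token code jumps to a label instead of returning. */
enum token classify_with_goto(const char *tokstr, int *matches)
{
  enum token tok = TOK_UNKNOWN;

  switch (tokstr[0])
  {
    case 'b':
      switch (tokstr[1])
      {
        case 'a':
          switch (tokstr[2])
          {
            case 'r':
              if (tokstr[3] == '\0')
              { /* bar */
                tok = BAR; goto done;
              }
              goto unknown;
            case 'z':
              if (tokstr[3] == '\0')
              { /* baz */
                tok = BAZ; goto done;
              }
              goto unknown;
            default:
              goto unknown;
          }
        default:
          goto unknown;
      }
    default:
      goto unknown;
  }

done:
  (*matches)++; /* shared work performed for every recognized token */
  return tok;

unknown:
  return TOK_UNKNOWN;
}
```

A plain break would not be sufficient here: it leaves only the innermost switch, after which control falls into the default label of the enclosing switch and the token is reported as unknown.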
TokenString => STRING
Identifier of the C character array that contains the token string. The default is tokstr.
UnknownLabel => STRING
Label that should be jumped to via goto if there's no keyword matching the token. The default is unknown.
UnknownCode => STRING
Code that should be executed if there's no keyword matching the token. This is an alternative to UnknownLabel. If UnknownCode is present, it will override UnknownLabel.
add_tokens
You can add tokens using the add_tokens method.
The method either takes a list of token strings or a reference to an array of token strings which can optionally be followed by a preprocessor directive string.
Calls to add_tokens can be chained together, as the method returns a reference to its calling object.
generate
The generate method will return a string with the tokenizer switch statement. If no tokens were added, it will return an empty string.
You can optionally pass an Indent option to the generate method to specify a string used for indenting the whole switch statement, e.g.:
    print $t->generate(Indent => "\t");
This is completely independent from the Indent option passed to the constructor.
AUTHOR
Marcus Holland-Moritz <mhx@cpan.org>
BUGS
I hope none, since the code is pretty short. Perhaps lack of functionality ;-)
COPYRIGHT
Copyright (c) 2002-2008, Marcus Holland-Moritz. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.