NAME

Devel::Tokenizer::C - Generate C source for fast keyword tokenizer

SYNOPSIS

use Devel::Tokenizer::C;

$t = Devel::Tokenizer::C->new(TokenFunc => sub { "return \U$_[0];\n" });

$t->add_tokens(qw( bar baz ))->add_tokens(['for']);
$t->add_tokens([qw( foo )], 'defined DIRECTIVE');

print $t->generate;

DESCRIPTION

The Devel::Tokenizer::C module provides a small class for creating the essential ANSI C source code for a fast keyword tokenizer.

The generated code is optimized for speed. On the ANSI-C keyword set, it's 2-3 times faster than equivalent code generated with the gperf utility.

The above example would print the following C source code:

switch (tokstr[0])
{
  case 'b':
    switch (tokstr[1])
    {
      case 'a':
        switch (tokstr[2])
        {
          case 'r':
            if (tokstr[3] == '\0')
            {                                     /* bar */
              return BAR;
            }

            goto unknown;

          case 'z':
            if (tokstr[3] == '\0')
            {                                     /* baz */
              return BAZ;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  case 'f':
    switch (tokstr[1])
    {
      case 'o':
        switch (tokstr[2])
        {
#if defined DIRECTIVE
          case 'o':
            if (tokstr[3] == '\0')
            {                                     /* foo */
              return FOO;
            }

            goto unknown;
#endif /* defined DIRECTIVE */

          case 'r':
            if (tokstr[3] == '\0')
            {                                     /* for */
              return FOR;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  default:
    goto unknown;
}

Note that the generated code only includes the main switch statement for the tokenizer. You can configure most of the generated code to fit your application.
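
Since only the switch statement is emitted, the enclosing function, the token constants and the unknown label have to be supplied by the caller. The following is a minimal sketch of such a wrapper around the switch statement shown above. The enum, the function name get_token, the TOK_UNKNOWN value and the definition of DIRECTIVE are assumptions made for illustration; none of them is produced by the module.

```c
/* Hypothetical token codes; the generated code only refers to them by name. */
enum token { TOK_UNKNOWN, BAR, BAZ, FOO, FOR };

/* Defined here so that the conditionally compiled 'foo' branch is included. */
#define DIRECTIVE 1

/* Hypothetical wrapper; tokstr must point to a null-terminated string. */
enum token get_token(const char *tokstr)
{
  /* begin of generated code */
  switch (tokstr[0])
  {
    case 'b':
      switch (tokstr[1])
      {
        case 'a':
          switch (tokstr[2])
          {
            case 'r':
              if (tokstr[3] == '\0')
              {                                     /* bar */
                return BAR;
              }

              goto unknown;

            case 'z':
              if (tokstr[3] == '\0')
              {                                     /* baz */
                return BAZ;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    case 'f':
      switch (tokstr[1])
      {
        case 'o':
          switch (tokstr[2])
          {
#if defined DIRECTIVE
            case 'o':
              if (tokstr[3] == '\0')
              {                                     /* foo */
                return FOO;
              }

              goto unknown;
#endif /* defined DIRECTIVE */

            case 'r':
              if (tokstr[3] == '\0')
              {                                     /* for */
                return FOR;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }
  /* end of generated code */

unknown:
  return TOK_UNKNOWN;
}
```

In this sketch, get_token("for") yields FOR, while strings such as "ba" or "bars" end up at the unknown label.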

METHODS

new

The following configuration options can be passed to the constructor.

CaseSensitive => 0 | 1

Boolean defining whether the generated tokenizer should be case sensitive or not. This will only affect the letters A-Z. The default is 1, so the generated tokenizer is case sensitive.
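
To illustrate what case insensitivity means for the letters A-Z: each letter has to be accepted in either case, while the end-of-token character is still compared exactly. The following hand-written sketch shows one way to express this for the single token bar. It is not verbatim module output, and the function name is_bar_nocase is an assumption.

```c
/* Hand-written illustration of a case insensitive match for "bar". */
/* Returns 1 if tokstr spells "bar" in any mix of upper and lower case. */
int is_bar_nocase(const char *tokstr)
{
  switch (tokstr[0])
  {
    case 'B':
    case 'b':
      if ((tokstr[1] == 'A' || tokstr[1] == 'a') &&
          (tokstr[2] == 'R' || tokstr[2] == 'r') &&
          tokstr[3] == '\0')
      {                                             /* bar */
        return 1;
      }

      return 0;

    default:
      return 0;
  }
}
```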

Comments => 0 | 1

Boolean defining whether the generated code should contain comments or not. The default is 1, so comments will be generated.

Indent => STRING

String to be used for one level of indentation. The default is two space characters.

MergeSwitches => 0 | 1

Boolean defining whether nested switch statements containing only a single case should be merged into a single if statement. By default, this is only done at the end of a branch. With MergeSwitches, merging will also be done in the middle of a branch. For example, the code

$t = Devel::Tokenizer::C->new(
       TokenFunc     => sub { "return \U$_[0];\n" },
       MergeSwitches => 1,
     );

$t->add_tokens(qw( carport carpet muppet ));

print $t->generate;

would output this switch statement:

switch (tokstr[0])
{
  case 'c':
    if (tokstr[1] == 'a' &&
        tokstr[2] == 'r' &&
        tokstr[3] == 'p')
    {
      switch (tokstr[4])
      {
        case 'e':
          if (tokstr[5] == 't' &&
              tokstr[6] == '\0')
          {                                       /* carpet  */
            return CARPET;
          }

          goto unknown;

        case 'o':
          if (tokstr[5] == 'r' &&
              tokstr[6] == 't' &&
              tokstr[7] == '\0')
          {                                       /* carport */
            return CARPORT;
          }

          goto unknown;

        default:
          goto unknown;
      }
    }

    goto unknown;

  case 'm':
    if (tokstr[1] == 'u' &&
        tokstr[2] == 'p' &&
        tokstr[3] == 'p' &&
        tokstr[4] == 'e' &&
        tokstr[5] == 't' &&
        tokstr[6] == '\0')
    {                                             /* muppet  */
      return MUPPET;
    }

    goto unknown;

  default:
    goto unknown;
}

Strategy => 'ordered' | 'narrow' | 'wide'

The strategy to be used for sorting character positions. ordered will leave the characters in their normal order. narrow will sort the character positions so that the positions with the least character variation are checked first. wide will do exactly the opposite. (If you're confused now, just try it. ;-)

The default is ordered. You can only use narrow and wide together with StringLength.

The code

$t = Devel::Tokenizer::C->new(
       TokenFunc     => sub { "return \U$_[0];\n" },
       StringLength  => 'len',
       Strategy      => 'ordered',
     );

$t->add_tokens(qw( mhj xho mhx ));

print $t->generate;

would output this switch statement:

switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[0])
    {
      case 'm':
        switch (tokstr[1])
        {
          case 'h':
            switch (tokstr[2])
            {
              case 'j':
                {                                 /* mhj */
                  return MHJ;
                }

              case 'x':
                {                                 /* mhx */
                  return MHX;
                }

              default:
                goto unknown;
            }

          default:
            goto unknown;
        }

      case 'x':
        if (tokstr[1] == 'h' &&
            tokstr[2] == 'o')
        {                                         /* xho */
          return XHO;
        }

        goto unknown;

      default:
        goto unknown;
    }

  default:
    goto unknown;
}

Using the narrow strategy, the switch statement would be:

switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[1])
    {
      case 'h':
        switch (tokstr[0])
        {
          case 'm':
            switch (tokstr[2])
            {
              case 'j':
                {                                 /* mhj */
                  return MHJ;
                }

              case 'x':
                {                                 /* mhx */
                  return MHX;
                }

              default:
                goto unknown;
            }

          case 'x':
            if (tokstr[2] == 'o')
            {                                     /* xho */
              return XHO;
            }

            goto unknown;

          default:
            goto unknown;
        }

      default:
        goto unknown;
    }

  default:
    goto unknown;
}

Using the wide strategy, the switch statement would be:

switch (len)
{
  case 3: /* 3 tokens of length 3 */
    switch (tokstr[2])
    {
      case 'j':
        if (tokstr[0] == 'm' &&
            tokstr[1] == 'h')
        {                                         /* mhj */
          return MHJ;
        }

        goto unknown;

      case 'o':
        if (tokstr[0] == 'x' &&
            tokstr[1] == 'h')
        {                                         /* xho */
          return XHO;
        }

        goto unknown;

      case 'x':
        if (tokstr[0] == 'm' &&
            tokstr[1] == 'h')
        {                                         /* mhx */
          return MHX;
        }

        goto unknown;

      default:
        goto unknown;
    }

  default:
    goto unknown;
}
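
The first position tested in each of the three outputs above can be explained by counting how many different characters occur at each position of the token set. The helper below is a small sketch of that count; the name distinct_at is an assumption, and the code only illustrates the idea of "character variation", not the module's actual implementation.

```c
#include <stddef.h>
#include <string.h>

/* Count the distinct characters found at position pos across n tokens. */
/* All tokens are assumed to be at least pos + 1 characters long.       */
size_t distinct_at(const char *const *tokens, size_t n, size_t pos)
{
  unsigned char seen[256];
  size_t i, count = 0;

  memset(seen, 0, sizeof seen);

  for (i = 0; i < n; i++)
  {
    unsigned char c = (unsigned char) tokens[i][pos];

    if (!seen[c])
    {
      seen[c] = 1;
      count++;
    }
  }

  return count;
}
```

For the tokens mhj, xho and mhx, position 0 holds two different characters (m, x), position 1 holds one (h) and position 2 holds three (j, o, x). This matches the outputs above: narrow starts with tokstr[1], wide starts with tokstr[2].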

StringLength => STRING

Identifier of the C variable that contains the length of the string, when available. If the string length is known, switching can be done more efficiently. That doesn't mean that it is more efficient to compute the string length first. If you don't know the string length, just don't use this option. This is also the default.
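
A typical situation in which the length is already known is a scanner that tracks the start and the length of each token within a larger buffer. The sketch below wraps the switch statement from the ordered Strategy example above in a function taking both values. The enum, the function name classify and the KW_UNKNOWN value are assumptions made for illustration. Note that the switch never looks at an end-of-token character, so the tokens don't have to be null-terminated.

```c
#include <stddef.h>

/* Hypothetical token codes for the tokens mhj, mhx and xho. */
enum keyword { KW_UNKNOWN, MHJ, MHX, XHO };

/* Hypothetical wrapper; tokstr points to len characters. */
enum keyword classify(const char *tokstr, size_t len)
{
  /* begin of generated code */
  switch (len)
  {
    case 3: /* 3 tokens of length 3 */
      switch (tokstr[0])
      {
        case 'm':
          switch (tokstr[1])
          {
            case 'h':
              switch (tokstr[2])
              {
                case 'j':
                  {                                 /* mhj */
                    return MHJ;
                  }

                case 'x':
                  {                                 /* mhx */
                    return MHX;
                  }

                default:
                  goto unknown;
              }

            default:
              goto unknown;
          }

        case 'x':
          if (tokstr[1] == 'h' &&
              tokstr[2] == 'o')
          {                                         /* xho */
            return XHO;
          }

          goto unknown;

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }
  /* end of generated code */

unknown:
  return KW_UNKNOWN;
}
```

With a buffer containing "mhxmhjxho", calling classify on offsets 0, 3 and 6 with a length of 3 would match mhx, mhj and xho without copying or terminating the substrings.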

TokenEnd => STRING

Character that defines the end of each token. The default is the null character '\0'. Can also be undef if tokens don't end with a special character.

TokenFunc => SUBROUTINE

A reference to the subroutine that returns the code for each token match. The only parameter to the subroutine is the token string.

This is the default subroutine:

TokenFunc => sub { "return $_[0];\n" }

It is the responsibility of the supplier of this routine to make the code exit out of the generated code once a token is matched, otherwise the behaviour of the generated code is undefined.
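
A return statement is not the only way to leave the generated code. As a sketch, assume a hypothetical TokenFunc of sub { "tok = ID_\U$_[0]\E; goto done;\n" } and the tokens bar and baz. The code below was adapted by hand from the SYNOPSIS output to show the resulting shape; the enum, the function name lookup, the done label and the hits counter are assumptions made for illustration.

```c
/* Hypothetical token codes matching the ID_ prefix used by the TokenFunc. */
enum tok_id { ID_NONE, ID_BAR, ID_BAZ };

/* Hypothetical wrapper; *hits counts the number of matched tokens. */
enum tok_id lookup(const char *tokstr, unsigned *hits)
{
  enum tok_id tok = ID_NONE;

  switch (tokstr[0])
  {
    case 'b':
      switch (tokstr[1])
      {
        case 'a':
          switch (tokstr[2])
          {
            case 'r':
              if (tokstr[3] == '\0')
              {                                     /* bar */
                tok = ID_BAR; goto done;
              }

              goto unknown;

            case 'z':
              if (tokstr[3] == '\0')
              {                                     /* baz */
                tok = ID_BAZ; goto done;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }

done:
  (*hits)++;          /* shared post-processing for every matched token */
  return tok;

unknown:
  return ID_NONE;
}
```

Jumping to a common label keeps shared post-processing in one place. Either way, every piece of token code leaves the switch statement, as required above.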

TokenString => STRING

Identifier of the C character array that contains the token string. The default is tokstr.

UnknownLabel => STRING

Label that should be jumped to via goto if there's no keyword matching the token. The default is unknown.

UnknownCode => STRING

Code that should be executed if there's no keyword matching the token. This is an alternative to UnknownLabel. If UnknownCode is present, it will override UnknownLabel.
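
As a sketch of the difference, assume UnknownCode is set to the string "return KW_NONE;" and that the supplied code simply takes the place of each goto statement. Under that assumption no label is needed after the switch statement. The code below is hand-written for the single token muppet and is not verbatim module output; the enum and the function name match_muppet are assumptions as well.

```c
/* Hypothetical token codes. */
enum kw { KW_NONE, KW_MUPPET };

/* Hypothetical wrapper; tokstr must point to a null-terminated string. */
enum kw match_muppet(const char *tokstr)
{
  switch (tokstr[0])
  {
    case 'm':
      if (tokstr[1] == 'u' &&
          tokstr[2] == 'p' &&
          tokstr[3] == 'p' &&
          tokstr[4] == 'e' &&
          tokstr[5] == 't' &&
          tokstr[6] == '\0')
      {                                             /* muppet */
        return KW_MUPPET;
      }

      return KW_NONE;   /* assumed to replace "goto unknown;" */

    default:
      return KW_NONE;   /* assumed to replace "goto unknown;" */
  }
}
```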

add_tokens

You can add tokens using the add_tokens method.

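The method takes either a list of token strings, or a reference to an array of token strings that can optionally be followed by a preprocessor directive string. If a directive is given, the code for these tokens is enclosed in a matching #if / #endif pair, as shown for the token foo in the example above.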

Calls to add_tokens can be chained together, as the method returns a reference to its calling object.

generate

The generate method will return a string with the tokenizer switch statement. If no tokens were added, it will return an empty string.

You can optionally pass an Indent option to the generate method to specify a string used for indenting the whole switch statement, e.g.:

print $t->generate(Indent => "\t");

This is completely independent from the Indent option passed to the constructor.

AUTHOR

Marcus Holland-Moritz <mhx@cpan.org>

BUGS

I hope none, since the code is pretty short. Perhaps lack of functionality ;-)

COPYRIGHT

Copyright (c) 2002-2008, Marcus Holland-Moritz. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.