NAME

Text::Tokenizer - Perl extension for tokenizing text (config) files

SYNOPSIS

  use Text::Tokenizer ':all';

  # open the file and add it to the tokenizer inputs
  open(F_CONFIG, "input.conf") || die("failed to open input.conf: $!");
  $tok_id	= tokenizer_new(F_CONFIG);
  tokenizer_options(TOK_OPT_NOUNESCAPE|TOK_OPT_PASSCOMMENT);

  while(1)
  {
	($string, $tok_type, $line, $err, $errline)	= tokenizer_scan();
	last if($tok_type == TOK_ERROR || $tok_type == TOK_EOF);

	if($tok_type == TOK_TEXT)	{ 	}
	elsif($tok_type == TOK_BLANK)	{ 	}
	elsif($tok_type == TOK_DQUOTE)	{ $string	= "\"$string\"";	}
	elsif($tok_type == TOK_SQUOTE)	{ $string	= "\'$string\'";	}
	elsif($tok_type == TOK_SIQUOTE)	{ $string	= "\`$string\'";	}
	elsif($tok_type == TOK_IQUOTE)	{ $string	= "\`$string\`";	}
	elsif($tok_type == TOK_EOL)	{ $string	= "\n";		}
	elsif($tok_type == TOK_COMMENT)	{	}
	elsif($tok_type == TOK_UNDEF)
		{ last;	}
	else	{ last;	};
	print $string;
  }
  tokenizer_delete($tok_id);


  A more complex example of using Text::Tokenizer can be found in passwd_exp, a tool for password
  expiration notification (http://devel.dob.sk/passwd_exp).

DESCRIPTION

Text::Tokenizer is a very fast lexical analyzer that splits input text from a file or buffer into basic tokens:

  • NORMAL TEXT

  • DOUBLE QUOTED "TEXT"

  • SINGLE QUOTED 'TEXT'

  • INVERSE QUOTED `TEXT`

  • SINGLE-INVERSE QUOTED `TEXT'

  • WHITESPACE TEXT

  • #COMMENTS

  • END OF LINE

  • END OF FILE

EXPORT

None by default. Import methods and constants selectively, or use ':all' to import all constants and methods.

CONSTANTS

TOKEN TYPES

Token types that the tokenizer returns.

TOK_UNDEF

Undefined token (tokenizer error)

TOK_TEXT

Normal_text

TOK_DQUOTE

"Double quoted text"

TOK_SQUOTE

'Single quoted text'

TOK_IQUOTE

`Inverse quoted text`

TOK_SIQUOTE

`Single-inverse quoted text'

TOK_BLANK

Whitespace text

TOK_COMMENT

#Comment

TOK_EOL

End of Line

TOK_EOF

End of File

TOK_ERROR

Error condition (see ERROR TYPES)

ERROR TYPES

Error codes that the tokenizer returns when an error occurs.

NOERR

No error

UNCLOSED_DQUOTE

Unclosed double quote found

UNCLOSED_SQUOTE

Unclosed single quote found

UNCLOSED_IQUOTE

Unclosed inverse quote found

NOCONTEXT

Failed to allocate tokenizer context (FATAL ERROR)
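
As a rough illustration of how these codes might be reported, the sketch below maps each documented error constant to a message. It assumes the error constants are exported by ':all' under the names listed above; describe_error() and the sample input are hypothetical and not part of Text::Tokenizer.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# Map each documented error code to a message. The trailing "()" keeps the
# fat comma from quoting the constant names as plain strings.
my %error_text = (
    NOERR()           => 'no error',
    UNCLOSED_DQUOTE() => 'unclosed double quote',
    UNCLOSED_SQUOTE() => 'unclosed single quote',
    UNCLOSED_IQUOTE() => 'unclosed inverse quote',
    NOCONTEXT()       => 'failed to allocate tokenizer context',
);

# describe_error() is a hypothetical helper, not part of Text::Tokenizer.
sub describe_error {
    my ($err, $errline) = @_;
    my $text = exists $error_text{$err}
        ? $error_text{$err}
        : "unknown error code $err";
    return "$text (starting at line $errline)";
}

# Sample input with a double quote that is never closed.
my $input  = qq{name = "never closed\n};
my $tok_id = tokenizer_new_strbuf($input, length $input);

while (1) {
    my ($string, $type, $line, $err, $errline) = tokenizer_scan();
    if ($type == TOK_ERROR) {
        warn describe_error($err, $errline), "\n";
        last;
    }
    last if $type == TOK_EOF || $type == TOK_UNDEF;
}
tokenizer_delete($tok_id);
```

Keeping the messages in one table means the scanning loop only has to check for TOK_ERROR and pass the code along.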

TOKENIZER OPTIONS

Options configurable for the tokenizer. They should be OR-ed together when passed to tokenizer_options.

TOK_OPT_DEFAULT

Default option set; equal to TOK_OPT_NOUNESCAPE.

TOK_OPT_NONE

Set no options. The tokenizer uses its default behaviour: it will not unescape anything and will not pass comments to you.

TOK_OPT_NOUNESCAPE

Disable unescaping of characters and lines.

TOK_OPT_SIQUOTE

Enable looking for `single-inverse quote' combination.

TOK_OPT_UNESCAPE

Unescape characters and lines.

TOK_OPT_UNESCAPE_CHARS

Unescape chars (inside of quotes only)

TOK_OPT_UNESCAPE_LINES

Unescape lines (inside of quotes only)

TOK_OPT_PASSCOMMENT

Enable comment passing to user routines.

TOK_OPT_UNESCAPE_NQ_LINES

Unescape lines (outside of quotes). An escaped end of line will not terminate value processing, so escaped multiline text is returned as a single-line string.
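
A minimal sketch of combining options, assuming the flags are independent bits that can be OR-ed as the SYNOPSIS does. The particular combination and the sample input are illustrative only; how many comment tokens are reported depends on the tokenizer, so no output is claimed here.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# Combine independent flags with bitwise OR, as in the SYNOPSIS.
my $wanted = TOK_OPT_UNESCAPE | TOK_OPT_PASSCOMMENT | TOK_OPT_SIQUOTE;

# Sample input containing a `single-inverse quoted' value and a comment.
my $input  = "path = `/tmp/x' # scratch file\n";
my $tok_id = tokenizer_new_strbuf($input, length $input);
tokenizer_options($wanted);

# Count the comment tokens that TOK_OPT_PASSCOMMENT lets through.
my $comments = 0;
while (1) {
    my ($string, $type) = tokenizer_scan();
    last if $type == TOK_EOF || $type == TOK_ERROR || $type == TOK_UNDEF;
    $comments++ if $type == TOK_COMMENT;
}
tokenizer_delete($tok_id);

print "comment tokens seen: $comments\n";
```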

METHODS

$options = tokenizer_options(OPTIONS)

Set tokenizer options.

$tok_id = tokenizer_new(FILE_HANDLE)

Create a new tokenizer instance (context) that reads from FILE_HANDLE. Returns the instance id, $tok_id.

$tok_id = tokenizer_new_strbuf(BUFFER, LENGTH)

Create a new tokenizer instance from the string BUFFER, which is LENGTH characters long. Returns the tokenizer instance id.

@tok = tokenizer_scan()

Scan the current tokenizer instance and return the next token found: @tok = ($string, $type, $line, $error, $error_line)

$string - found token string
$type - its type
$line - current line
$error - equals error code if error occurs
$error_line - line number where error begins (unclosed quote position)
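
The returned list can be consumed in a loop like the one sketched below, which collects the tokens of a small in-memory buffer. The sample input and the choice to skip TOK_BLANK and TOK_EOL tokens are assumptions made for illustration; exactly how the input is split into tokens is up to the tokenizer.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

my $input  = qq{user = "alice"\nshell = /bin/sh\n};
my $tok_id = tokenizer_new_strbuf($input, length $input);

my @tokens;
while (1) {
    my ($string, $type, $line, $error, $error_line) = tokenizer_scan();
    last if $type == TOK_EOF || $type == TOK_UNDEF;
    die "tokenizer error $error at line $error_line\n" if $type == TOK_ERROR;
    next if $type == TOK_BLANK || $type == TOK_EOL;    # skip layout tokens
    push @tokens, { string => $string, type => $type, line => $line };
}
tokenizer_delete($tok_id);

# Print one line per collected token: line number, numeric type, text.
printf "line %d: type %d: %s\n", @{$_}{qw(line type string)} for @tokens;
```

Storing each token as a small hash keeps the line number next to the text, which helps when reporting problems in a config file later on.
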

tokenizer_exists(TOK_ID)

Test if tokenizer instance exists.

tokenizer_switch(TOK_ID)

Switch to another tokenizer instance (like when you perform include statement).
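
An include-style switch might look like the following sketch. It uses two in-memory buffers in place of files and calls tokenizer_switch explicitly before every scan, because this document does not say whether creating an instance also makes it current. drain() and both inputs are hypothetical.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# drain() is a hypothetical helper: it scans the current instance until
# EOF (or an error) and returns the TOK_TEXT strings it saw.
sub drain {
    my @words;
    while (1) {
        my ($string, $type) = tokenizer_scan();
        last if $type == TOK_EOF || $type == TOK_ERROR || $type == TOK_UNDEF;
        push @words, $string if $type == TOK_TEXT;
    }
    return @words;
}

my $main_src = "alpha beta\n";
my $incl_src = "gamma delta\n";

my $main_id = tokenizer_new_strbuf($main_src, length $main_src);
tokenizer_switch($main_id);
my ($first) = tokenizer_scan();    # string of the first main-input token

# "Include": scan a second instance to its end, then come back.
my $incl_id = tokenizer_new_strbuf($incl_src, length $incl_src);
tokenizer_switch($incl_id);
my @included = drain();
tokenizer_switch($main_id);
tokenizer_delete($incl_id);

my @rest = drain();                # remainder of the main input
tokenizer_delete($main_id);

print join(' ', $first, @included, @rest), "\n";
```

Switching back to the main instance before deleting the included one avoids deleting the instance that is currently being scanned.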

tokenizer_delete(TOK_ID)

Delete a tokenizer instance. You have to do this on EOF to release the tokenizer's reference to the file or buffer.

tokenizer_flush(TOK_ID)

Flush a tokenizer instance. This function discards the instance buffer's contents, so the next time the scanner attempts to match a token from the buffer, it will have to fill it.

SEE ALSO

This tokenizer is based on code generated by flex - fast lexical analyzer generator (http://lex.sourceforge.net).

AUTHOR

Samuel Behan, (http://devel.dob.sk)

COPYRIGHT AND LICENSE

Copyright 2003-2011 by Samuel Behan

This library is free software; you can redistribute it and/or modify it under the terms of GNU/GPL v3.