NAME
Text::Tokenizer - Perl extension for tokenizing text (config) files
SYNOPSIS
use Text::Tokenizer ':all';
# Open the file and add it to the tokenizer inputs
open(F_CONFIG, "input.conf") || die("failed to open input.conf");
$tok_id = tokenizer_new(F_CONFIG);
tokenizer_options(TOK_OPT_NOUNESCAPE|TOK_OPT_PASSCOMMENT);
while(1)
{
    ($string, $tok_type, $line, $err, $errline) = tokenizer_scan();
    last if($tok_type == TOK_ERROR || $tok_type == TOK_EOF);
    # Put back the quotes that the tokenizer strips from quoted tokens
    if($tok_type == TOK_TEXT) { }
    elsif($tok_type == TOK_BLANK) { }
    elsif($tok_type == TOK_DQUOTE) { $string = "\"$string\""; }
    elsif($tok_type == TOK_SQUOTE) { $string = "'$string'"; }
    elsif($tok_type == TOK_SIQUOTE) { $string = "`$string'"; }
    elsif($tok_type == TOK_IQUOTE) { $string = "`$string`"; }
    elsif($tok_type == TOK_EOL) { $string = "\n"; }
    elsif($tok_type == TOK_COMMENT) { }
    elsif($tok_type == TOK_UNDEF)
    { last; }
    else { last; };
    print $string;
}
tokenizer_delete($tok_id);
A very complex example of using Text::Tokenizer can be found in passwd_exp, a tool for password expiration notification (http://freshmeat.net/projects/passwd_exp).
DESCRIPTION
Text::Tokenizer is a very fast lexical analyzer that can be used to split input text from a file or buffer into basic tokens:
NORMAL TEXT
DOUBLE QUOTED "TEXT"
SINGLE QUOTED 'TEXT'
INVERSE QUOTED `TEXT`
SINGLE-INVERSE QUOTED `TEXT'
WHITESPACE TEXT
#COMMENTS
END OF LINE
END OF FILE
EXPORT
None by default. You have to import methods and constants selectively, or use ':all' to import all of them.
CONSTANTS
TOKEN TYPES
Token types that the tokenizer returns.
- TOK_UNDEF
  Undefined token (tokenizer error)
- TOK_TEXT
  Normal_text
- TOK_DQUOTE
  "Double quoted text"
- TOK_SQUOTE
  'Single quoted text'
- TOK_IQUOTE
  `Inverse quoted text`
- TOK_SIQUOTE
  `Single-inverse quoted text'
- TOK_BLANK
  Whitespace text
- TOK_COMMENT
  #Comment
- TOK_EOL
  End of Line
- TOK_EOF
  End of File
- TOK_ERROR
  Error condition (see ERROR TYPES)
ERROR TYPES
Error codes that the tokenizer returns when an error occurs.
- NOERR
  No error
- UNCLOSED_DQUOTE
  Unclosed double quote found
- UNCLOSED_SQUOTE
  Unclosed single quote found
- UNCLOSED_IQUOTE
  Unclosed inverse quote found
- NOCONTEXT
  Failed to allocate tokenizer context (FATAL ERROR)
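The error codes above arrive as the fourth element of the list returned by tokenizer_scan. The following is a minimal sketch of turning them into messages, not the module's own error handling. The describe_error helper and the sample buffer are hypothetical, and the sketch assumes that the error constants compare as plain numbers and that a newly created instance becomes the current one, as the SYNOPSIS suggests.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# Hypothetical helper: map an error code to a readable message.
# Assumes the error constants are plain numeric values.
sub describe_error {
    my ($err) = @_;
    my %message = (
        UNCLOSED_DQUOTE() => 'unclosed double quote',
        UNCLOSED_SQUOTE() => 'unclosed single quote',
        UNCLOSED_IQUOTE() => 'unclosed inverse quote',
        NOCONTEXT()       => 'failed to allocate tokenizer context',
    );
    return $message{$err} // "unknown error code $err";
}

# Made-up input with a double quote that is never closed.
my $text   = qq{name = "unterminated\n};
my $tok_id = tokenizer_new_strbuf($text, length($text));

while (1) {
    my ($string, $type, $line, $err, $errline) = tokenizer_scan();
    if ($type == TOK_ERROR) {
        # $errline is the error line reported by the tokenizer.
        warn sprintf("line %d: %s\n", $errline, describe_error($err));
        last;
    }
    last if $type == TOK_EOF;
    last if $type == TOK_UNDEF;
}
tokenizer_delete($tok_id);
```

Keeping the mapping in one helper means the scan loop only has to test for TOK_ERROR; which code a given input actually produces depends on the tokenizer.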
TOKENIZER OPTIONS
Options configurable for the tokenizer. They should be OR-ed together when passed to tokenizer_options.
- TOK_OPT_DEFAULT
  Default option set; equal to TOK_OPT_NOUNESCAPE.
- TOK_OPT_NONE
  Set no options. The tokenizer keeps its default behaviour: it does not unescape anything and does not pass comments to you.
- TOK_OPT_NOUNESCAPE
  Disable unescaping of characters and lines.
- TOK_OPT_SIQUOTE
  Enable recognition of the `single-inverse quote' combination.
- TOK_OPT_UNESCAPE
  Unescape characters and lines.
- TOK_OPT_UNESCAPE_CHARS
  Unescape characters (inside quotes only).
- TOK_OPT_UNESCAPE_LINES
  Unescape lines (inside quotes only).
- TOK_OPT_PASSCOMMENT
  Pass comments to user routines.
- TOK_OPT_UNESCAPE_NQ_LINES
  Unescape lines (outside quotes). An escaped end of line does not terminate value processing, so escaped multiline text is returned as a single-line string.
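As an illustration of OR-ing the options above, here is a small sketch that asks for unescaping and comment passing, then collects the comment tokens from a string buffer. The sample text and the @comments array are made up for this example, and calling tokenizer_options after creating the instance simply follows the order used in the SYNOPSIS; whether options apply per instance or globally is not stated here.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# Made-up input: one comment line and one quoted value with an escape.
my $text = qq{# greeting\nmsg = "hello\\tworld"\n};

my $tok_id = tokenizer_new_strbuf($text, length($text));

# OR the flags together: unescape chars and lines, pass comments through.
tokenizer_options(TOK_OPT_UNESCAPE | TOK_OPT_PASSCOMMENT);

my @comments;
while (1) {
    my ($string, $type) = tokenizer_scan();
    last if $type == TOK_EOF;
    last if $type == TOK_ERROR;
    last if $type == TOK_UNDEF;
    # Without TOK_OPT_PASSCOMMENT no TOK_COMMENT token should reach us.
    push @comments, $string if $type == TOK_COMMENT;
}
tokenizer_delete($tok_id);

print "comment: $_\n" for @comments;
```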
METHODS
- $options = tokenizer_options(OPTIONS)
  Set tokenizer options.
- $tok_id = tokenizer_new(FILE_HANDLE)
  Create a new tokenizer instance (context) that reads from FILE_HANDLE. Returns the instance id ($tok_id).
- $tok_id = tokenizer_new_strbuf(BUFFER, LENGTH)
  Create a new tokenizer instance from the string BUFFER, which is LENGTH characters long. Returns the instance id.
- @tok = tokenizer_scan()
  Scan the current tokenizer instance and return the first token found. @tok = ($string, $type, $line, $error, $error_line)
- tokenizer_exists(TOK_ID)
  Test whether a tokenizer instance exists.
- tokenizer_switch(TOK_ID)
  Switch to another tokenizer instance (for example, when you process an include statement).
- tokenizer_delete(TOK_ID)
  Delete a tokenizer instance. You have to do this on EOF to release the connection to the file or buffer.
- tokenizer_flush(TOK_ID)
  Flush a tokenizer instance. This function discards the contents of the instance buffer, so the next time the scanner attempts to match a token from the buffer, it will have to fill it first.
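The methods above can be combined to handle an include statement: pause the outer instance, scan the included input, then switch back. The sketch below uses two in-memory buffers instead of files. The collect_text helper, its $limit parameter, and the sample strings are hypothetical, and the explicit tokenizer_switch calls are there because this text does not say whether creating an instance also makes it current.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# Two made-up inputs; the outer one is interrupted to scan the inner
# one, the way an include statement would.
my $outer = "first\nsecond\n";
my $inner = "included\n";

# Hypothetical helper: collect TOK_TEXT tokens from the current
# instance until EOF, or until $limit text tokens have been read.
sub collect_text {
    my ($limit) = @_;
    my @words;
    while (1) {
        my ($string, $type) = tokenizer_scan();
        last if $type == TOK_EOF;
        last if $type == TOK_ERROR;
        last if $type == TOK_UNDEF;
        next if $type != TOK_TEXT;
        push @words, $string;
        last if defined $limit && @words >= $limit;
    }
    return @words;
}

my $outer_id = tokenizer_new_strbuf($outer, length($outer));
tokenizer_switch($outer_id);      # explicit, in case new does not switch
my @words = collect_text(1);      # read one text token, then pause

my $inner_id = tokenizer_new_strbuf($inner, length($inner));
tokenizer_switch($inner_id);
push @words, collect_text();      # drain the included buffer

tokenizer_switch($outer_id);      # resume the outer instance first
tokenizer_delete($inner_id);      # then release the finished one
push @words, collect_text();
tokenizer_delete($outer_id);

print join(' ', @words), "\n";
```

Switching back before deleting the inner instance avoids deleting the instance that is currently selected, which is a cautious choice rather than a documented requirement.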
SEE ALSO
This tokenizer is based on code generated by flex, the fast lexical analyzer generator (http://lex.sourceforge.net).
AUTHOR
Samuel Behan, <_samkob_(a)_gmail_._com_>
COPYRIGHT AND LICENSE
Copyright 2003-2006 by Samuel Behan
This library is free software; you can redistribute it and/or modify it under the terms of the GNU GPL v2.