NAME
RTF::Tokenizer - Tokenize RTF
VERSION
version 1.20
DESCRIPTION
Tokenizes RTF
SYNOPSIS
use RTF::Tokenizer;
# Create a tokenizer object
my $tokenizer = RTF::Tokenizer->new();
my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1}' );
my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1}', note_escapes => 1 );
my $tokenizer = RTF::Tokenizer->new( file => \*STDIN );
my $tokenizer = RTF::Tokenizer->new( file => 'lala.rtf' );
my $tokenizer = RTF::Tokenizer->new( file => 'lala.rtf', sloppy => 1 );
# Populate it from a file
$tokenizer->read_file('filename.txt');
# Or a file handle
$tokenizer->read_file( \*STDIN );
# Or a string
$tokenizer->read_string( '{\*\some rtf}' );
# Get the first token
my ( $token_type, $argument, $parameter ) = $tokenizer->get_token();
# Ooops, that was wrong...
$tokenizer->put_token( 'control', 'b', 1 );
# Let's have the lot...
my @tokens = $tokenizer->get_all_tokens();
INTRODUCTION
This documentation assumes some basic knowledge of RTF. If you lack that, go read The_RTF_Cookbook:
http://search.cpan.org/search?dist=RTF-Writer
METHODS
new()
Instantiates an RTF::Tokenizer object.
Named parameters:
file - calls the read_file method with the value provided, after instantiation
string - calls the read_string method with the value provided, after instantiation
note_escapes - boolean - whether to give RTF escapes a token type of escape (true) or control (false, default)
sloppy - boolean - whether to allow some illegal but common RTF sequences found 'in the wild'. As of 1.08, this only allows a control word with a numeric argument to be followed immediately by text with no delimiter, like:
\control1Plaintext
but this may change in future releases. Defaults to false.
preserve_whitespace - boolean - the RTF specification tells you to strip whitespace that comes after control words, and newlines at the beginning and end of text areas. One result is that you can't actually round-trip the output of the tokenization process. Turning this option on is probably a bad idea, but someone cared enough to send me a patch for it, so why not. Defaults to false, and you should leave it that way.
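As a rough illustration of note_escapes, the sketch below tokenizes the same string twice and records the token type reported for a hex escape. It assumes that 'escape' here refers to the \'hh form; the sample string and the %seen hash are made up for the example.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

# Illustrative input containing one hex escape (\'e9).
my $rtf = q({\rtf1 caf\'e9});

# Token type seen for the escape, keyed by the note_escapes setting.
my %seen;

for my $note ( 0, 1 ) {
    my $tokenizer = RTF::Tokenizer->new(
        string       => $rtf,
        note_escapes => $note,
    );

    while (1) {
        my ( $type, $argument, $parameter ) = $tokenizer->get_token();
        last if $type eq 'eof';

        # The escape is reported with an apostrophe as its 'argument'.
        next unless $argument eq "'";
        $seen{$note} = $type;
        print "note_escapes=$note: type=$type parameter=$parameter\n";
    }
}
```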
read_string( STRING )
Appends the string to the tokenizer object's buffer. (Earlier versions would overwrite the buffer; this version does not.)
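Because read_string appends rather than overwrites, input can be fed in pieces before tokenizing. A minimal sketch, with a made-up string split at an arbitrary point:

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new();

# Two calls; the second is appended to the buffer left by the first.
$tokenizer->read_string('{\rtf1 Hel');
$tokenizer->read_string('lo}');

# Collect the text tokens to see how the pieces were joined.
my @text;
while (1) {
    my ( $type, $argument ) = $tokenizer->get_token();
    last if $type eq 'eof';
    push @text, $argument if $type eq 'text';
}
print "Text tokens: @text\n";
```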
read_file( \*FILEHANDLE )
read_file( $IO_File_object )
read_file( 'filename' )
Appends a chunk of data from the filehandle to the buffer and remembers the filehandle, so that if you ask for a token and the buffer is empty, it will try to read the next line from the file. (Earlier versions would overwrite the buffer; this version does not.)
This chunk is 512 characters by default, plus whatever is left until the next occurrence of the input record separator (IRS; a newline character in this case). If, for whatever reason, you want to change that number, use initial_read.
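A small sketch of reading from a named file. The temporary file and its contents are invented for the example; File::Temp is used only so that the snippet does not depend on an existing file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use RTF::Tokenizer;

# Write a tiny RTF document to a temporary file (illustrative content).
my ( $fh, $filename ) = tempfile( SUFFIX => '.rtf', UNLINK => 1 );
print {$fh} "{\\rtf1 Hello}\n";
close $fh or die "Could not close $filename: $!";

# Hand the filename to the tokenizer; further data is pulled from the
# file as the buffer runs dry.
my $tokenizer = RTF::Tokenizer->new();
$tokenizer->read_file($filename);

my @text;
while (1) {
    my ( $type, $argument ) = $tokenizer->get_token();
    last if $type eq 'eof';
    push @text, $argument if $type eq 'text';
}
print "Text found: @text\n";
```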
get_token()
Returns the next token as a three-item list: 'type', 'argument', 'parameter'. The type is one of: text, control, group, escape or eof.
If you turned on preserve_whitespace, then you may get a fourth item for control tokens.
text - 'type' is set to 'text'. 'argument' is set to the text itself. 'parameter' is left blank. NOTE: \{, \} and \\ are all returned as control words, rather than rendered as text for you, as are \_, \- and friends.
control - 'type' is 'control'. 'argument' is the control word or control symbol. 'parameter' is the control word's parameter if it has one. This will be numeric, EXCEPT when 'argument' is a literal apostrophe ('), in which case it will be a two-character hex string. If you turned on preserve_whitespace, you'll get a fourth item, which will be the whitespace or a defined empty string.
group - 'type' is 'group'. If it's the beginning of an RTF group, 'argument' is 1; if it's the end, 'argument' is 0. 'parameter' is not set.
eof - End of file reached. 'type' is 'eof'. 'argument' is 1. 'parameter' is 0.
escape - If you specifically turn on this functionality (see note_escapes), you'll get an escape type, which is identical to control, except that it is only returned for escapes.
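The token types above lend themselves to a simple dispatch loop. The following is a sketch only: the input string is made up, and tracking group depth in $depth is one possible use of the group tokens, not something the module does for you.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer =
    RTF::Tokenizer->new( string => '{\rtf1 Hello \b world\b0 !}' );

# Current nesting level, updated from the group tokens.
my $depth = 0;

while (1) {
    my ( $type, $argument, $parameter ) = $tokenizer->get_token();
    last if $type eq 'eof';

    if ( $type eq 'group' ) {
        # 'argument' is 1 for an opening brace, 0 for a closing one.
        $depth += $argument ? 1 : -1;
    }
    elsif ( $type eq 'control' ) {
        print "control '$argument' with parameter '$parameter'\n";
    }
    elsif ( $type eq 'text' ) {
        print "text at depth $depth: $argument\n";
    }
}
```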
get_all_tokens
As per get_token, but keeps calling get_token until it hits EOF. Returns a list of arrayrefs.
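A short sketch of walking the returned list, assuming each arrayref holds the same items that get_token returns (type, argument, parameter). The input string is illustrative.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1 Hello}' );

# Each element is an arrayref for one token.
my @tokens = $tokenizer->get_all_tokens();

for my $token (@tokens) {
    my ( $type, $argument ) = @$token;
    printf "%-8s %s\n", $type, $argument;
}

# Pick out just the text, for instance.
my @text = map { $_->[1] } grep { $_->[0] eq 'text' } @tokens;
```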
put_token( type, token, argument )
Adds an item to the token cache, so that the next time you call get_token, the arguments you passed here will be returned. We don't check any of the values, so use this carefully. This is on a first in last out basis.
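The first in, last out behaviour can be used to peek at a token and push it back. A minimal sketch; the 'b' and 'i' tokens are arbitrary values chosen for the example.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1}' );

# Peek: take the next token, then put it back for later callers.
my @peeked = $tokenizer->get_token();
$tokenizer->put_token(@peeked);

# Push two more tokens; the one pushed last should come back first.
$tokenizer->put_token( 'control', 'b', 1 );
$tokenizer->put_token( 'control', 'i', 1 );

my @first  = $tokenizer->get_token();
my @second = $tokenizer->get_token();
my @third  = $tokenizer->get_token();

print "Order returned: $first[1], $second[1], then the peeked $third[0] token\n";
```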
sloppy( [bool] )
Decides whether we allow some types of broken RTF. See the docs for new() for a little more explanation about this. Pass it 1 to turn it on, 0 to turn it off. This will always return undef.
initial_read( [number] )
Don't call this unless you actually have a good reason. When the Tokenizer reads from a file, it first attempts to work out what the correct input record separator should be, by reading some characters from the file handle. This value starts off as 512, which is twice the number of characters that version 1.7 of the RTF specification says you should go before including a line feed if you're writing RTF.
Called with no argument, this returns the current value of the number of characters we're going to read. Called with a numeric argument, it sets the number of characters we'll read.
You really don't need to use this method.
debug( [number] )
Returns (non-destructively) the next 50 characters from the buffer, OR, the number of characters you specify. Printing these to STDERR, causing fatal errors, and the like, are left as an exercise to the programmer.
Note the part about 'from the buffer'. It really means that: if there's nothing in the buffer, but there is still data to be read from a file, that data won't be shown. Chances are, if you're using this function, you're debugging. There's an internal method called _get_line, which is called without arguments ($self->_get_line()); that's how we get more stuff into the buffer when we're reading from filehandles. There's no guarantee that it will stay, or will always work that way, but if you're debugging, that shouldn't matter.
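A sketch of using debug() while stepping through input. The string and the choice of 10 characters are arbitrary; sending the result to STDERR with warn is just one option.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1 Hello}' );

# Consume one token, then look at what is waiting in the buffer.
my ($type) = $tokenizer->get_token();

# Non-destructive: the buffer is left untouched by this call.
my $upcoming = $tokenizer->debug(10);
warn "After a '$type' token, the buffer starts with: $upcoming\n";

# The following get_token() call still sees the same data.
my @next = $tokenizer->get_token();
```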
NOTES
To avoid intrusively deep parsing, if an alternative ASCII representation is available for a Unicode entity, and that ASCII representation contains { or \ by themselves, things will go funky. But I'm not convinced either of those is allowed by the spec.
AUTHOR
Pete Sergeant -- pete@clueball.com
LICENSE
Copyright Pete Sergeant.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.