NAME
RTF::Parser - An event-driven RTF Parser
VERSION
version 1.11
DESCRIPTION
An event-driven RTF Parser
PUBLIC SERVICE ANNOUNCEMENT
DO NOT USE THIS MODULE UNLESS YOU HAVE NO ALTERNATIVE. Need rtf2*? Google for pandoc.
A very short history lesson...
1.07 of this module was released in 1999 by the original author, Philippe Verdret. I took over the module around 2004 with high intentions. I added almost all of the POD, all of the tests, and most of the comments, and rejigged the whole thing to use RTF::Tokenizer for tokenizing the incoming RTF, which fixed a whole class of problems.
The big problem is really that the whole module is an API which happens to have rtf2html and rtf2text stuck on top of it. Any serious changes involve breaking the API, and that seems a greater sin than telling people to go and get themselves a better RTF converter suite.
I had high hopes of overhauling the whole thing, but it didn't happen. I handed over maintainership some years later, but no new version was forthcoming, and the module has languished since then. There are many open bugs on rt.cpan.org and in the reviews.
In a moment of weakness, I've picked up the module again with the aim of adding this message, fixing one or two very minor bugs, and putting a version that doesn't have UNAUTHORIZED RELEASE in big red letters on the CPAN.
I doubt I'll ever tackle the bigger bugs (Unicode support), but I will accept patches I can understand.
IMPORTANT HINTS
RTF parsing is non-trivial. The inner workings of these modules are somewhat scary. You should go and read the 'Introduction' document included with this distribution before going any further - it explains how this distribution fits together, and is vital reading.
If you just want to convert RTF to HTML or text, from inside your own script, jump straight to the docs for RTF::HTML::Converter or RTF::TEXT::Converter respectively.
SUBCLASSING RTF::PARSER
When you subclass RTF::Parser, you'll want to do two things. You'll firstly want to override the methods described below as the API. These define what we do when we have tokens that aren't control words (except 'symbols' - see below).
Then you'll want to create a hash that maps control words to the code references you want executed. Each one gets passed the RTF::Parser object, the name of the control word (say, 'b'), any argument passed with the control word, and then 'start'.
An example...
The following code removes bold tags from RTF documents, and then spits back out RTF.
    {
        # Create our subclass
        package UnboldRTF;

        # We'll be doing lots of printing without newlines, so don't buffer output
        $|++;

        # Subclassing magic...
        use RTF::Parser;
        @UnboldRTF::ISA = ( 'RTF::Parser' );

        # Redefine the API nicely
        sub parse_start { print STDERR "Starting...\n"; }
        sub group_start { print '{' }
        sub group_end   { print '}' }
        sub text        { print "\n" . $_[1] }
        sub char        { print "\\\'$_[1]" }
        sub symbol      { print "\\$_[1]" }
        sub parse_end   { print STDERR "All done...\n"; }
    }

    my %do_on_control = (

        # What to do when we see any control we don't have
        # a specific action for... In this case, we print it.
        '__DEFAULT__' => sub {
            my ( $self, $type, $arg ) = @_;
            $arg = "\n" unless defined $arg;
            print "\\$type$arg";
        },

        # When we come across a bold tag, we just ignore it.
        'b' => sub {},
    );

    # Grab STDIN...
    my $data = join '', (<>);

    # Create an instance of the class we created above
    my $parser = UnboldRTF->new();

    # Prime the object with our control handlers...
    $parser->control_definition( \%do_on_control );

    # Don't skip undefined destinations...
    $parser->dont_skip_destinations(1);

    # Start the parsing!
    $parser->parse_string( $data );
METHODS
new
Creates a new RTF::Parser object. Doesn't accept any arguments.
parse_stream( \*FH )
This function used to accept a second parameter - a function specifying how the filehandle should be read. This is deprecated, because I could find no examples of people using it, nor could I see why people might want to use it.
Pass this function a reference to a filehandle (or, now, a filename! yay) to begin reading and processing.
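As a rough sketch of the filehandle form, the following reads an RTF file named on the command line and prints its plain text. The subclass name TextDump and the empty '__DEFAULT__' handler are illustrative assumptions, not part of the module, and the sketch assumes the base class supplies do-nothing versions of the event methods it does not override.

```perl
use strict;
use warnings;

{
    package TextDump;    # hypothetical subclass name
    use RTF::Parser;
    our @ISA = ('RTF::Parser');

    # Print plain text as it arrives; $_[0] is the parser, $_[1] the text.
    sub text { print $_[1] }
}

my $file = shift @ARGV or die "usage: $0 file.rtf\n";
open my $fh, '<', $file or die "Cannot open $file: $!";

my $parser = TextDump->new();

# Assumption: silently ignore every control word.
$parser->control_definition( { '__DEFAULT__' => sub { } } );

# Pass a reference to the filehandle...
$parser->parse_stream($fh);
close $fh;

# ...or, per the note above, pass the filename itself instead:
# $parser->parse_stream($file);
```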
parse_string( $string )
Pass this function a string to begin reading and processing.
control_definition
The code that's executed when we trigger a control event is kept in a hash. We're holding this somewhere in our object. Earlier versions would make the assumption we're being subclassed by RTF::Control, which isn't something I want to assume. If you are using RTF::Control, you don't need to worry about this, because we're grabbing %RTF::Control::do_on_control, and using that.
Otherwise, you pass this method a reference to a hash where the keys are control words, and the values are coderefs that you want executed. This sets all the callbacks... The arguments passed to your coderefs are: $self, the control word itself (like, say, 'par'), any parameter the control word had, and then 'start'.
If you don't pass it a reference, you get back the reference of the current control hash we're holding.
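A minimal sketch of both calling styles follows. The %seen counter and the handler bodies are made up for illustration, and using RTF::Parser directly (rather than a subclass) assumes its default event methods do nothing.

```perl
use strict;
use warnings;
use RTF::Parser;

# Hypothetical use case: count how often each control word appears.
my %seen;

my %handlers = (

    # Fallback for every control word without its own entry.
    '__DEFAULT__' => sub {
        my ( $self, $word, $arg, $event ) = @_;
        $seen{$word}++;
    },

    # A specific handler: also note paragraph breaks on STDERR.
    'par' => sub {
        my ( $self, $word, $arg, $event ) = @_;
        $seen{$word}++;
        print STDERR "paragraph break\n";
    },
);

my $parser = RTF::Parser->new();

# Setter form: hand over a reference to the handler hash.
$parser->control_definition( \%handlers );

# Getter form: with no argument we get the current hash reference back.
my $current = $parser->control_definition;
print STDERR "holding our handlers\n" if $current == \%handlers;

$parser->parse_string('{\rtf1\ansi Hello\par World\par}');

printf "%-6s %d\n", $_, $seen{$_} for sort keys %seen;
```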
rtf_control_emulation
If you pass it a boolean argument, it'll set whether or not it thinks RTF::Control has been loaded. If you don't pass it an argument, it'll return what it thinks...
dont_skip_destinations
The RTF spec says that we skip any destinations that we don't have an explicit handler for. You could well not want this. Accepts a boolean argument: true to process all destinations, false to skip the ones we don't understand.
API
These are some methods that you're going to want to override if you subclass this module. In general though, people seem to want to subclass RTF::Control, which subclasses this module.
parse_start
Called before we start parsing...
parse_end
Called when we're finished parsing.
group_start
Called when we encounter an opening {.
group_end
Called when we encounter a closing }.
text
Called when we encounter plain text. It is given the text as its first argument.
char
Called when we encounter a hex-escaped character. The hex characters are passed as the first argument.
symbol
Called when we come across a control character. This is interesting, because, I'd have treated these as control words, so, I'm using Philippe's list as control words that'll trigger this for you. These are -_~:|{}*'\. This needs to be tested.
bitmap
Called when we come across a command that's talking about a linked bitmap file. You're given the file name.
binary
Called when we have binary data. You get passed it.
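To show how a few of these methods fit together, here is a small sketch of a subclass that gathers text into a buffer and prints it when parsing ends. The package name TextCollector, the $buffer variable, and the choice to decode hex escapes as single characters are all assumptions made for the example. As in the earlier example, each method receives the parser object first, so the "first argument" described above is $_[1].

```perl
use strict;
use warnings;

{
    package TextCollector;    # hypothetical subclass name
    use RTF::Parser;
    our @ISA = ('RTF::Parser');

    # Kept outside the object so we don't touch the parser's internals.
    my $buffer = '';

    # Start each parse with an empty buffer.
    sub parse_start { $buffer = '' }

    # Plain text: append it as-is.
    sub text { $buffer .= $_[1] }

    # Hex escape: we receive the hex digits (say, 'e9'), so decode them.
    sub char { $buffer .= chr hex $_[1] }

    # Emit everything we gathered once parsing is over.
    sub parse_end { print $buffer, "\n" }
}

my $parser = TextCollector->new();

# Assumption: ignore every control word.
$parser->control_definition( { '__DEFAULT__' => sub { } } );

$parser->parse_string( q({\rtf1\ansi Caf\'e9 au lait}) );
```

Methods left out here (group_start, group_end, symbol, bitmap, binary) are assumed to fall back to do-nothing defaults in RTF::Parser.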
AUTHOR
Peter Sergeant pete@clueball.com, originally by Philippe Verdret
COPYRIGHT
Copyright 2004 Pete Sergeant.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
CREDITS
This work was carried out under a grant generously provided by The Perl Foundation - give them money!