NAME

Text::FromAny - a module to read pure text from a variety of formats

SYNOPSIS

my $tFromAny = Text::FromAny->new(file => '/some/text/file');
my $text = $tFromAny->text;

SUPPORTED FORMATS

Text::FromAny can currently read the following formats:

Portable Document format - PDF
Legacy/binary MSWord .doc
OpenDocument Text
Legacy OpenOffice.org writer
"Office Open XML" text
Rich text format - RTF
(X)HTML
Plaintext

ATTRIBUTES

Attributes can be supplied to the new constructor, or set afterwards by calling object->attribute(value). The "file" attribute is the exception: it MUST be supplied during construction and cannot be changed later.

file

The file to read. MUST be supplied at construction time (and cannot be changed later). It can be in any of the supported formats. If the format is unknown or unsupported, the object will still work, but ->text will return undef.

allowGuess

This is a boolean, defaulting to true. When true, Text::FromAny falls back to guessing the filetype from the file extension if it is unable to detect the filetype properly. Set this to false to disable the fallback.

The default for allowGuess is subject to change in later versions, so if you depend on it being either on or off, you are best off explicitly requesting that behaviour, rather than relying on the defaults.

allowExternal

This is a boolean, defaulting to false. When true, Text::FromAny falls back to calling the system pdftotext(1) to get the text if the Perl-based PDF reading method (CAM::PDF) fails. CAM::PDF reads most PDFs, but has trouble with a select few, and those can be handled by pdftotext(1) from the Poppler library.

The default for allowExternal is subject to change in later versions, so if you depend on it being either on or off, you are best off explicitly requesting that behaviour, rather than relying on the defaults.
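
Because both defaults may change, a caller that depends on a particular behaviour can pin the attributes explicitly. The following is a minimal sketch using only the constructor and accessor forms described above; the helper name strict_reader is hypothetical and not part of Text::FromAny, and the input path is taken from the command line.

```perl
use strict;
use warnings;
use Text::FromAny;

# Hypothetical helper: build a reader with both attributes pinned, so the
# result does not depend on whatever the current defaults happen to be.
sub strict_reader {
    my ($path) = @_;
    return Text::FromAny->new(
        file          => $path,
        allowGuess    => 0,    # never fall back to the file extension
        allowExternal => 0,    # never call the system pdftotext(1)
    );
}

# Usage: pass a document path on the command line.
if (@ARGV) {
    my $tFromAny = strict_reader($ARGV[0]);

    # Attributes other than "file" can also be changed after construction.
    $tFromAny->allowExternal(1);

    my $text = $tFromAny->text;
    print defined $text ? $text : "unknown or unsupported format\n";
}
```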

METHODS

text

Returns the text contained in the file, or undef if the file format is unknown or unsupported.

Normally Text::FromAny will only read the file once, and then cache the text. However, if you change the value of either the allowGuess or allowExternal attribute, Text::FromAny will re-read the file, as those can affect how a file is read.
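
Since text returns undef rather than dying when the format is unknown or unsupported, callers will usually want to check the result before using it. A minimal sketch, in which the helper name text_or_die is hypothetical and the path comes from the command line:

```perl
use strict;
use warnings;
use Text::FromAny;

# Hypothetical helper: return the extracted text, or die with a readable
# message when text returns undef (unknown or unsupported format).
sub text_or_die {
    my ($path) = @_;
    my $tFromAny = Text::FromAny->new(file => $path);
    my $text     = $tFromAny->text;
    die "$path: unknown or unsupported format\n" unless defined $text;
    return $text;
}

# Usage: pass a document path on the command line.
print text_or_die($ARGV[0]) if @ARGV;
```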

detectedType

Returns the detected filetype (or undef if unknown or unsupported). The filetype is returned as a string, and can be any of the following:

pdf  => PDF
odt  => OpenDocument text
sxw  => Legacy OpenOffice.org Writer
doc  => Legacy/binary MSWord
docx => "Office Open XML" text
rtf  => RTF
txt  => Cleartext
html => HTML (or XHTML)
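
One way to use the returned string is as a key into a lookup table. The sketch below maps each type string listed above to a display label; the labels and the helper name type_label are illustrative choices, not part of Text::FromAny. The module is loaded lazily so the helper itself can be used on its own.

```perl
use strict;
use warnings;

# Illustrative display labels, keyed by the strings detectedType can return.
my %TYPE_LABEL = (
    pdf  => 'PDF',
    odt  => 'OpenDocument text',
    sxw  => 'Legacy OpenOffice.org Writer',
    doc  => 'Legacy MS Word',
    docx => 'Office Open XML',
    rtf  => 'RTF',
    txt  => 'Plain text',
    html => 'HTML or XHTML',
);

# Hypothetical helper: turn a detectedType value into a label.
# undef means the type was unknown or unsupported.
sub type_label {
    my ($type) = @_;
    return 'unknown or unsupported' unless defined $type;
    return $TYPE_LABEL{$type} // "unrecognised type '$type'";
}

# Usage: pass a document path on the command line.
if (@ARGV) {
    require Text::FromAny;
    my $tFromAny = Text::FromAny->new(file => $ARGV[0]);
    print "$ARGV[0]: ", type_label($tFromAny->detectedType), "\n";
}
```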

BUGS AND LIMITATIONS

None known.

Please report any bugs or feature requests to http://github.com/portu/Text-FromAny/issues.

AUTHOR

Eskild Hustvedt, <zerodogg@cpan.org>

LICENSE AND COPYRIGHT

Copyright (C) 2010 by Eskild Hustvedt

This library is free software; you can redistribute it and/or modify it under the terms of either:

a) the GNU General Public License as published by the Free
Software Foundation; either version 3, or (at your option) any
later version, or
b) the "Artistic License" which comes with this Kit.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See either the GNU General Public License or the Artistic License for more details.

You should have received a copy of the Artistic License in the file named "COPYING.artistic". If not, I'll be glad to provide one.

You should also have received a copy of the GNU General Public License along with this library in the file named "COPYING.gpl". If not, see <http://www.gnu.org/licenses/>.