NAME

PDF::OCR2 - extract all text and all image ocr from pdf

SYNOPSIS

use PDF::OCR2;

my $p           = PDF::OCR2->new('./path/to/file.pdf');

my $text_all    = $p->text;
my @text_pages  = $p->text;

my $page_object = $p->page(1);

DESCRIPTION

This is meant to replace PDF::OCR. The backend complexity of this process has been isolated in modules:

PDF::GetImages
PDF::Burst
Image::OCR::Tesseract
PDF::OCR2::Pages - in this distro.

Why not just modify PDF::OCR?? This is such a massive breakdown of code hierachy and interdependency, and such a different interface, that this made more sense. PDF::OCR was ok. But it was messy and really, after some discussion- this is a lot better.

METHODS

new()

Argument is path to pdf file. If there are errors opening the file, warns and returns undef.

See $PDF::OCR2::CHECK_PDF and $PDF::OCR2::REPAIR_XREF.

text()

Takes no argument. In scalar context, returns text of all pages, joined with a pagebreak \f character. In list context, returns text of pages one per element.

page()

Argument is page number (starting with 1) or abs path to temporary page file. Returns PDF::OCR2::Page object. Croaks if you ask for an invalid number.

pages_count()

Returns number of pages. Number of temporary files. Calls PDF::Burst.

CAVEATS

This only works on posix.

ERRORS

If Program dies

You call text() and you get a fatal. Loading a 'corrupt' pdf with PDF::API2 can trigger an error such as this;

Malformed xref in PDF file  at /usr/lib/perl5/site_perl/5.8.8/PDF/API2/Basic/PDF/File.pm line 1198.

This happens because.. All pdfs are equal, but some pdfs are more equal than others. There's fifty kinds of pdf doc versions, etc. Sometimes the pdf is deemed to be corrupt by PDF::API2.

You can "fix" this problem with pdftk..

pdftk $in $out

But, this means modifying the original pdf, which is sketchy.

Maybe if the xref table is bad, we should run the operation on a repaired copy!

Try using another burst method

If you have errors with PDF::API2 saying the pdf is corrupt, likely via PDF::Burst.. Then try this:

use PDF::OCR2;

PDF::Burst::BURST_METHOD = 'CAM_PDF';

# and then...
my $pdf = PDF::OCR2->new('./pathtofile.pdf');
print $pdf->text;
Enable checking the pdf.

If you suspect the pdf is broken, or only want to run this on pdf docs that check ok with PDF::API2

use PDF::OCR2;
$PDF::OCR2::CHECK_PDF = 1;

# and then...
my $pdf = PDF::OCR2->new('./pathtofile.pdf');
print $pdf->text;

This is not enabled by default because it is more expensive. Maybe it should be.

$PDF::OCR2::CHECK_PDF

By default this flag is on. We check the pdf with an eval to PDF::API2 to make sure the pdf does not have errors. This takes a small toll on performance. I suggest to leave it on.

$PDF::OCR2::REPAIR_XREF

By default this flag is off. If the pdf checks bad, we attempt to repair the pdf *to a copy of the file*- this file is put alongside the original and is named $filename_repaired_xref_table.pdf, once this is created, we check again.

So, if both CHECK_PDF and REPAIR_XREF flags are on;
   1. the pdf is checked for correctness
   2. if the pdf is bad, we attempt to fix to a copy of the file
   3. if we can't make a fixed copy, we don't die, but warn and return undef

Thus, if check pdf fails, and repair xref flag is on, we are doing two evals, it could be argued this is expensive, and it is- but then- ocr is expensive, period.

CRIT AND SUGGESTIONS

The AUTHOR is open to any suggestions and requests.

SEE ALSO

CAM::PDF

PDF::API2

PDF::Burst Split a pdf into pages.

PDF::GetImages Split a page into images.

PDF::OCR2::Page Part of this distro.

pdfcheck Included is a program that may be of use. It helps to check a pdf for problems, stats. Very alpha, useful though.

REPLACES

PDF::OCR - deprecated by this module.

AUTHOR

Leo Charre leocharre at cpan dot org

THANKS

These people have made useful inquiries, requests, critiques, code suggestions. Ultimately, they help develop this work.

Long Nguyen
Robert Waters

COPYRIGHT

Copyright (c) 2009 Leo Charre. All rights reserved.

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the "GNU General Public License" for more details.