NAME
PDF::OCR2 - extract all text and all image ocr from pdf
SYNOPSIS
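use PDF::OCR2;
my $p = PDF::OCR2->new('./path/to/file.pdf');
my $text_all = $p->text; # scalar context: all pages, joined with \f
my @text_pages = $p->text; # list context: one element per page
my $page_object = $p->page(1); # page numbers start at 1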
DESCRIPTION
This is meant to replace PDF::OCR. The backend complexity of this process has been isolated in modules:
PDF::GetImages
PDF::Burst
Image::OCR::Tesseract
PDF::OCR2::Page - in this distro.
Why not just modify PDF::OCR? The code hierarchy, the interdependencies, and the interface all changed so much that a new module made more sense. PDF::OCR was OK, but it was messy, and after some discussion this design is a lot better.
METHODS
new()
Argument is path to pdf file. If there are errors opening the file, warns and returns undef.
See $PDF::OCR2::CHECK_PDF and $PDF::OCR2::REPAIR_XREF.
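Because new() returns undef instead of dying, the result should be tested before it is used. A minimal sketch, assuming PDF::OCR2 is installed; open_pdf is a hypothetical helper name and not part of the module:

```perl
use strict;
use warnings;

# open_pdf() is a hypothetical wrapper, not part of PDF::OCR2.
# It returns the PDF::OCR2 object, or undef if the file cannot be used.
sub open_pdf {
    my ($path) = @_;

    # Load the module at run time so a missing install is reported cleanly.
    unless ( eval { require PDF::OCR2; 1 } ) {
        warn "PDF::OCR2 is not available: $@";
        return;
    }

    # new() is documented to warn and return undef on errors; the eval
    # additionally traps anything fatal raised while opening the file.
    my $pdf = eval { PDF::OCR2->new($path) };
    warn "skipping $path: could not open it as a pdf\n" unless $pdf;
    return $pdf;
}

# Pass one or more pdf paths on the command line.
for my $path (@ARGV) {
    my $pdf = open_pdf($path) or next;
    print "opened $path\n";
}
```

Skipping a bad file with next keeps a batch run going when one document in the list is unreadable.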
text()
Takes no argument. In scalar context, returns text of all pages, joined with a pagebreak \f character. In list context, returns text of pages one per element.
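The scalar result can be turned back into pages by splitting on the form feed. A small sketch of that, using a stand-in string in place of a real $p->text call; split_pages is a hypothetical helper, and it assumes the text of a page never contains a form feed itself:

```perl
use strict;
use warnings;

# split_pages() is a hypothetical helper: it splits the scalar-context
# result of text() back into one string per page.
sub split_pages {
    my ($text_all) = @_;
    # Limit -1 keeps trailing empty fields, so a blank last page still counts.
    return split /\f/, $text_all, -1;
}

# Stand-in for "my $text_all = $p->text;" on a three-page document
# whose middle page has no text.
my $text_all = join "\f", 'first page', '', 'third page';

my @pages = split_pages($text_all);
for my $i (0 .. $#pages) {
    printf "page %d: %d characters\n", $i + 1, length($pages[$i]);
}
```

When the pages are needed separately anyway, calling text() in list context avoids the split altogether.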
page()
Argument is page number (starting with 1) or absolute path to temporary page file. Returns PDF::OCR2::Page object. Croaks if you ask for an invalid number.
pages_count()
Returns the number of pages, which is the number of temporary page files. Calls PDF::Burst to split the document.
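pages_count() and page() can be combined to walk a document page by page. The following is a sketch only: each_page is a hypothetical helper, and nothing is assumed about the page object beyond what page() is documented to return.

```perl
use strict;
use warnings;

# each_page() is a hypothetical helper: it calls $callback with the page
# number and the object returned by page() for every page in the document.
sub each_page {
    my ($pdf, $callback) = @_;
    my $count = $pdf->pages_count;
    for my $n (1 .. $count) {    # page numbers start at 1
        $callback->($n, $pdf->page($n));
    }
    return $count;
}

# Pass a pdf path on the command line to try it.
if (@ARGV) {
    require PDF::OCR2;
    my $pdf = PDF::OCR2->new($ARGV[0]) or die "cannot open $ARGV[0]\n";

    each_page($pdf, sub {
        my ($n, $page) = @_;
        print "page $n is a ", ref($page), " object\n";
    });

    # page() croaks on an invalid number, so trap it when the number
    # comes from user input.
    my $past_end = $pdf->pages_count + 1;
    my $page = eval { $pdf->page($past_end) };
    print "no page $past_end: $@" unless $page;
}
```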
CAVEATS
This only works on POSIX systems.
ERRORS
If the program dies
You call text() and you get a fatal error. Loading a 'corrupt' pdf with PDF::API2 can trigger an error such as this:
Malformed xref in PDF file at /usr/lib/perl5/site_perl/5.8.8/PDF/API2/Basic/PDF/File.pm line 1198.
This happens because all pdfs are equal, but some pdfs are more equal than others. There are many kinds of pdf document versions, and sometimes PDF::API2 deems a pdf to be corrupt.
You can "fix" this problem with pdftk:
    pdftk $in output $out
But replacing the original pdf with the rewritten one means modifying the original, which is sketchy.
If the xref table is bad, it may be better to run the operation on a repaired copy.
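Repairing to a copy by hand might look like the sketch below. It assumes pdftk is on the PATH and uses pdftk's output keyword to write a separate file; repaired_path and repair_to_copy are hypothetical helpers, and the "_repaired" suffix is only illustrative.

```perl
use strict;
use warnings;

# repaired_path() is a hypothetical helper: it derives the name of the
# copy, placed next to the original. The "_repaired" suffix is illustrative.
sub repaired_path {
    my ($in) = @_;
    my $out = $in;
    $out =~ s/\.pdf\z/_repaired.pdf/i or $out .= '_repaired.pdf';
    return $out;
}

# Rewrite $in through pdftk into a separate file; the original is untouched.
# Returns the path of the copy, or undef if pdftk failed.
sub repair_to_copy {
    my ($in) = @_;
    my $out = repaired_path($in);
    # List form of system() avoids the shell, so odd file names are safe.
    system('pdftk', $in, 'output', $out) == 0 or return;
    return $out;
}

# Pass one or more pdf paths on the command line.
for my $in (@ARGV) {
    my $out = repair_to_copy($in);
    print defined($out) ? "$in -> $out\n" : "pdftk could not rewrite $in\n";
}
```

The copy can then be handed to PDF::OCR2->new() in place of the original.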
Try using another burst method
If PDF::API2 reports that the pdf is corrupt, likely via PDF::Burst, try this:
    use PDF::OCR2;
    $PDF::Burst::BURST_METHOD = 'CAM_PDF';
    # and then...
    my $pdf = PDF::OCR2->new('./pathtofile.pdf');
    print $pdf->text;
Enable checking the pdf
If you suspect the pdf is broken, or only want to run this on pdf docs that check ok with PDF::API2:
    use PDF::OCR2;
    $PDF::OCR2::CHECK_PDF = 1;
    # and then...
    my $pdf = PDF::OCR2->new('./pathtofile.pdf');
    print $pdf->text;
Checking is more expensive than opening the file directly. See $PDF::OCR2::CHECK_PDF below for the default setting.
$PDF::OCR2::CHECK_PDF
By default this flag is on. We check the pdf by loading it with PDF::API2 inside an eval, to make sure the pdf does not have errors. This takes a small toll on performance. I suggest leaving it on.
$PDF::OCR2::REPAIR_XREF
By default this flag is off. If the pdf checks bad, we attempt to repair the pdf *to a copy of the file*. The copy is put alongside the original and is named $filename_repaired_xref_table.pdf. Once it is created, we check again.
So, if both CHECK_PDF and REPAIR_XREF flags are on:
1. the pdf is checked for correctness
2. if the pdf is bad, we attempt to fix to a copy of the file
3. if we can't make a fixed copy, we don't die, but warn and return undef
Thus, if the pdf check fails and the repair xref flag is on, we do two evals. It could be argued that this is expensive, and it is, but then OCR is expensive, period.
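Turning both flags on for a single call could be sketched as follows. text_or_undef is a hypothetical wrapper and not part of the module; the extra evals are a precaution of this sketch, not something the module requires.

```perl
use strict;
use warnings;

# text_or_undef() is a hypothetical wrapper, not part of PDF::OCR2.
# It returns the document text, or undef if the pdf could not be used.
sub text_or_undef {
    my ($path) = @_;
    return unless eval { require PDF::OCR2; 1 };

    # local restores the previous flag values when this sub returns.
    no warnings 'once';
    local $PDF::OCR2::CHECK_PDF   = 1;    # step 1: check the pdf
    local $PDF::OCR2::REPAIR_XREF = 1;    # step 2: on failure, repair to a copy

    # Step 3: on failure new() is documented to warn and return undef;
    # the eval also traps anything fatal.
    my $pdf = eval { PDF::OCR2->new($path) } or return;

    my $text = eval { $pdf->text };       # scalar context: pages joined by \f
    return $text;
}

# Pass one or more pdf paths on the command line.
for my $path (@ARGV) {
    my $text = text_or_undef($path);
    print defined($text) ? $text : "no text extracted from $path\n";
}
```

Using local keeps the flag changes scoped to this one call, so other code in the same process still sees the defaults.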
CRIT AND SUGGESTIONS
The AUTHOR is open to any suggestions and requests.
SEE ALSO
PDF::Burst - Split a pdf into pages.
PDF::GetImages - Split a page into images.
PDF::OCR2::Page - Part of this distro.
pdfcheck - A program included in this distro that may be of use. It helps to check a pdf for problems and stats. Very alpha, useful though.
REPLACES
PDF::OCR - deprecated by this module.
AUTHOR
Leo Charre leocharre at cpan dot org
THANKS
These people have made useful inquiries, requests, critiques, code suggestions. Ultimately, they help develop this work.
COPYRIGHT
Copyright (c) 2009 Leo Charre. All rights reserved.
LICENSE
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".
DISCLAIMER
This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the "GNU General Public License" for more details.