NAME

PDF::Extract - Extracting sub PDF documents from a multi page PDF document

SYNOPSIS

use PDF::Extract;
$pdf=new PDF::Extract;
$pdf->servePDFExtract( PDFDoc=>"c:/Docs/my.pdf", PDFPages=>"1-3 31-36" );

or

use PDF::Extract;
$pdf = new PDF::Extract( PDFDoc=>'C:/my.pdf' );
$pdf->getPDFExtract( PDFPages=>"@PDFPages" );
print "Content-Type: text/plain\n\n<xmp>",  $pdf->getVars("PDFExtract");
print $pdf->getVars("PDFError");

DESCRIPTION

PDF Extract is a group of methods that allow the user to quickly grab pages as a new PDF document from a pre-existing PDF document.

With PDF::Extract a new PDF document can be:
  • assigned to a scalar variable with getPDFExtract.

  • saved to disk with savePDFExtract.

  • printed to STDOUT as a PDF web document with servePDFExtract.

  • cached and served for a faster PDF web document service with fastServePDFExtract.

These four main methods can be called with or without arguments. The methods will not work unless they know the location of the original PDF document and the pages to extract. There are no default values.

There are four other methods that deal with setting and getting the public variables.

  • getPDFExtractVariables can return an array of variables.

  • getVars is an alias of getPDFExtractVariables.

  • setPDFExtractVariables can set the public variables.

  • setVars is an alias of setPDFExtractVariables.

METHODS

new PDF::Extract

Creates a new PDF::Extract object with empty state information, ready to process both input and output data. new can be called with a hash of arguments:

new PDF::Extract( PDFDoc=>"c:/Docs/my.pdf", PDFPages=>"1-3 31-36" )

This will cause a new PDF document to be generated unless there is an error. PDF::Extract->new() simply calls getPDFExtract() if it is given any arguments.

getPDFExtract

This method is the main workhorse of the package. It does all the PDF processing and sets PDFError if it is unable to create a new PDF document. It requires PDFDoc and PDFPages to be set, either in this call or beforehand. It returns the new PDF document as a string, or "0" if there is an error.

To create an array of PDF documents, each consisting of a single page, from a multi-page PDF document:

$pdf = new PDF::Extract( PDFDoc=>'C:/my.pdf' );
$i = 0;
# getPDFExtract returns "0" (false) for the first page number past the end
push @pages, $page while $page = $pdf->getPDFExtract( PDFPages=>++$i );

The lowest valid page number for PDFPages is 1. A value of 0 will produce no output and raise an error. An error will be raised if the PDFPages value does not correspond to any pages.
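As an illustration of the page list format used in the examples here ("1-3 31-36"), the following minimal sketch expands such a string into individual page numbers and rejects a page number of 0. The helper name expand_pages is hypothetical and is not part of PDF::Extract; the module's own parsing may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::Extract: expand a page
# specification such as "1-3 31-36" into a list of page numbers.
sub expand_pages {
    my ($spec) = @_;
    my @pages;
    for my $token ( split ' ', $spec ) {
        if ( $token =~ /^(\d+)-(\d+)$/ ) {
            # A range: both ends must be valid and in ascending order.
            die "Invalid range '$token'\n" if $1 < 1 or $2 < $1;
            push @pages, $1 .. $2;
        }
        elsif ( $token =~ /^\d+$/ ) {
            # A single page: the lowest valid page number is 1.
            die "Page numbers start at 1\n" if $token < 1;
            push @pages, $token;
        }
        else {
            die "Unrecognised page token '$token'\n";
        }
    }
    return @pages;
}

print join( ",", expand_pages("1-3 31-36") ), "\n";
```

Validating the page list before calling getPDFExtract is one way to report a bad request early, without first opening the original PDF document.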

savePDFExtract

This method saves its output to the directory defined for PDFCache. The new PDF's filename is built from the original filename, followed by the requested page numbers (individual pages separated by an underscore "_", ranges written with ".."), and the .pdf file type suffix.

$pdf->savePDFExtract(PDFPages=>"1 3-5", PDFDoc=>'C:/my.pdf', PDFCache=>"C:/myCache" );

If there is an error then an error page will be served and savePDFExtract will return a "0". Otherwise savePDFExtract will return "1" and the saved PDF location and file name will be "C:/myCache/my1_3..5.pdf".
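The naming rule can be sketched as follows. This is only an illustration that reproduces the "C:/myCache/my1_3..5.pdf" example; cache_file_name is a hypothetical helper and not part of PDF::Extract, whose internal handling of unusual page lists may differ.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper, not part of PDF::Extract: build a cache file
# name from the cache directory, the source path and the page list.
sub cache_file_name {
    my ( $cache, $doc, $pages ) = @_;
    my ($base) = fileparse( $doc, qr/\.pdf/i );    # "C:/my.pdf" -> "my"
    ( my $tag = $pages ) =~ s/\s+/_/g;             # individual pages: "_"
    $tag =~ s/-/../g;                              # ranges: ".."
    return "$cache/$base$tag.pdf";
}

print cache_file_name( "C:/myCache", "C:/my.pdf", "1 3-5" ), "\n";
```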

servePDFExtract

This method serves its output to STDOUT with the correct header for a PDF document served on the web.

$pdf = PDF::Extract->new(
           PDFDoc=>'C:/my.pdf', 
           PDFErrorPage=>"C:/myErrorPage.html" );
$pdf->servePDFExtract( PDFPages=>1);

If there is an error then an error page will be served and servePDFExtract will return "0". Otherwise servePDFExtract will return "1".

fastServePDFExtract

This method serves its output to STDOUT with the correct header for a PDF document served on the web.

This method checks to see if the PDF document requested is in the cache folder, as set with PDFCache. The file in the cache folder is served if it exists. Otherwise a new PDF document is created, cached and served. If there is an error then an error page will be served and fastServePDFExtract will return "0". fastServePDFExtract will return "1" on success.

$pdf->setVars(
           PDFDoc=>'C:/my.pdf', 
           PDFCache=>"C:/myCache", 
           PDFErrorPage=>"C:/myErrorPage.html",
           PDFPages=>1);
unless ($pdf->fastServePDFExtract ) {   
   # there was an error  
   $error=$pdf->getVars("PDFError") ;
}
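The check-the-cache-first behaviour described for fastServePDFExtract can be sketched in plain Perl. This is an illustration under stated assumptions, not the module's implementation: fetch_or_generate is a hypothetical helper, and the stub generator stands in for a real call such as getPDFExtract.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper, not part of PDF::Extract: return the cached
# file's contents if it exists; otherwise call $generate, cache the
# result and return it. Returns undef if anything fails.
sub fetch_or_generate {
    my ( $cache_file, $generate ) = @_;
    if ( -e $cache_file ) {
        open my $in, '<:raw', $cache_file or return;
        local $/;                       # slurp the whole file
        return scalar <$in>;
    }
    my $data = $generate->();
    return unless defined $data and length $data;
    open my $out, '>:raw', $cache_file or return;
    print {$out} $data;
    close $out or return;
    return $data;
}

my $dir   = tempdir( CLEANUP => 1 );
my $calls = 0;
my $stub  = sub { $calls++; return "%PDF-stub" };    # stands in for getPDFExtract
fetch_or_generate( "$dir/my1.pdf", $stub ) for 1 .. 2;
print "generator ran $calls time(s)\n";
```

The second request is answered from the cache file, so the generator runs only once.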

getPDFExtractVariables

Get any of the public variables by passing a list of the variable names:

($error,$found)=$pdf->getPDFExtractVariables( "PDFError", "PDFPagesFound");

This method returns an array of values corresponding to the named variables passed in as arguments. If a variable is undefined then its returned value will be undefined.

getVars

This method is an alias for getPDFExtractVariables. Get any of the public variables by passing a list of the variable names:

@vars=$pdf->getVars( @varNames );

This method returns an array of values corresponding to the named variables passed in as arguments. If a variable is undefined then its returned value will be undefined.

setPDFExtractVariables

Set any of the public variables using a hash of the variables and their values.

($doc,$pages)=$pdf->setPDFExtractVariables(PDFDoc=>'C:/my.pdf', PDFPages=>1);

This method sets the variables specified in the argument hash. It returns an array of the new values set.

setVars

This method is an alias for setPDFExtractVariables. Set any of the public variables using a hash of the variables and their values.

@vars=$pdf->setVars( %vars );

This method sets the variables specified in the argument hash. It returns an array of the new values set.
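The calling conventions of the getter and setter methods (a list of names in, a list of values out; name and value pairs in, the new values out) can be illustrated with a small stand-alone class. Sketch::Vars is a hypothetical package written for this example only. It mimics the documented behaviour and is not how PDF::Extract itself is implemented.

```perl
use strict;
use warnings;

package Sketch::Vars;    # hypothetical class, not PDF::Extract itself

sub new { return bless { vars => {} }, shift }

# Set each named variable and return the new values in argument order.
sub setVars {
    my ( $self, @pairs ) = @_;
    my @set;
    while ( my ( $name, $value ) = splice @pairs, 0, 2 ) {
        $self->{vars}{$name} = $value;
        push @set, $value;
    }
    return @set;
}

# Return the values of the named variables; unset names yield undef.
sub getVars {
    my ( $self, @names ) = @_;
    return @{ $self->{vars} }{@names};
}

package main;

my $obj = Sketch::Vars->new;
my ( $doc, $pages ) = $obj->setVars( PDFDoc => 'C:/my.pdf', PDFPages => 1 );
my ( $error, $found ) = $obj->getVars( "PDFError", "PDFPagesFound" );
print "doc=$doc pages=$pages\n";
print "error is ", ( defined $error ? "set" : "undefined" ), "\n";
```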

VARIABLES

PDFDoc (set and get)

$file=$pdf->getVars("PDFDoc");

This variable contains the path to the last original PDF document accessed by getPDFExtract, savePDFExtract, servePDFExtract and fastServePDFExtract. PDFDoc will be an empty string if there was an error.

PDFPages (set and get)

$pages=$pdf->setVars("PDFPages"=>"1 18-23");
or
$pages=$pdf->getVars("PDFPages");

This variable contains a list of pages to extract from the original PDF document accessed by getPDFExtract, savePDFExtract, servePDFExtract and fastServePDFExtract.

PDFCache (set and get)

$cachePath=$pdf->setVars("PDFCache"=>"C:/myCache");
or
$cachePath=$pdf->getVars("PDFCache");

This variable contains the path to the PDF document cache. This value is required by savePDFExtract and fastServePDFExtract method calls. PDFCache will be an empty string if there was an error in setting the value.

PDFErrorPage (set and get)

$errorPagePath=$pdf->setVars("PDFErrorPage"=>"C:/myError.html");
or
$errorPagePath=$pdf->getVars("PDFErrorPage");

PDFErrorPage is a text file that can be used as a template for the error page. If the PDFErrorPage contains [PDFError], the word PDFError surrounded by square brackets, then the error description will replace [PDFError]. Otherwise you can devise a generic error description and describe remedial actions to be taken by the viewer.

If this variable is not set then a default error page will be used. The default page has a message in red at the top, "There is system problem in processing your PDF Pages request.", and then a description of the actual error follows underneath in black.
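The [PDFError] placeholder substitution can be sketched as follows. fill_error_page is a hypothetical helper, not part of PDF::Extract, and both the template and the error text are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::Extract: replace every
# [PDFError] placeholder in a template with the error description.
sub fill_error_page {
    my ( $template, $error ) = @_;
    $template =~ s/\[PDFError\]/$error/g;
    return $template;
}

# Illustrative template and error text only.
my $template = "<html><body><p>Sorry: [PDFError]</p></body></html>";
print fill_error_page( $template, "Can't open PDF document" ), "\n";
```

A template without the placeholder passes through unchanged, which corresponds to the generic error page case described above. This sketch does not HTML-escape the error text.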

PDFExtract (get only)

$out=$pdf->getVars("PDFExtract");

This variable contains the last PDF document processed by getPDFExtract, savePDFExtract, servePDFExtract and fastServePDFExtract. PDFExtract will be an empty string if there was an error.

PDFPagesFound (get only)

$pagesFound=$pdf->getVars("PDFPagesFound");

This variable contains a comma-separated list of the page numbers that were selected and found within the original PDF document. PDFPagesFound will be undefined if there was an error in finding any pages.

PDFPageCount (get only)

$pageCount=$pdf->getVars("PDFPageCount");

This variable contains the number of the pages that were selected and found within the original PDF document. PDFPageCount will be an empty string if there was an error in finding any pages.

PDFError (get only)

$error=$pdf->getVars("PDFError");

This variable contains a string describing the errors, if any, in processing the original PDF file. PDFError is guaranteed to be set if getPDFExtract, savePDFExtract, servePDFExtract or fastServePDFExtract fail and return "0". PDFError will be an empty string if there was no error.

AUTHOR

Noel Sharrock <mailto:nsharrok@lgmedia.com.au>

PDF::Extract's home page http://www.lgmedia.com.au/PDF/Extract.asp

SUPPORT

Many thanks to Lyman Byrd for his welcome programming suggestions and editorial comments on the POD.

COPYRIGHT

Copyright (c) 2003 by Noel Sharrock. All rights reserved.

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".

The C library at the core of this Perl module can additionally be redistributed and/or modified under the terms of the "GNU Library General Public License".

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the "GNU General Public License" for more details.
