
NAME

CAM::PDF - PDF manipulation library

LICENSE

Copyright 2005 Clotho Advanced Media, Inc., <cpan@clotho.com>

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SYNOPSIS

use CAM::PDF;

my $pdf = CAM::PDF->new('test1.pdf');

my $page1 = $pdf->getPageContent(1);
# ... mess with page ...
$pdf->setPageContent(1, $page1);
# ... create some new content ...
$pdf->appendPageContent(1, $newcontent);

my @prefs = $pdf->getPrefs();
$prefs[$CAM::PDF::PREF_OPASS] = 'mypassword';
$pdf->setPrefs(@prefs);

$pdf->cleanoutput('out1.pdf');

Many example scripts are included in this distribution to do basic tasks.

DESCRIPTION

This package reads and writes any document that conforms to the PDF specification generously provided by Adobe at (as of Oct 2005) http://partners.adobe.com/public/developer/pdf/index_reference.html

The file format is well-supported, with the exception of the "linearized" or "optimized" output format, which this module can read but not write. Many specific aspects of the document model are not manipulable with this package (like fonts), but if the input document is correctly written, then this module will preserve the model integrity.

This library grants you some power over the PDF security model. Note that applications editing PDF documents via this library MUST respect the security preferences of the document. Any violation of this respect is contrary to Adobe's intellectual property position, as stated in the reference manual at the above URL.

Technical detail regarding corrupt PDFs: This library adheres strictly to the PDF specification. Adobe's Acrobat Reader is more lenient, allowing some corrupted PDFs to be viewable. Therefore, some PDFs that Acrobat can read may be unreadable by this library. In particular, files which have had line endings converted to or from DOS/Windows style (i.e. CR-LF) may be rendered unusable even though Acrobat does not complain. Future library versions may relax the parser, but not yet.

COMPATIBILITY

This library was primarily developed against the 3rd edition of the reference (PDF v1.4) with a few updates from 4th edition. This library focuses on PDF v1.2 features. It should be forward and backward compatible in the majority of cases.

PERFORMANCE

This module is written with good speed and flexibility in mind, often at the expense of memory consumption. Entire PDF documents are typically slurped into RAM. As an example, simply calling new() on the 14 MB Adobe PDF Reference V1.5 document pushes Perl to consume 84 MB of RAM on my development machine.

API

Functions intended to be used externally

$self = CAM::PDF->new(content | filename | '-')
$self->toPDF()
$self->needsSave()
$self->save()
$self->cleansave()
$self->output(filename | '-')
$self->cleanoutput(filename | '-')
$self->preserveOrder()
$self->appendObject(olddoc, oldnum, [follow=(1|0)])
$self->replaceObject(newnum, olddoc, oldnum, [follow=(1|0)])
   (olddoc can be undef in the above for adding new objects)
$self->numPages()
$self->getPageText(pagenum)
$self->getPageContent(pagenum)
$self->setPageContent(pagenum, content)
$self->appendPageContent(pagenum, content)
$self->deletePage(pagenum)
$self->deletePages(pagenum, pagenum, ...)
$self->extractPages(pagenum, pagenum, ...)
$self->appendPDF(CAM::PDF object)
$self->prependPDF(CAM::PDF object)
$self->wrapString(string, width, fontsize, page, fontlabel)
$self->getFontNames(pagenum)
$self->addFont(page, fontname, fontlabel, [fontmetrics])
$self->deEmbedFont(page, fontname, [newfontname])
$self->deEmbedFontByBaseName(page, basename, [newfont])
$self->getPrefs()
$self->setPrefs()
$self->canPrint()
$self->canModify()
$self->canCopy()
$self->canAdd()
$self->getFormFieldList()
$self->fillFormFields(fieldname, value, [fieldname, value, ...])
  or $self->fillFormFields(%values)
$self->clearFormFieldTriggers(fieldname, fieldname, ...)

Note: 'clean' as in 'cleansave' and 'cleanoutput' means write a fresh PDF document. The alternative (e.g. 'save') reuses the existing doc and just appends to it. Also note that 'clean' functions sort the objects numerically. If you prefer that the new PDF docs more closely resemble the old ones, call 'preserveOrder' before 'cleansave' or 'cleanoutput'.
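As a rough illustration of the note above, the following sketch writes a fresh copy of a document and optionally keeps the original object order. write_fresh_copy is a hypothetical helper, not part of CAM::PDF; it only calls preserveOrder() and cleanoutput() as listed above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): write a freshly built
# copy of $pdf to $outfile. $pdf is expected to be a CAM::PDF object.
sub write_fresh_copy {
    my ($pdf, $outfile, $keep_order) = @_;

    # Without this call, the 'clean' functions sort objects numerically.
    $pdf->preserveOrder() if $keep_order;

    # Rebuild the document from scratch rather than appending changes.
    $pdf->cleanoutput($outfile);
    return;
}

# Typical use, assuming CAM::PDF is installed:
#   my $pdf = CAM::PDF->new('in.pdf');
#   write_fresh_copy($pdf, 'out.pdf', 1);
```

Passing the object in, rather than opening the file inside the helper, keeps the decision about passwords and options with the caller.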

Slightly less external, but useful, functions

$self->toString()
$self->getPage(pagenum)
$self->getFont(pagenum, fontname)
$self->getFonts(pagenum)
$self->getStringWidth(fontdict, string)
$self->getFormField(fieldname)
$self->getFormFieldDict(object)
$self->isLinearized()
$self->decodeObject(objectnum)
$self->decodeAll(any-node)
$self->decodeOne(dict-node)
$self->encodeObject(objectnum, filter)
$self->encodeOne(any-node, filter)
$self->changeString(obj-node, hashref)

Deeper utilities

$self->pageAddName(pagenum, name, objectnum)
$self->getPageObjnum(pagenum)
$self->getPropertyNames(pagenum)
$self->getProperty(pagenum, propname)
$self->getValue(any-node)
$self->dereference(objectnum)  or $self->dereference(name,pagenum)
$self->deleteObject(objectnum)
$self->copyObject(obj-node)
$self->cacheObjects()
$self->setObjNum(obj-node, num)
$self->getRefList(obj-node)
$self->changeRefKeys(obj-node, hashref)

More rarely needed utilities

$self->getObjValue(objectnum)

Routines that should not be called

$self->_startdoc()
$self->delinearize()
$self->build*()
$self->parse*()
$self->write*()
$self->*CB()
$self->traverse()
$self->fixDecode()
$self->abbrevInlineImage()
$self->unabbrevInlineImage()
$self->cleanse()
$self->clean()
$self->createID()

FUNCTIONS

Object creation/manipulation

new PACKAGE, CONTENT
new PACKAGE, CONTENT, OWNERPASS, USERPASS
new PACKAGE, CONTENT, OWNERPASS, USERPASS, PROMPT?
new PACKAGE, CONTENT, OWNERPASS, USERPASS, OPTIONS

Instantiate a new CAM::PDF object. CONTENT can be a document in a string, a filename, or '-'. The latter indicates that the document should be read from standard input. If the document is password protected, the passwords should be passed as additional arguments. If they are not known, a boolean argument allows the programmer to suggest that the constructor prompt the user for a password. This is rudimentary prompting: passwords are in the clear on the console.

This constructor takes an optional final argument which is a hash reference. This hash can contain any of the following optional parameters:

prompt_for_password => BOOLEAN

This is the same as the PROMPT argument described above.

fault_tolerant => BOOLEAN

This flag causes the instance to be more lenient when reading the input PDF. Currently, this only affects PDFs which cannot be successfully decrypted.
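A minimal sketch of passing the options hash as the final constructor argument. open_pdf is a hypothetical helper. Two things here are assumptions not stated above: that undef is acceptable for passwords on unencrypted documents, and that the constructor returns a false value on failure.

```perl
use strict;
use warnings;

# Hypothetical helper: open a PDF without any interactive prompting.
sub open_pdf {
    my ($source, $owner_pass, $user_pass) = @_;

    require CAM::PDF;    # loaded at call time

    my $pdf = CAM::PDF->new(
        $source, $owner_pass, $user_pass,
        {
            prompt_for_password => 0,    # never ask on the console
            fault_tolerant      => 1,    # be lenient about decryption problems
        },
    );

    # Assumption: the constructor returns a false value on failure.
    die "Could not open PDF document\n" if !$pdf;
    return $pdf;
}

# Typical use:
#   my $pdf = open_pdf('protected.pdf', 'ownerpass', 'userpass');
```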

toPDF

Serializes the data structure as a PDF document stream and returns it as a scalar.

toString

Returns a serialized representation of the data structure. Implemented via Data::Dumper.

Document reading

(all of these functions are intended for internal use only)

getRootDict

Returns the Root dictionary for the PDF.

getPagesDict

Returns the root Pages dictionary for the PDF.

parseObj STRING

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return an object Node. This can be called as a class method in most circumstances, but is intended as an instance method.

parseInlineImage STRING
parseInlineImage STRING, OBJNUM
parseInlineImage STRING, OBJNUM, GENNUM

Given a fragment of PDF page content, parse it and return an object Node. This can be called as a class method in some cases, but is intended as an instance method.

writeInlineImage OBJECTNODE

This is the inverse of parseInlineImage, intended for use only in the CAM::PDF::Content class.

parseStream STRING, OBJNUM, GENNUM, DICTNODE

This should only be used by parseObj(), or other specialized cases.

Given a fragment of PDF page content, parse it and return a stream Node. This can be called as a class method in most circumstances, but is intended as an instance method.

The dictionary Node argument is typically the body of the object Node that precedes this stream.

parseDict STRING
parseDict STRING, OBJNUM
parseDict STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return a dictionary Node. This can be called as a class method in most circumstances, but is intended as an instance method.

parseArray STRING
parseArray STRING, OBJNUM
parseArray STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return an array Node. This can be called as a class or instance method.

parseLabel STRING
parseLabel STRING, OBJNUM
parseLabel STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return a label Node. This can be called as a class or instance method.

parseRef STRING
parseRef STRING, OBJNUM
parseRef STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return a reference Node. This can be called as a class or instance method.

parseNum STRING
parseNum STRING, OBJNUM
parseNum STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return a number Node. This can be called as a class or instance method.

parseString STRING
parseString STRING, OBJNUM
parseString STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return a string Node. This can be called as a class or instance method.

parseHexString STRING
parseHexString STRING, OBJNUM
parseHexString STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return a hexstring Node. This can be called as a class or instance method.

parseBoolean STRING
parseBoolean STRING, OBJNUM
parseBoolean STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return a boolean Node. This can be called as a class or instance method.

parseNull STRING
parseNull STRING, OBJNUM
parseNull STRING, OBJNUM, GENNUM

Use parseAny() instead of this, if possible.

Given a fragment of PDF page content, parse it and return a null Node. This can be called as a class or instance method.

parseAny STRING
parseAny STRING, OBJNUM
parseAny STRING, OBJNUM, GENNUM

Given a fragment of PDF page content, parse it and return a Node of the appropriate type. This can be called as a class or instance method.

Data Accessors

getValue OBJECT

For INTERNAL use

Dereference a data object and return a value. Given a node object of any kind, returns a raw scalar value: hashref, arrayref, string, or number. This function follows all references, and descends into all objects.

getObjValue OBJECTNUM

For INTERNAL use

Dereference a data object, and return a value. Behaves just like the getValue() function, but used when all you know is the object number.

dereference OBJECTNUM
dereference NAME, PAGENUM

For INTERNAL use

Dereference a data object and return a PDF object as a node. This function makes heavy use of the internal object cache. Most (if not all) object requests should go through this function.

NAME should look something like '/R12'.

getPropertyNames PAGENUM
getProperty PAGENUM, PROPERTYNAME

Each PDF page contains a list of resources that it uses (images, fonts, etc). getPropertyNames() returns an array of the names of those resources. getProperty() returns a node representing a named property (most likely a reference node).

getFont PAGENUM, FONTNAME

For INTERNAL use

Returns a dictionary for a given font identified by its label, referenced by page.

getFontNames PAGENUM

For INTERNAL use

Returns a list of fonts for a given page.

getFonts PAGENUM

For INTERNAL use

Returns an array of font objects for a given page.

getFontByBaseName PAGENUM, FONTNAME

For INTERNAL use

Returns a dictionary for a given font, referenced by page and the name of the base font.

getFontMetrics PROPERTIES, FONTNAME

For INTERNAL use

Returns a data structure representing the font metrics for the named font. The property list is the result of something like the following:

$self->_buildNameTable($pagenum);
my $properties = $self->{Names}->{$pagenum};

Alternatively, if you know the page number, it might be easier to do:

my $font = $self->dereference($fontlabel, $pagenum);
my $fontmetrics = $font->{value}->{value};

where the fontlabel is something like '/Helv'. The getFontMetrics method is useful in the cases where you've forgotten which page number you are working on (e.g. in CAM::PDF::GS), or if your property list isn't part of any page (e.g. working with form field annotation objects).

addFont PAGENUM, FONTNAME, FONTLABEL
addFont PAGENUM, FONTNAME, FONTLABEL, FONTMETRICS

Adds a reference to the specified font to the page.

If a fontmetrics hash is supplied (it is required for a font other than the 14 core fonts), then it is cloned and inserted into the new font structure. Note that if those fontmetrics contain references (e.g. to the FontDescriptor), the referred objects are not copied -- you must do that part yourself.

For Type1 fonts, the fontmetrics must minimally contain the following fields: Subtype, FirstChar, LastChar, Widths, FontDescriptor.

deEmbedFont PAGENUM, FONTNAME
deEmbedFont PAGENUM, FONTNAME, BASEFONT

Removes embedded font data, leaving font reference intact. Returns true if the font exists and 1) font is not embedded or 2) embedded data was successfully discarded. Returns false if the font does not exist, or the embedded data could not be discarded.

The optional basefont parameter allows you to change the font. This is useful when some applications embed a standard font (see below) and give it a funny name, like 'SYLXNP+Helvetica'. In this example, it's important to change the basename back to the standard 'Helvetica' when de-embedding.

De-embedding the font does NOT remove it from the PDF document; it just removes references to it. To get a size reduction by throwing away unused font data, you should use the following code sometime after this method.

$self->cleanse();

For reference, the standard fonts are Times-Roman, Helvetica, and Courier (and their bold, italic and bold-italic forms) plus Symbol and ZapfDingbats. (Adobe PDF Reference v1.4, p.319)

deEmbedFontByBaseName PAGENUM, FONTNAME
deEmbedFontByBaseName PAGENUM, FONTNAME, BASEFONT

Just like deEmbedFont(), except that the font name parameter refers to the name of the current base font instead of the PDF label for the font.

wrapString STRING, WIDTH, FONTSIZE, FONTMETRICS
wrapString STRING, WIDTH, FONTSIZE, PAGENUM, FONTLABEL

Returns an array of strings wrapped to the specified width.

getStringWidth FONTMETRICS, STRING

For INTERNAL use

Returns the width of the string, using the font metrics if possible.

numPages

Returns the number of pages in the PDF document.

getPage PAGENUM

For INTERNAL use

Returns a dictionary for a given numbered page.

getPageObjnum PAGENUM

For INTERNAL use

Return the number of the PDF object in which the specified page occurs.

getPageText PAGENUM

Extracts the text from a PDF page as a string.
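A small sketch of collecting the text of every page by combining numPages() and getPageText(). all_page_text is a hypothetical helper; the undef handling is a defensive assumption rather than documented behavior.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text of every page, in page order.
# $pdf is expected to be a CAM::PDF object; pages are numbered from 1.
sub all_page_text {
    my ($pdf) = @_;
    my @text;
    for my $pagenum (1 .. $pdf->numPages()) {
        my $page_text = $pdf->getPageText($pagenum);

        # Assumption: a page without extractable text may yield undef.
        push @text, defined $page_text ? $page_text : q{};
    }
    return @text;
}

# Typical use, assuming CAM::PDF is installed:
#   my $pdf = CAM::PDF->new('in.pdf');
#   print join("\n---\n", all_page_text($pdf)), "\n";
```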

getPageContentTree PAGENUM

Retrieves a parsed page content data structure, or undef if there is a syntax error or if the page does not exist.

getPageContent PAGENUM

Return a string with the layout contents of one page.

getName OBJECT

For INTERNAL use

Given a PDF object reference, return its name, if it has one. This is useful for indirect references to images in particular.

getPrefs

Return an array of security information for the document:

owner password
user password
print boolean
modify boolean
copy boolean
add boolean

See the PDF reference for the intended use of the latter four booleans.

This module publishes the array indices of these values for your convenience:

$CAM::PDF::PREF_OPASS
$CAM::PDF::PREF_UPASS
$CAM::PDF::PREF_PRINT
$CAM::PDF::PREF_MODIFY
$CAM::PDF::PREF_COPY
$CAM::PDF::PREF_ADD

So, you can retrieve the value of the Copy boolean via:

my ($canCopy) = ($self->getPrefs())[$CAM::PDF::PREF_COPY];

canPrint

Return a boolean indicating whether the Print permission is enabled on the PDF.

canModify

Return a boolean indicating whether the Modify permission is enabled on the PDF.

canCopy

Return a boolean indicating whether the Copy permission is enabled on the PDF.

canAdd

Return a boolean indicating whether the Add permission is enabled on the PDF.
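The four permission accessors can be gathered into one structure for logging or for a guard before writing. This is only a sketch; permission_summary is a hypothetical helper and the key names are arbitrary.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the four permission flags of a
# CAM::PDF object into a plain hash of 1/0 values.
sub permission_summary {
    my ($pdf) = @_;
    return (
        'print'  => $pdf->canPrint()  ? 1 : 0,
        'modify' => $pdf->canModify() ? 1 : 0,
        'copy'   => $pdf->canCopy()   ? 1 : 0,
        'add'    => $pdf->canAdd()    ? 1 : 0,
    );
}

# Typical use:
#   my %perm = permission_summary($pdf);
#   die "Copying is not permitted\n" if !$perm{copy};
```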

getFormFieldList

Return an array of the names of all of the PDF form fields. The names are the full hierarchical names constructed as explained in the PDF reference manual. These names are useful for the fillFormFields() function.

getFormField NAME

For INTERNAL use

Return the object containing the form field definition for the specified field name. NAME can be either the full name or the "short/alternate" name.

getFormFieldDict FORMFIELDOBJECT

For INTERNAL use

Return a hash reference representing the accumulated property list for a form field, including all of its inherited properties. This should be treated as a read-only hash! It ONLY retrieves the properties it knows about.

Data/Object Manipulation

setPrefs OWNERPASS, USERPASS, PRINT?, MODIFY?, COPY?, ADD?

Alter the document's security information. Note that modifying these parameters must be done respecting the intellectual property of the original document. See Adobe's statement in the introduction of the reference manual.

setName OBJECT, NAME

For INTERNAL use

Change the name of a PDF object structure.

removeName OBJECT

For INTERNAL use

Delete the name of a PDF object structure.

pageAddName PAGENUM, NAME, OBJECTNUM

For INTERNAL use

Append a named object to the metadata for a given page.

setPageContent PAGENUM, CONTENT

Replace the content of the specified page with a new version. This function is often used after the getPageContent() function and some manipulation of the returned string from that function.
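A sketch of the get, modify, set round trip described above. replace_in_page is a hypothetical helper. Page content is a raw stream of PDF operators, so a literal substitution like this only works when the text appears verbatim in the stream; text that is split across operators or specially encoded will not match.

```perl
use strict;
use warnings;

# Hypothetical helper: literal search-and-replace in one page's content.
# Returns the number of replacements made.
sub replace_in_page {
    my ($pdf, $pagenum, $old, $new) = @_;

    my $content = $pdf->getPageContent($pagenum);
    return 0 if !defined $content;

    # \Q...\E makes the search string literal rather than a pattern.
    my $count = ($content =~ s/\Q$old\E/$new/g) || 0;

    # Only write the page back if something actually changed.
    $pdf->setPageContent($pagenum, $content) if $count;
    return $count;
}

# Typical use:
#   replace_in_page($pdf, 1, 'DRAFT', 'FINAL');
#   $pdf->cleanoutput('out.pdf');
```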

appendPageContent PAGENUM, CONTENT

Add more content to the specified page. Note that this function does NOT do any page metadata work for you (like creating font objects for any newly defined fonts).
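Because appendPageContent() does no metadata work, a font used by new text has to be registered separately. The sketch below pairs addFont() with appendPageContent(). stamp_text is a hypothetical helper, the label 'Helv' is arbitrary, and passing the label to addFont() without a leading slash is an assumption. The content string uses standard PDF text operators (BT, Tf, Td, Tj, ET).

```perl
use strict;
use warnings;

# Hypothetical helper: add a short line of Helvetica text to a page.
# $x and $y are in PDF units measured from the lower-left corner.
sub stamp_text {
    my ($pdf, $pagenum, $text, $x, $y, $size) = @_;
    $size = 12 if !defined $size;

    # Escape the characters that are special inside a PDF literal string.
    (my $escaped = $text) =~ s/([\\()])/\\$1/g;

    # Register a core font under a label so the Tf operator can find it.
    # Assumption: the label is given without a leading slash.
    $pdf->addFont($pagenum, 'Helvetica', 'Helv');

    my $content = "BT /Helv $size Tf $x $y Td ($escaped) Tj ET\n";
    $pdf->appendPageContent($pagenum, $content);
    return $content;
}

# Typical use:
#   stamp_text($pdf, 1, 'Reviewed (copy)', 72, 720);
```

Helvetica is one of the 14 core fonts, so no font metrics argument is needed here.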

extractPages PAGES...

Remove all pages from the PDF except the specified ones. Like deletePages(), the pages can be given as multiple arguments, comma-separated lists, or ranges (open or closed).

deletePages PAGES...

Remove the specified pages from the PDF. The pages can be given as multiple arguments, comma-separated lists, or ranges (open or closed).
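To make the page-list forms concrete, here is a stand-alone sketch of how such specifications might expand into page numbers. It is not the library's code. expand_page_specs is a hypothetical helper, and its handling of open and reversed ranges is modeled on the rangeToArray() example in this document.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper; NOT the library's implementation.
# Expands specs such as '1,3-5', '14-' or '-2' into page numbers.
sub expand_page_specs {
    my ($min, $max, @specs) = @_;
    my @pages;
    for my $spec (@specs) {
        for my $part (split /,/, $spec) {
            $part =~ s/\s+//g;    # ignore embedded whitespace
            if ($part =~ /\A(\d*)-(\d*)\z/) {
                # An omitted end point defaults to the first or last page.
                my $from = length($1) ? $1 + 0 : $min;
                my $to   = length($2) ? $2 + 0 : $max;
                push @pages,
                    $from <= $to ? ($from .. $to) : reverse($to .. $from);
            }
            elsif ($part =~ /\A\d+\z/) {
                push @pages, $part + 0;
            }
        }
    }
    return @pages;
}

my @pages = expand_page_specs(1, 15, '1,3-5,12,9', '14-', '8 - 6, -2');
print join(',', @pages), "\n";    # prints 1,3,4,5,12,9,14,15,8,7,6,1,2
```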

deletePage PAGENUM

Remove the specified page from the PDF. If the PDF has only one page, this method will fail.

decachePages PAGENUM, PAGENUM, ...

Clears cached copies of the specified page data structures. This is useful if an operation has been performed that changes a page.

addPageResources PAGENUM, RESOURCEHASH

Add the resources from the given object to the page resource dictionary. If the page does not have a resource dictionary, create one. This function avoids duplicating resources where feasible.

appendPDF PDF

Append pages from another PDF document to this one. No optimization is done -- the pieces are just appended and the internal table of contents is updated.

Note that this can break documents with annotations. See the appendpdf.pl script for a workaround.

prependPDF PDF

Just like appendPDF() except the new document is inserted on page 1 instead of at the end.

duplicatePage PAGENUM
duplicatePage PAGENUM, LEAVEBLANK

Inserts an identical copy of the specified page into the document. The new page's number will be pagenum + 1.

If leaveblank is true, the new page does not get any content. Thus, the document is broken until you subsequently call setPageContent().

createStreamObject CONTENT
createStreamObject CONTENT, FILTER ...

For INTERNAL use

Create a new Stream object. This object is NOT added to the document. Use the appendObject() function to do that after calling this function.

uninlineImages
uninlineImages PAGENUM

Search the content of the specified page (or all pages if the page number is omitted) for embedded images. If there are any, replace them with indirect objects. This procedure uses heuristics to detect inline images, and is subject to confusion in extremely rare cases of text that uses "BI" and "ID" a lot.

appendObject DOC, OBJECTNUM, RECURSE?
appendObject undef, OBJECT, RECURSE?

Duplicate an object from another PDF document and add it to this document, optionally descending into the object and copying any other objects it references.

Like replaceObject(), the second form allows you to append a newly-created block to the PDF.

replaceObject OBJECTNUM, DOC, OBJECTNUM, RECURSE?
replaceObject OBJECTNUM, undef, OBJECT

Duplicate an object from another PDF document and insert it into this document, replacing an existing object. Optionally descend into the original object and copy any other objects it references.

If the other document is undefined, then the object to copy is taken to be an anonymous object that is not part of any other document. This is useful when you've just created that anonymous object.

deleteObject OBJECTNUM

Remove an object from the document. This function does NOT take care of dependencies on this object.

cleanse

Remove unused objects. WARNING: this function breaks some PDF documents because it removes objects that are not strictly part of the page model hierarchy, but which are required anyway (like some font definition objects).

createID

For INTERNAL use

Generate a new document ID. Contrary to the Adobe recommendation, this is a random number.

fillFormFields NAME => VALUE ...

Set the default values of PDF form fields. The name should be the full hierarchical name of the field as output by the getFormFieldList() function. The argument list can be a hash if you like. A simple way to use this function is something like this:

my %fields = (fname => 'John', lname => 'Smith', state => 'WI');
$fields{zip} = 53703;
$self->fillFormFields(%fields);

clearFormFieldTriggers NAME, NAME, ...

Disable any triggers set on data entry for the specified form field names. This is useful in the case where, for example, the data entry JavaScript forbids punctuation and you want to prefill with a hyphenated word. If you don't clear the trigger, the prefill may not happen.
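One possible way to combine the form functions is to clear triggers before filling, and to skip names the document does not define. prefill_form is a hypothetical helper; filtering against getFormFieldList() is a design choice of this sketch, not a library requirement.

```perl
use strict;
use warnings;

# Hypothetical helper: prefill form fields after clearing their triggers.
# Returns the number of fields that were filled.
sub prefill_form {
    my ($pdf, %values) = @_;

    # Only keep names that the document actually defines.
    my %known = map { ($_ => 1) } $pdf->getFormFieldList();
    my @names = grep { $known{$_} } sort keys %values;
    return 0 if !@names;

    # Drop data-entry triggers first so they cannot block the prefill.
    $pdf->clearFormFieldTriggers(@names);
    $pdf->fillFormFields(map { ($_ => $values{$_}) } @names);
    return scalar @names;
}

# Typical use:
#   prefill_form($pdf, fname => 'Mary-Ann', lname => 'Smith');
#   $pdf->cleanoutput('filled.pdf');
```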

clearAnnotations

Remove all annotations from the document. If form fields are encountered, their text is added to the appropriate page.

Document Writing

preserveOrder

Try to recreate the original document as much as possible. This may help in recreating documents which use undocumented tricks of saving font information in adjacent objects.

isLinearized

Returns a boolean indicating whether this PDF is linearized (aka "optimized").

delinearize

For INTERNAL use

Undo the tweaks used to make the document 'optimized'. This function is automatically called on every save or output since this library does not yet support linearized documents.

clean

Cache all parts of the document and throw away its old structure. This is useful for writing PDFs anew, instead of simply appending changes to the existing documents. This is called by cleansave and cleanoutput.

needsSave

Returns a boolean indicating whether the save() method needs to be called. Like save(), this has nothing to do with whether the document has been saved to disk, but whether the in-memory representation of the document has been serialized.

save

Serialize the document into a single string. All changed document elements are normalized, and a new index and an updated trailer are created.

This function operates solely in memory. It DOES NOT write the document to a file. See the output() function for that.

cleansave

Call the clean() function, then call the save() function.

output FILENAME
output

Save the document to a file. The save() function is called first to serialize the data structure. If no filename is specified, or if the filename is '-', the document is written to standard output.

Note: it is the responsibility of the application to ensure that the PDF document has either the Modify or Add permission. You can do this like the following:

if ($self->canModify()) {
   $self->output($outfile);
} else {
   die "The PDF file denies permission to make modifications\n";
}

cleanoutput FILE
cleanoutput

Call the clean() function, then call the output() function to write a fresh copy of the document to a file.

writeObject OBJNUM

Return the serialization of the specified object.

writeString STRING

Return the serialization of the specified string. Works on normal or hex strings. If encryption is desired, the string should be encrypted before being passed here.

writeAny NODE

Returns the serialization of the specified node. This handles all Node types, including object Nodes.

Document Traversing

traverse DEREFERENCE_FLAG, NODE, CALLBACKFUNC, CALLBACKDATA

Recursive traversal of a PDF data structure.

In many cases, it's useful to apply one action to every node in an object tree. The routines below all use this traverse() function. One of the most important parameters is the first: $deref=(1|0). If true, the traversal follows reference Nodes. If false, it does not descend into reference Nodes.

decodeObject OBJECTNUM

For INTERNAL use

Remove any filters (like compression, etc) from a data stream indicated by the object number.

decodeAll OBJECT

For INTERNAL use

Remove any filters from any data stream in this object or any object referenced by it.

decodeOne OBJECT
decodeOne OBJECT, SAVE?

For INTERNAL use

Remove any filters from an object. The boolean flag SAVE (defaults to false) indicates whether this defiltering should be permanent or just this once. If true, the function returns success or failure. If false, the function returns the defiltered content.

fixDecode DATA, FILTER, PARAMS

This is a utility method to do any tweaking after removing the filter from a data stream.

encodeObject OBJECTNUM, FILTER

Apply the specified filter to the object.

encodeOne OBJECT, FILTER

Apply the specified filter to the object.

setObjNum OBJECT, OBJECTNUM

Descend into an object and change all of the INTERNAL object number flags to a new number. This is just for consistency of internal accounting.

getRefList OBJECT

For INTERNAL use

Return an array of all objects referred to in this object.

changeRefKeys OBJECT, HASHREF

For INTERNAL use

Renumber all references in an object.

abbrevInlineImage OBJECT

Contract all image keywords to inline abbreviations.

unabbrevInlineImage OBJECT

Expand all inline image abbreviations.

changeString OBJECT, HASHREF

Alter all instances of a given string. The hashref is a dictionary of oldstring and newstring. If the oldstring looks like 'regex(...)' then it is interpreted as a Perl regular expression and is eval'ed. Otherwise the search-and-replace is literal.
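A sketch of applying a literal replacement to a set of objects, using dereference() to obtain each object node. change_strings is a hypothetical helper, and the caller is assumed to know which object numbers to visit. Only the literal form is exercised; the 'regex(...)' key form is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the same string changes to several objects.
# $changes is a hashref of old string => new string.
sub change_strings {
    my ($pdf, $changes, @objnums) = @_;
    for my $objnum (@objnums) {
        my $node = $pdf->dereference($objnum);
        next if !$node;    # skip object numbers that do not resolve
        $pdf->changeString($node, $changes);
    }
    return;
}

# Typical use (object numbers 12 and 13 are placeholders):
#   change_strings($pdf, { 'Draft' => 'Final' }, 12, 13);
#   $pdf->cleanoutput('out.pdf');
```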

Utility functions

rangeToArray MIN, MAX, LIST...

Converts string lists of numbers to an array. For example,

CAM::PDF->rangeToArray(1, 15, '1,3-5,12,9', '14-', '8 - 6, -2');

becomes

(1,3,4,5,12,9,14,15,8,7,6,1,2)

trimstr STRING

Used solely for debugging. Trims a string to a max of 40 characters, handling nulls and non-unix line endings.

sub trimstr
{
   my $pkg_or_doc = shift;
   my $s = $_[0];

   if (!defined $s || $s eq q{})
   {
      $s = '(empty)';
   }
   elsif (length $s > 40)
   {
      $s = substr($s, pos($_[0])||0, 40) . '...';
   }
   $s =~ s/\r/^M/gs;
   return pos($_[0]).q{ }.$s."\n";
}

copyObject NODE

Clones a node via Data::Dumper and eval().

cacheObjects

Parses all object Nodes and stores them in the cache. This is useful for cases where you intend to do some global manipulation and want all of the data conveniently in RAM.

asciify STRING

Helper class/instance method to massage a string, cleaning up some non-ASCII problems. This is a very ad-hoc list. Specifically:

f-i ligatures
(R) symbol

INTERNALS

The data structure used to represent the PDF document is composed primarily of a hierarchy of Node objects. Every node in the document tree has this structure:

type => <type>
value => <value>
objnum => <object number>
gennum => <generation number>

where the <value> depends on the <type>, and <type> is one of

Type        Value
----        -----
object      Node
stream      byte string
string      byte string
hexstring   byte string
number      number
reference   integer (object number)
boolean     "true" | "false"
label       string
array       arrayref of Nodes
dictionary  hashref of (string => Node)
null        undef

All of these except "stream" are directly related to the PDF data types of the same name. Streams are treated as special cases in this library since they have a non-general syntax and placement in the document body. Internally, streams are very much like strings, except that they have filters applied to them.
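To make the layout concrete, the sketch below builds a small dictionary node by hand and strips the Node wrappers recursively. Plain hashes are used for illustration only, since the library may wrap nodes in its own class. plain_value is a hypothetical helper; unlike getValue(), it does not follow reference nodes.

```perl
use strict;
use warnings;

# A hand-built dictionary node following the type/value/objnum/gennum
# layout described above (illustrative data, not from a real PDF).
my $page_node = {
    type   => 'dictionary',
    objnum => 7,
    gennum => 0,
    value  => {
        Type  => { type => 'label',  value => 'Page', objnum => 7, gennum => 0 },
        Count => { type => 'number', value => 3,      objnum => 7, gennum => 0 },
        Kids  => {
            type   => 'array',
            objnum => 7,
            gennum => 0,
            value  => [
                { type => 'reference', value => 9, objnum => 7, gennum => 0 },
            ],
        },
    },
};

# Hypothetical helper: convert a node tree into plain Perl data.
sub plain_value {
    my ($node) = @_;
    my $type  = $node->{type};
    my $value = $node->{value};

    if ($type eq 'dictionary') {
        my %plain = map { ($_ => plain_value($value->{$_})) } keys %{$value};
        return \%plain;
    }
    if ($type eq 'array') {
        return [ map { plain_value($_) } @{$value} ];
    }
    if ($type eq 'object') {
        return plain_value($value);    # an object node wraps another Node
    }
    return $value;    # scalars, including reference object numbers
}

my $plain = plain_value($page_node);
print "$plain->{Type} with $plain->{Count} kids, first is object $plain->{Kids}[0]\n";
```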

All objects are referenced indirectly by their numbers, as defined in the PDF document. In all cases, the dereference() function should be used to deserialize objects into their internal representation. This function is also useful for looking up named objects in the page model metadata. Every node in the hierarchy contains its object and generation number. You can think of this as a sort of a pointer back to the root of each node tree. This serves in place of a "parent" link for every node, which would be harder to maintain.

The PDF document itself is represented internally as a hash reference with many components, including the document content, the document metadata (index, trailer and root node), the object cache, and several other caches, in addition to a few assorted bookkeeping structures.

The core of the document is represented in the object cache, which is only populated as needed, thus avoiding the overhead of parsing the whole document at read time.

AUTHOR

Clotho Advanced Media Inc., cpan@clotho.com

Primary developer: Chris Dolan