NAME

pdf2xml - extract text from PDF files and wrap it in XML

SYNOPSIS

pdf2xml [OPTIONS] pdf-file > output.xml

For more information, see the man page of the pdf2xml command-line tool. To use pdf2xml as a library, call the pdf2xml function:

use Text::PDF2XML;

my $xml = pdf2xml( $pdf_file, %options );

pdf2xml( $pdf_file, output => \*STDOUT, %options );
pdf2xml( $pdf_file, output => 'file.xml', %options );

%options = (
   conversion_tool         => 'pdfXtk',        # use pdfXtk (default = 'tika')
   vocabulary              => 'filename',      # plain text file
   vocabulary_from_pdf     => 0,               # skip pdftotext
   vocabulary_from_raw_pdf => 0,               # skip pdftotext -raw
   vocabulary_from_tika    => 1,               # read voc from Apache Tika
   java                    => '/path/to/java', # java binary
   java_heap               => '8g',            # default = 1g
   split_into_characters   => 1,               # split into characters
   detect_languages        => 1,               # enable language detection
   keep_languages          => 'en',            # only keep English sentences
   lowercase               => 0,               # switch off lower-casing
   dehyphenate             => 0,               # switch off de-hyphenation
   character_merging       => 0,               # skip char merging
   paragraph_merging       => 0,               # skip paragraph merging
   verbose                 => 1                # verbose output
   );

pdf2xml( $pdf_file, output => 'file.xml', %options );

DESCRIPTION

pdf2xml extracts text from PDF files using external tools and then applies post-processing heuristics to the result. The following examples show the output without (raw) and with (clean) post-processing:

raw:    <p>PRESENTATION ET R A P P E L DES PRINCIPAUX RESULTATS 9</p>
clean:  <p>PRESENTATION ET RAPPEL DES PRINCIPAUX RESULTATS 9</p>

raw:    <p>2. Les c r i t è r e s de choix : la c o n s o m m a t i o n 
           de c o m b u s - t ib les et l e u r moda l i t é 
           d ' u t i l i s a t i on d 'une p a r t , 
           la concen t r a t ion d ' a u t r e p a r t 16</p>

clean:  <p>2. Les critères de choix : la consommation 
           de combustibles et leur modalité 
           d'utilisation d'une part, 
           la concentration d'autre part 16</p>

TODO

The character-merging heuristics are very simple. They pick the longest string that forms a valid word in the vocabulary, which may produce many words that are incorrect in context for some languages. The merging procedure is also probably not implemented as efficiently as it could be.
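The longest-match idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the module's actual implementation: the function name merge_fragments, the hash-based vocabulary and the whitespace tokenization are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::PDF2XML): starting at each
# token, concatenate the following tokens and keep the longest
# concatenation that is found in the vocabulary.
sub merge_fragments {
    my ($text, $vocab) = @_;
    my @tokens = split ' ', $text;
    my @out;
    my $i = 0;
    while ($i < @tokens) {
        my ($best, $best_end) = ($tokens[$i], $i);
        my $candidate = $tokens[$i];
        for my $j ($i + 1 .. $#tokens) {
            $candidate .= $tokens[$j];
            # Vocabulary lookup is case-insensitive (assumption).
            if ($vocab->{ lc($candidate) }) {
                ($best, $best_end) = ($candidate, $j);
            }
        }
        push @out, $best;
        $i = $best_end + 1;
    }
    return join ' ', @out;
}

my %vocab = map { $_ => 1 } qw(rappel des principaux apart part);

print merge_fragments('ET R A P P E L DES PRINCIPAUX', \%vocab), "\n";
# prints "ET RAPPEL DES PRINCIPAUX"

print merge_fragments('a part', \%vocab), "\n";
# prints "apart"
```

The second call shows the weakness noted above: because the longest vocabulary match always wins, two correct words ("a part") are merged into a different word ("apart"). The inner loop also rescans every remaining token for each start position, so the sketch is quadratic in the number of tokens.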

The de-hyphenation heuristics could also be improved. The difficulty is keeping them as language-independent as possible.
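One language-independent way to approach de-hyphenation is to rely only on a vocabulary lookup. The sketch below assumes that approach. It is not taken from the module's source, and the name dehyphenate as a standalone function, its arguments and the fallback behaviour are hypothetical choices made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::PDF2XML): join "part1- part2"
# into one word when the joined form is in the vocabulary; otherwise
# keep the hyphen and drop only the whitespace after it.
sub dehyphenate {
    my ($text, $vocab) = @_;
    $text =~ s{(\w+)-\s+(\w+)}{
        $vocab->{ lc("$1$2") } ? "$1$2" : "$1-$2"
    }ge;
    return $text;
}

my %vocab = map { $_ => 1 } qw(combustibles);

print dehyphenate('de combus- tibles et', \%vocab), "\n";
# prints "de combustibles et"

print dehyphenate('well- known', \%vocab), "\n";
# prints "well-known"
```

Because the decision depends only on whether the joined form is a known word, no language-specific hyphenation rules are needed. The cost is that the result is only as good as the vocabulary's coverage.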

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

