NAME

pdf2xml - extract text from PDF files and wrap it in XML

USAGE

pdf2xml [OPTIONS] pdf-file > output.xml

OPTIONS

-c ............. split strings into character sequences before finding words
-d ............. detect language for each paragraph
-D lang ........ ignore all paragraphs that do not match language <lang>
-h ............. skip de-hyphenation (keep hyphenated words)
-H ............. max heap size for Java VM
-J path ........ path to Java
-l lexicon ..... provide a list of words or a text in the target language
-L ............. skip lowercasing (which is switched on by default)
-m ............. skip merging character sequences (not recommended)
-M ............. skip paragraph merging heuristics
-o output.xml .. output file
-r ............. skip 'pdftotext -raw'
-x ............. skip standard 'pdftotext'
-X ............. use pdfXtk to convert to XHTML
-T ............. use Apache Tika for the basic conversion (default)
-v ............. verbose output

DESCRIPTION

pdf2xml tries to combine the output of several conversion tools in order to improve the extraction of text from PDF documents. Currently, it uses pdftotext, Apache Tika and pdfXtk. In the default mode, it calls all tools to extract text, and pdfXtk is used to create the basic XML file from which the final output is produced. Several post-processing heuristics split and merge character sequences in order to clean up the text. Consider the example given below:

raw:    <p>PRESENTATION ET R A P P E L DES PRINCIPAUX RESULTATS 9</p>
clean:  <p>PRESENTATION ET RAPPEL DES PRINCIPAUX RESULTATS 9</p>

raw:    <p>2. Les c r i t è r e s de choix : la c o n s o m m a t i o n 
           de c o m b u s - t ib les et l e u r moda l i t é 
           d ' u t i l i s a t i on d 'une p a r t , 
           la concen t r a t ion d ' a u t r e p a r t 16</p>

clean:  <p>2. Les critères de choix : la consommation 
           de combustibles et leur modalité 
           d'utilisation d'une part, 
           la concentration d'autre part 16</p>
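
To make the idea of merging character sequences more concrete, here is a minimal sketch in Python. It is not the algorithm pdf2xml implements; it only illustrates one plausible approach, assuming a lexicon of known words is available (similar in spirit to the word list given with -l). The function name merge_fragments and the max_span parameter are hypothetical.

```python
def merge_fragments(text, lexicon, max_span=20):
    """Greedily merge runs of adjacent tokens that spell a lexicon word.

    Illustrative sketch only (an assumption, not pdf2xml's real heuristics):
    - `lexicon` is a set of lowercase words,
    - hyphens inside a merged run are dropped (a crude de-hyphenation),
    - single tokens are never altered.
    """
    tokens = text.split()
    merged = []
    i = 0
    while i < len(tokens):
        match_end = None
        match_word = None
        # Try the longest run first; a run must span at least two tokens.
        for j in range(min(len(tokens), i + max_span), i + 1, -1):
            candidate = "".join(tokens[i:j]).replace("-", "")
            if candidate.lower() in lexicon:
                match_end, match_word = j, candidate
                break
        if match_end is not None:
            merged.append(match_word)
            i = match_end
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)


lexicon = {"presentation", "et", "rappel", "des", "principaux", "resultats",
           "de", "combustibles", "leur", "modalité"}

print(merge_fragments(
    "PRESENTATION ET R A P P E L DES PRINCIPAUX RESULTATS 9", lexicon))
print(merge_fragments(
    "de c o m b u s - t ib les et l e u r moda l i t é", lexicon))
```

The longest-match-first strategy is a simplification; it ignores punctuation such as apostrophes and cannot recover words that are missing from the lexicon.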

TODO

The conversion is quite slow, and loading Apache Tika for each conversion is not very efficient. Possible solutions are to use the server mode of Apache Tika, or to use inline Java with direct calls to external libraries.

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.
