* v0.3.3
- added timeout parameter for requests to the ApacheTika server
- fallback to local java app if the ApacheTika server request fails or is incomplete
- catch parser failure and return undef instead of breaking
- removed Blacklist language identifier and use CLD instead
* v0.3.2
- reset vocabulary and language model for each call
- set locale to utf8 before calling external programs
* v0.3.1
- fixed lowercase option
* v0.3.0 Sat Jan 5 14:02:06 EET 2019
- moved options and functions to library
- enable different output options
- enabled calls to Apache Tika server if available
* v0.2.8 Fri Aug 17 10:04:58 EEST 2018
- update to Apache Tika 1.18
- hide warnings and error messages from external tools
* v0.2.7 Sat Jul 1 21:25:09 EEST 2017
- added an ugly workaround to find java 1.6 for pdfxtk
* v0.2.6 Thu Jan 9 16:10:29 CET 2014
- integrated language detection (-d)
- language filter using language detection (-D lang)
* v0.2.5
- merge paragraph heuristics for putting unfinished sentences together
- better approach for finding word boundaries based on
a unigram LM and dynamic programming
- better de-hyphenation in pdfxtk-mode
* v0.2.4 Fri Mar 15 10:34:42 CET 2013
- pdfxtk as default
- heuristics to handle ligatures
- dehyphenation and other heurstics in pdfxtk-mode (-X)
- now also splits strings into characters to find known words
(solves a problem with pdfxtk conversions)
* v0.2.3 Wed Mar 6 23:12:22 CET 2013
- fixed test suite
* v0.2.2 Wed Feb 27 20:33:09 CET 2013
- fixed problem with wrong shared-dir settings
- make word-merging a bit more efficient
* v0.2.1 Fri Feb 15 16:29:41 CET 2013
- add pdfXtk as another option for converting pdf files
(see http://sourceforge.net/projects/pdfxtk/)
* v0.2 - Thu Feb 7 10:51:16 CET 2013
- running without pdftotext is now possible
- added lowercasing (can be switched off)
* v0.1 - Tue Jan 29 20:31:42 CET 2013
- initial release