Changes for version 0.04

  • INCOMPATIBLE CHANGE: refactored file-related code (data files are now stored as gzip archives rather than plaintext files)
  • INCOMPATIBLE CHANGE: tokens_bounds() now returns the zero-based index of the boundary instead of the position of the character after it
  • data files are now represented by classes with a proper API
  • a few small bug fixes
  • split tests for tokens() and tokens_bounds(), enabled tests for the latter
  • data files now have their own version, independent of the module's version
  • data_dir is now configurable in the constructor
  • other small fixes and improvements

Documentation

download newer data for tokenizer

Modules

tokenizer for OpenCorpora project
download newer data for tokenizer

Provides

in lib/Lingua/RU/OpenCorpora/Tokenizer/List.pm
in lib/Lingua/RU/OpenCorpora/Tokenizer/Vectors.pm