Changes for version 0.06

  • improved Unicode handling: normalization, Unicode character classes in regular expressions
  • speed optimizations
  • synced algorithm with current PHP version
  • changed tests to use an empirically determined threshold
  • data update

Documentation

download newer data for tokenizer

Modules

tokenizer for OpenCorpora project
download newer data for tokenizer
represents a file with vectors