Changes for version 0.06
- improved Unicode handling: normalization and Unicode character classes in regular expressions
- speed optimizations
- synced algorithm with current PHP version
- changed tests to use an empirically determined threshold
- data update
Documentation
download newer data for tokenizer
Modules
tokenizer for OpenCorpora project
represents a data file
download newer data for tokenizer
represents a file with vectors