NAME
Treex::Block::W2A::EN::FixTokenization - fix some issues in output of tokenizer
VERSION
version 0.06903_1
DESCRIPTION
This block merges some abbreviations that contain periods into a single token. For example, "e. g." is one token (tagged FW) in the Penn Treebank. Using only SEnglishW_to_SEnglishM::Penn_style_tokenization, we get four tokens (e . g .), which the parser may distribute across different clauses, and this is hard to fix afterwards.
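The merging idea can be illustrated on a flat list of tokens. This is only a minimal sketch and not the block's actual implementation: the real block works on an a-tree via process_atree, whereas the helper merge_abbreviations below is hypothetical, as is its rule of joining two or more consecutive "single letter, period" pairs without spaces.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Treex: merge runs of two or more
# (single letter, period) token pairs, e.g. "e", ".", "g", ".",
# into one abbreviation token ("e.g.").
sub merge_abbreviations {
    my @tokens = @_;
    my @merged;
    my $i = 0;
    while ( $i < @tokens ) {
        # Find the end of a run of (letter, period) pairs starting at $i.
        my $j = $i;
        while ( $j + 1 < @tokens
            && $tokens[$j] =~ /^[A-Za-z]$/
            && $tokens[ $j + 1 ] eq '.' )
        {
            $j += 2;
        }
        if ( $j - $i >= 4 ) {
            # At least two pairs: join them into a single token.
            push @merged, join( '', @tokens[ $i .. $j - 1 ] );
            $i = $j;
        }
        else {
            # No abbreviation here: copy the token unchanged.
            push @merged, $tokens[$i];
            $i++;
        }
    }
    return @merged;
}

my @tokens = ( 'We', 'use', 'tools', ',', 'e', '.', 'g', '.', 'a', 'parser', '.' );
my @fixed  = merge_abbreviations(@tokens);
print join( ' ', @fixed ), "\n";    # prints: We use tools , e.g. a parser .
```

A single "letter, period" pair is deliberately left alone in this sketch, so that a sentence ending in a one-letter word is not mistaken for an abbreviation.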
OVERRIDDEN METHODS
from Treex::Core::Block
- process_atree
AUTHOR
Martin Popel <popel@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2009 - 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.