NAME

Treex::Block::W2A::EN::FixTokenization - fix some issues in the tokenizer output

VERSION

version 0.06903_1

DESCRIPTION

This block merges some abbreviations that contain periods into a single token. For example, "e. g." is one token (tagged FW) in the Penn Treebank, but SEnglishW_to_SEnglishM::Penn_style_tokenization alone yields four tokens: e . g . The parser may then distribute these four tokens across different clauses, which is hard to fix afterwards.
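The merging step described above can be illustrated with a minimal, self-contained sketch. It works on a plain list of token strings and does not use the Treex API. The abbreviation list and the function name merge_abbreviations are hypothetical and chosen only for illustration; the actual block operates on a-tree nodes and may cover different abbreviations.

```perl
use strict;
use warnings;

# Hypothetical abbreviation list (an assumption, not taken from the module).
# Each entry is the lowercased token sequence a Penn-style tokenizer
# would produce for the abbreviation.
my @ABBREVIATIONS = (
    [ 'e', '.', 'g', '.' ],
    [ 'i', '.', 'e', '.' ],
);

# Return a copy of the token list in which every known abbreviation
# sequence has been joined into a single token.
sub merge_abbreviations {
    my @tokens = @_;
    my @merged;
    my $i = 0;
    TOKEN:
    while ( $i < @tokens ) {
        for my $abbr (@ABBREVIATIONS) {
            my $len = scalar @$abbr;
            next if $i + $len > scalar @tokens;    # not enough tokens left
            my @window = @tokens[ $i .. $i + $len - 1 ];
            if ( lc( join( ' ', @window ) ) eq join( ' ', @$abbr ) ) {
                push @merged, join( '', @window );    # e . g . becomes e.g.
                $i += $len;
                next TOKEN;
            }
        }
        push @merged, $tokens[$i];    # ordinary token, copied unchanged
        $i++;
    }
    return @merged;
}

my @result = merge_abbreviations( 'Use', 'fruit', ',', 'e', '.', 'g', '.', 'apples', '.' );
print join( ' ', @result ), "\n";    # prints: Use fruit , e.g. apples .
```

The sentence-final period stays a separate token because it does not follow a listed abbreviation sequence.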

OVERRIDDEN METHODS

from Treex::Core::Block

process_atree

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2009 - 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.