NAME
Treex::Block::W2A::ResegmentSentences - split bundles which contain more than one sentence
VERSION
version 2.20151102
MOTIVATION
Some resources (most notably WMT newstest) are segmented into chunks of text which mostly correspond to sentences, but sometimes contain more than one sentence. Sometimes we want to process such documents in Treex and output (Write::*) the result in a format where one output segment corresponds to one input segment. (So e.g. for "one-sentence-per-line writers", we have the same number of input and output lines.)
However, most Treex blocks expect exactly one (linguistic) sentence in each bundle. The solution is to use the block W2A::ResegmentSentences after the reader and Misc::JoinBundles before the writer.
DESCRIPTION
If the sentence segmenter says that the current sentence is actually composed of two or more sentences, then new bundles are inserted after the current bundle, each containing just one piece of the resegmented original sentence.
This block should be executed before tokenization (and tagging etc.). It deals only with the (string) attribute sentence in each zone; it does not process any trees.
All zones are processed. The number of bundles created is determined by the number of subsegments in the "current" zone (specified by the parameters language and selector). If a zone contains fewer subsegments than the current one, the remaining bundles will contain an empty sentence. If a zone contains more subsegments than the current one, the remaining subsegments will be joined in the last bundle.
In other words, it is guaranteed that the current zone will not contain empty sentences.
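The padding and joining rules above can be illustrated with a small standalone sketch. It is not the code of this block: distribute_subsegments is a hypothetical helper, it works on plain strings rather than Treex zones, and joining surplus subsegments with a single space is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Treex.
# $n        ... number of subsegments in the "current" zone (assumed >= 1)
# @segments ... subsegments found in some other zone of the same bundle
# Returns exactly $n strings, one for each resulting bundle.
sub distribute_subsegments {
    my ($n, @segments) = @_;
    my @out;
    if (@segments > $n) {
        # More subsegments than bundles: keep the first $n-1 as they are
        # and join the rest into the last bundle (separator is an assumption).
        @out = @segments[0 .. $n - 2];
        push @out, join ' ', @segments[$n - 1 .. $#segments];
    }
    else {
        # Fewer (or equally many) subsegments: pad with empty sentences.
        @out = (@segments, ('') x ($n - @segments));
    }
    return @out;
}
```

With a current zone of two subsegments, distribute_subsegments(2, 'A.', 'B.', 'C.') yields ('A.', 'B. C.') and distribute_subsegments(2, 'X.') yields ('X.', '').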
As a special case, if the parameters language and selector define a zone which is not present in a bundle (this also holds for language=all), the "current" zone is the one with the most subsegments, i.e. no subsegments are joined.
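As a rough illustration of this fallback (again a sketch, not the block's actual code), the number of bundles could be chosen as follows. The helper target_bundle_count and the zone labels used as hash keys are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper, not part of Treex.
# $counts      ... hash reference: zone label => number of subsegments
# $wanted_zone ... label of the zone selected by language and selector
# Returns the number of bundles the original bundle would be split into.
sub target_bundle_count {
    my ($counts, $wanted_zone) = @_;

    # Normal case: the requested zone exists in this bundle.
    return $counts->{$wanted_zone}
        if defined $wanted_zone && exists $counts->{$wanted_zone};

    # Special case: the requested zone is missing, so the zone with the
    # most subsegments decides and nothing has to be joined.
    return max(values %$counts);
}
```

For the counts { en => 2, de => 3 }, asking for en gives 2 (the third German subsegment is joined to the second one), while asking for a zone that is missing from the bundle gives 3.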
PARAMETERS
remove (no|all|diff)
By setting the parameter remove you can delete some bundles. The default is remove=no. Setting remove=all will delete all bundles with more than one subsegment in the current zone. Setting remove=diff will delete all bundles that have (at least) two zones with different numbers of subsegments.
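The three values of remove can be summarized by the following sketch. It only mirrors the description above and is not taken from the block's source; should_remove_bundle and its argument layout are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Treex.
# $remove        ... 'no', 'all' or 'diff'
# $current_count ... number of subsegments in the current zone
# @all_counts    ... numbers of subsegments in all zones of the bundle
# Returns 1 if the bundle would be deleted, 0 otherwise.
sub should_remove_bundle {
    my ($remove, $current_count, @all_counts) = @_;

    if ($remove eq 'all') {
        # Delete every bundle that would have to be split.
        return $current_count > 1 ? 1 : 0;
    }
    if ($remove eq 'diff') {
        # Delete bundles whose zones disagree on the number of subsegments.
        my %distinct = map { $_ => 1 } @all_counts;
        return scalar(keys %distinct) > 1 ? 1 : 0;
    }
    # remove=no (the default): keep everything.
    return 0;
}
```

Under this reading, remove=diff keeps a bundle in which every zone was split into the same number of subsegments, whereas remove=all would delete it as soon as the current zone has more than one subsegment.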
SEE ALSO
Treex::Block::Misc::JoinBundles
AUTHOR
Zdeněk Žabokrtský <zabokrtsky@ufal.mff.cuni.cz>
Martin Popel <popel@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.