NAME
Treex::Block::W2A::Segment - rule-based segmentation into sentences
VERSION
version 2.20151102
SYNOPSIS
# in scenario
W2A::Segment use_paragraphs=1 use_lines=0
DESCRIPTION
Sentence boundaries are detected by regex rules that look for sentence-final punctuation ([.?!]) followed by an uppercase letter. This class is implemented in a pseudo-language-independent way, but it can serve as a base class for language-specific segmentation by overriding the method get_segments (using around; see Moose::Manual::MethodModifiers). The actual implementation is delegated to Treex::Tool::Segment::RuleBased.
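The core rule can be illustrated with a minimal sketch in plain Perl. It is not the code of Treex::Tool::Segment::RuleBased, which may apply further rules; the helper name naive_segments is hypothetical and used only for illustration.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, NOT part of Treex: split the text after [.?!]
# whenever the next non-space character is an uppercase letter.
sub naive_segments {
    my ($text) = @_;
    # The lookbehind keeps the punctuation with the preceding sentence;
    # the lookahead leaves the uppercase letter for the following one.
    return split /(?<=[.?!])\s+(?=\p{Lu})/, $text;
}

my @segments = naive_segments('It rained. We stayed in! Why not? e.g. this stays.');
print "$_\n" for @segments;
```

In this sketch the text is not split after "Why not?" or after "e.g.", because the next word starts with a lowercase letter. An abbreviation followed by a capitalized word would still be split, which is the kind of case a language-specific subclass could handle.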
ATTRIBUTES
use_paragraphs
Should paragraph boundaries be preserved as sentence boundaries? A paragraph boundary is defined as two or more consecutive newlines.
use_lines
Should newlines in the text be preserved as sentence boundaries? To detect sentence boundaries based on newlines alone and nothing else, use W2A::SegmentOnNewlines instead.
limit_words
Should very long segments (longer than the given number of words) be split? The word count is only approximate: it is obtained by counting whitespace, not by full tokenization. Set to zero to disable splitting completely. The default is 250, because longer sentences often cause the parser to fail.
detect_lists
Minimum number of words on a line to activate the list detection rules; 0 = never, 1 = always (default: 100). The number of words is determined by counting whitespace only.
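The boundary definitions and the approximate word count described above can be sketched in plain Perl as follows. This is an illustration under stated assumptions, not the block's implementation: the helpers split_on_boundaries and approx_word_count are hypothetical, and counting whitespace-separated pieces is only one plausible reading of "counting whitespace".

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the Treex implementation: split raw text on the
# boundaries that use_paragraphs and use_lines describe.
sub split_on_boundaries {
    my ($text, %opt) = @_;
    my @chunks = ($text);
    # use_paragraphs: two or more consecutive newlines close a chunk.
    @chunks = map { split /\n{2,}/, $_ } @chunks if $opt{use_paragraphs};
    # use_lines: every single newline closes a chunk.
    @chunks = map { split /\n/, $_ } @chunks if $opt{use_lines};
    # Drop chunks that contain only whitespace.
    return grep { /\S/ } @chunks;
}

# Hypothetical helper: approximate word count as the number of
# whitespace-separated pieces, without any real tokenization.
sub approx_word_count {
    my ($segment) = @_;
    my @words = split ' ', $segment;
    return scalar @words;
}

my $text       = "First line\nsecond line\n\nNew paragraph";
my @paragraphs = split_on_boundaries($text, use_paragraphs => 1);
my @lines      = split_on_boundaries($text, use_paragraphs => 1, use_lines => 1);
printf "%d paragraph chunks, %d line chunks\n", scalar @paragraphs, scalar @lines;
printf "%d words in the first chunk\n", approx_word_count($paragraphs[0]);
```

A count obtained this way treats punctuation attached to a word as part of that word, which is why values such as limit_words and detect_lists are best read as rough thresholds rather than exact token counts.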
SEE ALSO
Treex::Tool::Segment::RuleBased
Treex::Block::W2A::EN::Segment
AUTHOR
Martin Popel <popel@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.