NAME
Treex::Tool::Segment::RuleBased - Rule based pseudo language-independent sentence segmenter
VERSION
version 2.20151102
DESCRIPTION
Sentence boundaries are detected by regex rules that look for end-of-sentence punctuation ([.?!]) followed by an uppercase letter. This class is implemented in a pseudo language-independent way, but it can serve as an ancestor for language-specific segmenters, either by overriding the method segment_text (using around, see Moose::Manual::MethodModifiers) or just by overriding the methods unbreakers, openings and closings.
See Treex::Block::W2A::EN::Segment
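The basic rule described above can be illustrated with a minimal, self-contained sketch. It is not the module's code: naive_split is a hypothetical helper, and the regex is an assumption about what "punctuation followed by an uppercase letter" looks like in its simplest form.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Treex): insert a newline between
# end-of-sentence punctuation and a following uppercase letter,
# then split the text on those newlines.
sub naive_split {
    my ($text) = @_;
    $text =~ s/([.?!])\s+(\p{Upper})/$1\n$2/g;
    return split /\n/, $text;
}

my @sentences = naive_split('It rains. Is it cold? Ask Dr. Smith.');
print "$_\n" for @sentences;
# Prints four lines; the last two are "Ask Dr." and "Smith."
```

The wrong break after "Dr." shows why a language-specific descendant would want to extend unbreakers with abbreviations and titles.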
METHODS
- get_segments
-
Returns a list of sentences.
METHODS TO OVERRIDE
- segment_text
-
Does the segmentation (handling use_paragraphs and use_lines).
- $text = split_at_terminal_punctuation($text)
-
Adds newlines after terminal punctuation followed by an uppercase letter.
- $text = apply_contextual_rules($text)
-
Adds unbreakers (<<<DOT>>>) and hard breaks (\n) using the whole context, not just a single word.
- unbreakers
-
Returns a regex that should match tokens that usually do not end a sentence even if they are followed by a period and a capital letter:
  * single uppercase letters, which usually serve as first-name initials
  * in language-specific descendants, consider adding:
    * period-ending items that never indicate sentence breaks
    * titles before names of persons, etc.
- openings
-
Returns a string of characters that can appear before the first word of a sentence.
- closings
-
Returns a string of characters that can appear after a period (or other end-of-sentence symbol).
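How the overridable methods might fit together can be sketched as follows. This is a stand-in written in plain Perl OO rather than Moose so that it runs without Treex installed. The package names Sketch::Segmenter and Sketch::Segmenter::EN, the regexes, and the character sets are all assumptions made for illustration, not the module's actual implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base class (not the real Treex code).
package Sketch::Segmenter;

sub new { my ($class) = @_; return bless {}, $class }

# Hooks that descendants are meant to override.
sub unbreakers { return qr/\p{Upper}/ }    # single uppercase letters (initials)
sub openings   { return q{"'(} }           # may precede the first word
sub closings   { return q{"')} }           # may follow the final punctuation

# Add a newline after terminal punctuation followed by an uppercase letter.
sub split_at_terminal_punctuation {
    my ( $self, $text ) = @_;
    my ( $openings, $closings ) = ( $self->openings, $self->closings );
    $text =~ s/([.?!])([$closings]?)\s+([$openings]?\p{Upper})/$1$2\n$3/g;
    return $text;
}

sub get_segments {
    my ( $self, $text ) = @_;
    my $unbreakers = $self->unbreakers;

    # Mask periods after unbreakers so they cannot trigger a split.
    $text =~ s/\b($unbreakers)\./$1<<<DOT>>>/g;
    $text = $self->split_at_terminal_punctuation($text);

    # Restore the masked periods.
    $text =~ s/<<<DOT>>>/./g;
    return split /\n/, $text;
}

# Hypothetical English descendant: only unbreakers is overridden.
package Sketch::Segmenter::EN;
use parent -norequire, 'Sketch::Segmenter';

sub unbreakers { return qr/\p{Upper}|Dr|Mr|Mrs|Prof/ }

package main;

my $text = 'We met Dr. Smith and J. Doe. They left (early.) Then it rained.';
my @base = Sketch::Segmenter->new->get_segments($text);
my @en   = Sketch::Segmenter::EN->new->get_segments($text);
print scalar(@base), " vs ", scalar(@en), " segments\n";
# Prints "4 vs 3 segments": only the base class breaks after "Dr."
```

In this sketch the initial "J." is protected by both classes, while "Dr." is protected only once the descendant adds it to unbreakers. The closing parenthesis after "early." stays attached to its sentence because it is listed in closings.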
AUTHOR
Martin Popel <popel@ufal.mff.cuni.cz>
Ondřej Dušek <odusek@ufal.mff.cuni.cz>
COPYRIGHT AND LICENSE
Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.