NAME

Treex::Tool::Segment::RuleBased - Rule-based, pseudo language-independent sentence segmenter

VERSION

version 2.20151102

DESCRIPTION

Sentence boundaries are detected by regex rules that look for end-of-sentence punctuation ([.?!]) followed by an uppercase letter. This class is implemented in a pseudo language-independent way, but it can serve as an ancestor for language-specific segmenters, either by overriding the method segment_text (using around; see Moose::Manual::MethodModifiers) or simply by overriding the methods unbreakers, openings and closings.

See Treex::Block::W2A::EN::Segment

METHODS

get_segments

Returns a list of sentences.

METHODS TO OVERRIDE

segment_text

Performs the segmentation (handling use_paragraphs and use_lines).

$text = split_at_terminal_punctuation($text)

Adds newlines after terminal punctuation followed by an uppercase letter.
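The rule described above can be illustrated with a minimal core-Perl sketch. This is not the module's actual implementation: the function name split_at_terminal_punctuation_sketch and the exact regex are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical stand-in, not the module's code: replace the whitespace after
# [.?!] with a newline when the next character is an uppercase letter.
sub split_at_terminal_punctuation_sketch {
    my ($text) = @_;
    $text =~ s/([.?!])\s+(?=\p{Lu})/$1\n/g;
    return $text;
}

my $text      = 'It works. Does it? Yes! see e.g. this.';
my @sentences = split /\n/, split_at_terminal_punctuation_sketch($text);
print "$_\n" for @sentences;
```

Note that "Yes! see" is not split in this sketch, because the word after the exclamation mark starts with a lowercase letter.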

$text = apply_contextual_rules($text)

Adds unbreakers (<<<DOT>>>) and hard breaks (\n) using the whole context, not just a single word.

unbreakers

Returns a regex that should match tokens that usually do not end a sentence even if they are followed by a period and a capital letter:

* single uppercase letters, which usually serve as first-name initials
* in language-specific descendants, consider adding:
    * period-ending items that never indicate sentence breaks
    * titles before names of persons, etc.
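How an unbreakers regex might interact with the <<<DOT>>> placeholder can be sketched in core Perl as follows. The pattern in $UNBREAKERS (initials plus a few English titles) and the function segment_sketch are hypothetical and chosen only for illustration; the module's own rules may differ.

```perl
use strict;
use warnings;

# Hypothetical unbreakers pattern: single uppercase letters (initials)
# plus a few English titles that a language-specific descendant might add.
my $UNBREAKERS = qr/\p{Lu}|Mr|Mrs|Dr|Prof/;

sub segment_sketch {
    my ($text) = @_;

    # Protect the period after an unbreaker so that it cannot trigger a split.
    $text =~ s/\b($UNBREAKERS)\./$1<<<DOT>>>/g;

    # Break after terminal punctuation followed by an uppercase letter.
    $text =~ s/([.?!])\s+(?=\p{Lu})/$1\n/g;

    # Restore the protected periods.
    $text =~ s/<<<DOT>>>/./g;

    return split /\n/, $text;
}

my @segments = segment_sketch('Dr. J. Smith arrived. He met Prof. Brown.');
print "$_\n" for @segments;
```

In this sketch the periods after "Dr", "J" and "Prof" are protected, so the only break falls between "arrived." and "He".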

openings

Returns a string of characters that can appear before the first word of a sentence.

closings

Returns a string of characters that can appear after a period (or another end-of-sentence symbol).
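One way such opening and closing characters could feed into a boundary rule is sketched below in core Perl. The character sets in $OPENINGS and $CLOSINGS and the function split_with_quotes_sketch are assumptions for illustration, not the values or code used by this class.

```perl
use strict;
use warnings;

# Hypothetical character sets, escaped for use inside a character class.
my $OPENINGS = quotemeta q{"'([};    # may precede the first word of a sentence
my $CLOSINGS = quotemeta q{"')]};    # may follow the end-of-sentence symbol

sub split_with_quotes_sketch {
    my ($text) = @_;

    # Keep closing characters attached to the sentence they end, and allow
    # opening characters before the uppercase letter that starts the next one.
    $text =~ s/([.?!][${CLOSINGS}]*)\s+(?=[${OPENINGS}]*\p{Lu})/$1\n/g;

    return split /\n/, $text;
}

my @segments = split_with_quotes_sketch('He said "Stop." (Then he left.) "Why?" she asked.');
print "$_\n" for @segments;
```

Here the closing quote and parenthesis stay with the preceding sentence, and no break is made before "she" because it starts with a lowercase letter.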

AUTHOR

Martin Popel <popel@ufal.mff.cuni.cz>

Ondřej Dušek <odusek@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.