NAME

Uplug::PreProcess::SentDetect - Moses/Europarl sentence boundary detector

SYNOPSIS

use Uplug::PreProcess::SentDetect;
my $splitter = Uplug::PreProcess::SentDetect->new (lang => 'en');
my $text = 'This is a paragraph. It contains several sentences. "But why," you ask?';
print $splitter->split($text);
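
Because the result is a single newline-separated string with one sentence per line, it can be turned into a list with a plain split. The following is a minimal sketch, not part of the module: the literal string stands in for the return value of $splitter->split($text) so that the snippet runs without Uplug installed, and the trailing newline on it is an assumption.

```perl
use strict;
use warnings;

# Stand-in for the string returned by $splitter->split($text).
# The trailing newline is an assumption; the grep below tolerates its absence.
my $split_text = join "\n",
    'This is a paragraph.',
    'It contains several sentences.',
    '"But why," you ask?',
    '';

# One sentence per line, so splitting on newlines yields one element each.
my @sentences = grep { length } split /\n/, $split_text;

printf "%d: %s\n", $_ + 1, $sentences[$_] for 0 .. $#sentences;
```

Each element of @sentences then holds exactly one sentence, which is convenient for further per-sentence processing.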

DESCRIPTION

This module is essentially a copy of Lingua::Sentence by Achim Ruopp, adapted to Uplug. Lingua::Sentence is in turn based on tools developed for Moses and the Europarl corpus. All credit goes to the original authors. This version includes some additional nonbreaking prefix files.

This module allows splitting of text paragraphs into sentences. It is based on scripts developed by Philipp Koehn and Josh Schroeder for processing the Europarl corpus (http://www.statmt.org/europarl/).

The module uses punctuation and capitalization clues to split paragraphs into a newline-separated string with one sentence per line. For example:

This is a paragraph. It contains several sentences. "But why," you ask?

goes to:

This is a paragraph.
It contains several sentences.
"But why," you ask?

Languages currently supported by the module are:

ca (Catalan)
da (Danish)
de (German)
el (Greek)
en (English)
es (Spanish)
fr (French)
is (Icelandic)
it (Italian)
nl (Dutch)
pl (Polish)
pt (Portuguese)
ro (Romanian)
ru (Russian)
sk (Slovak)
sl (Slovene)
sv (Swedish)

Nonbreaking-Prefixes Files

Nonbreaking prefixes are loosely defined as words ending in a period where the period does NOT mark the end of a sentence. Basic examples are Mr. and Ms. in English.

The sentence splitter module uses the nonbreaking prefix files included in this distribution.

To add a file for another language, follow the naming convention nonbreaking_prefix.??, where ?? is the two-letter language code you intend to use when creating a Uplug::PreProcess::SentDetect object.

The sentence splitter module will first look for a file for the language it is processing, and fall back to English if a file for that language is not found.
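
The lookup order described above can be sketched as follows. This is an illustration only, not the module's actual code: find_prefix_file is a hypothetical helper, and the directory is a temporary one created for the demonstration rather than the location the module really searches.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: return the prefix file for $lang in $dir, or the
# English file when no language-specific file exists.
sub find_prefix_file {
    my ($dir, $lang) = @_;
    for my $code ($lang, 'en') {
        my $path = File::Spec->catfile($dir, "nonbreaking_prefix.$code");
        return $path if -e $path;
    }
    return;    # neither file found
}

# Demo with a throwaway directory holding English and German files only.
my $dir = tempdir(CLEANUP => 1);
for my $code (qw(en de)) {
    my $path = File::Spec->catfile($dir, "nonbreaking_prefix.$code");
    open my $fh, '>', $path or die "Cannot write $path: $!";
    close $fh;
}

for my $lang (qw(de fi)) {
    my $file = basename(find_prefix_file($dir, $lang));
    print "$lang -> $file\n";
}
```

In this demo the code de resolves to its own file, while fi, which has no file in the directory, resolves to the English one.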

Normally, the splitter treats a period followed by an uppercase word as a sentence boundary. If the word preceding the period is a nonbreaking prefix, no line break is inserted.
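
A heavily simplified sketch of this rule is shown below. It is not the module's implementation: naive_split is a hypothetical function, the prefix set is a tiny hand-written sample, and quotation marks, question marks and other punctuation handled by the real splitter are ignored.

```perl
use strict;
use warnings;

# Tiny illustrative prefix set; the real lists live in the prefix files.
my %nonbreaking = map { $_ => 1 } qw(Mr Ms Dr);

# Simplified rule: break after "word." when the next word starts with an
# uppercase letter, unless the word before the period is a known prefix.
sub naive_split {
    my ($text) = @_;
    my @words = split ' ', $text;
    my $out   = '';
    for my $i (0 .. $#words) {
        my $word = $words[$i];
        $out .= $word;
        last if $i == $#words;

        # $stem is the word without its final period, or undef if the
        # word does not end in a period.
        my ($stem) = $word =~ /^(.*)\.$/;
        my $break = 0;
        if (defined $stem && $words[$i + 1] =~ /^\p{Lu}/) {
            $break = !$nonbreaking{$stem};
        }
        $out .= $break ? "\n" : ' ';
    }
    return "$out\n";
}

print naive_split('Mr. Smith arrived late. He apologized to Ms. Jones.');
```

Here the periods after Mr and Ms do not trigger a break, while the period after "late" does, so the output has two lines.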

A special class of prefixes, NUMERIC_ONLY, covers cases where the prefix should be treated as nonbreaking ONLY when it is followed by a number. For example, in "Article No. 24 states this." the No. is a nonbreaking prefix. However, in "No. It is not true." No is an ordinary word and the period ends the sentence.

See the example prefix files included in the distribution for more examples.
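
The sketch below illustrates how such a file might be read and how the NUMERIC_ONLY distinction could be applied. The file layout is an assumption borrowed from the Moses prefix files (one prefix per line, '#' starting a comment, a trailing "#NUMERIC_ONLY#" marker) and should be checked against the files shipped with this distribution; is_protected is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Assumed file layout (check against the shipped files): one prefix per
# line, '#' starts a comment, and a trailing "#NUMERIC_ONLY#" marks
# prefixes that only protect a following number.
my $sample = <<'END';
# titles
Mr
Ms
No #NUMERIC_ONLY#
END

my (%always, %numeric_only);
for my $line (split /\n/, $sample) {
    if ($line =~ /^(\S+)\s+#NUMERIC_ONLY#/) {
        $numeric_only{$1} = 1;
    }
    elsif ($line =~ /^([^#\s]\S*)/) {
        $always{$1} = 1;
    }
}

# Hypothetical helper: does the period after $prefix suppress a break
# before $next_word?
sub is_protected {
    my ($prefix, $next_word) = @_;
    return 1 if $always{$prefix};
    return 1 if $numeric_only{$prefix} && $next_word =~ /^[0-9]/;
    return 0;
}

for my $pair ([No => '24'], [No => 'It'], [Mr => 'Smith']) {
    my ($prefix, $next) = @$pair;
    printf "%s. %s -> %s\n", $prefix, $next,
        is_protected($prefix, $next) ? 'no break' : 'break allowed';
}
```

With this sample, "No. 24" and "Mr. Smith" are reported as protected, while "No. It" is not, matching the two example sentences above.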

CONSTRUCTOR

The constructor is called as follows:

Uplug::PreProcess::SentDetect->new (lang => $lang_id)

Instantiate an object to split sentences in language $lang_id. If the language is not supported, a splitter object for English will be instantiated.

CREDITS

Thanks to the following individuals for supplying nonbreaking prefix files: Bas Rozema (Dutch), Hilário Leal Fontes (Portuguese), Jesús Giménez (Catalan & Spanish).

SUPPORT

Bugs should always be submitted via the project hosting bug tracker

http://code.google.com/p/corpus-tools/issues/list

For other issues, contact the maintainer.

SEE ALSO

Text::Sentence, Lingua::EN::Sentence, Lingua::DE::Sentence, Lingua::HE::Sentence

AUTHOR

Lingua::Sentence: Achim Ruopp, <achimru@gmail.com>

Adapted to Uplug: Joerg Tiedemann

COPYRIGHT AND LICENSE

Copyright (C) 2010 by Digital Silk Road

Portions Copyright (C) 2005 by Philipp Koehn and Josh Schroeder (used with permission)

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.