NAME
Lingua::HE::Sentence - Module for splitting Hebrew text into sentences.
SYNOPSIS
use Lingua::HE::Sentence qw( get_sentences );
my $sentences=get_sentences($text); ## Get the sentences.
foreach my $sentence (@$sentences) {
## do something with $sentence
}
DESCRIPTION
The Lingua::HE::Sentence
module contains the function get_sentences, which splits Hebrew text into its constituent sentences, based on regular expressions.
The module assumes text encoded in UTF-8. Supporting other input formats will be added upon request.
HEBREW DETAILS
Language: Hebrew Language ID: he MS Locale ID: 1037 ISO 639-1: he ISO 639-2 (MARC): heb ISO 8859 (charset): 8859-8 ANSI codepage: 1255 Unicode: 0590-05FF
PROBLEM DESCRIPTION
Many applications in natural language processing require some knowledge of sentence boundaries. The problem of properly locating sentence bonudaries in text in Hebrew is in many ways less severe than the same problem in other languages. The purpose of this module is to supply Perl users with a tool which can take plain text in Hebrew and get an ordered list of the sentences in the text.
PROPERTIES OF HEBREW SENTENCES
The following facts are part of the guidelines given by the 'academy of the Hebrew language'.
Sentences usually end with one of the following punctuation symbols: . a dot ? a question mark ! an exclamation mark
No dot should be placed after sentences on titles (such as book names, chpter titles etc.)
A dot can be placed after letters and numbers used for listing items, chapters etc., as long as these letters or numbers are not placed on a special line. When these letters or numbers appear alone, no dot should succeed them. Brackets or a closing bracket can be used instead of a dot in this case.
Decimal point should be represented with a dot and not a comma in order to distinguish the number from its decimal fraction.
In some rare cases semicolons also represent end of sentence, but usually the sentences separated by sa semicolor are practically one long sentence. I chose not to split on semicolons at all.
ASSUMPTIONS
Input text is assumed to be represented in UTF-8
Input text is assumed to have some structure, i.e. titles are separated from the rest of the text with at least a couple of newline characters ('\n').
Input is expected to follow the PROPERTIES listed above.
Complex sentences should be further segmented using clause identificatoin algorithms, this module will not provide (at least in this version) any support for clause identification and segmentation.
FUNCTIONS
All functions used should be requested in the 'use' clause. None is exported by default.
- get_sentences( $text )
-
The get sentences function takes a scalar containing ascii text as an argument and returns a reference to an array of sentences that the text has been split into. Returned sentences will be trimmed (beginning and end of sentence) of white-spaces. Strings with no alpha-numeric characters in them, won't be returned as sentences.
- get_EOS( )
-
This function returns the value of the string used to mark the end of sentence. You might want to see what it is, and to make sure your text doesn't contain it. You can use set_EOS() to alter the end-of-sentence string to whatever you desire.
- set_EOS( $new_EOS_string )
-
This function alters the end-of-sentence string used to mark the end of sentences.
BUGS
No proper handling of sentence boundaries within and in presence or quotes (either single or dounle).
FUTURE WORK (in no particular order)
- [0] Write tests!
- [1] Object Oriented like usage.
- [2] Supporting more encodings/charsets.
- [3] Code cleanup and optimization.
- [4] Fix bugs.
- [5] Generate sentencizer based on supervised learning. (requires tagged texts...)
SEE ALSO
Lingua::EN::Sentence
AUTHOR
Shlomo Yona shlomo@cs.haifa.ac.il
COPYRIGHT
Copyright (c) 2001-2004 Shlomo Yona. All rights reserved.
This library is free software. You can redistribute it and/or modify it under the same terms as Perl itself.
4 POD Errors
The following errors were encountered while parsing the POD:
- Around line 76:
'=item' outside of any '=over'
- Around line 90:
You forgot a '=back' before '=head1'
- Around line 96:
'=item' outside of any '=over'
- Around line 108:
You forgot a '=back' before '=head1'