NAME
Lingua::LO::NLP::Syllabify - Segment Lao or mixed-script text into syllables.
FUNCTION
This module implements a purely regular-expression-based algorithm for segmenting Lao text into syllables. It is based on the algorithm described in Phissamay et al., "Syllabification of Lao Script for Line Breaking".
METHODS
new
new( $text, %options )
The constructor takes a mandatory argument containing the text to split, followed by any number of hash-style named options. Currently, the only such option is normalize, which takes a boolean value and indicates whether to run the text through a normalization function that swaps tone marks and vowels appearing in the wrong order.
Note that in any case the text is first passed through "NFC" in Unicode::Normalize to obtain the Composed Normal Form. In pure Lao text, this affects only the decomposed form of LAO VOWEL SIGN AM, which is transformed from U+0EB2, U+0ECD to U+0EB3.
get_syllables
get_syllables()
Returns a list of the Lao syllables found in the text passed to the constructor. Blanks, non-Lao parts and anything else mixed in are silently dropped.
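A minimal usage sketch of the constructor and get_syllables, using only the calls documented here. The sample text and the variable names are chosen for illustration, and the script assumes the module is installed and that the input is a decoded character string.

```perl
use strict;
use warnings;
use utf8;                            # this source file contains Lao literals
use Lingua::LO::NLP::Syllabify;

binmode STDOUT, ':encoding(UTF-8)';  # print Lao characters correctly

# Sample input (an assumption for this sketch); any Lao text will do.
my $text = "ສະບາຍດີ";

# normalize => 1 requests the reordering of misplaced tone marks and vowels.
my $syllabifier = Lingua::LO::NLP::Syllabify->new( $text, normalize => 1 );

# Only valid Lao syllables are returned; everything else is dropped.
my @syllables = $syllabifier->get_syllables;

print join( ' | ', @syllables ), "\n";
```

Because anything that is not a valid syllable is dropped, the original text cannot in general be rebuilt from this list; use get_fragments when the complete input has to be preserved.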
get_fragments
get_fragments()
Returns a complete segmentation of the text passed to the constructor as an array of hashes. Each hash has two keys:
text - The text of the respective fragment.
is_lao - If true, the fragment is a single valid Lao syllable. If false, it may be whitespace, non-Lao script, or Lao characters that don't constitute a valid syllable; in short, anything that is not a valid syllable.
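The following sketch walks over the result of get_fragments for mixed-script input. It assumes that the method returns a list of hash references in list context, and the sample text and variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;                            # this source file contains Lao literals
use Lingua::LO::NLP::Syllabify;

binmode STDOUT, ':encoding(UTF-8)';

# Mixed Lao and Latin input (an assumption for this sketch).
my $text = "ສະບາຍດີ hello";

# Assumption: get_fragments returns a list of hash references.
my @fragments = Lingua::LO::NLP::Syllabify->new($text)->get_fragments;

# Label every fragment according to its is_lao flag.
for my $fragment (@fragments) {
    my $kind = $fragment->{is_lao} ? 'lao' : 'other';
    print "$kind\t[$fragment->{text}]\n";
}

# Keep only the text of fragments flagged as valid Lao syllables.
my @lao_only = map { $_->{text} } grep { $_->{is_lao} } @fragments;
```

Since the segmentation is complete, concatenating the text values of all fragments should give back the input after NFC normalization.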