NAME
Lingua::Diversity::Utils - utility subroutines for users of classes derived from Lingua::Diversity
VERSION
This documentation refers to Lingua::Diversity::Utils version 0.01.
SYNOPSIS
    use Lingua::Diversity::Utils qw( split_text split_tagged_text );

    my $text = 'of the people, by the people, for the people';

    # Get a reference to an array of words...
    my $word_array_ref = split_text(
        'text'   => \$text,
        'regexp' => qr{[^a-zA-Z]+},
    );

    # Alternatively, tag the text using Lingua::TreeTagger...
    use Lingua::TreeTagger;
    my $tagger = Lingua::TreeTagger->new(
        'language' => 'english',
        'options'  => [ qw( -token -lemma -no-unknown ) ],
    );
    my $tagged_text = $tagger->tag_text( \$text );

    # ... get a reference to an array of words...
    # (call the imported subroutine directly, not as a class method)
    $word_array_ref = split_tagged_text(
        'tagged_text' => $tagged_text,
        'unit'        => 'original',
    );

    # ... or get a reference to an array of wordforms and an array of lemmas.
    my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
        'tagged_text' => $tagged_text,
        'unit'        => 'original',
        'category'    => 'lemma',
    );
DESCRIPTION
This module provides utility subroutines intended to facilitate the use of a class derived from Lingua::Diversity.
SUBROUTINES
split_text()
-
Split a text into units (typically words), delete empty units, and return a reference to the array of units.
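The subroutine requires one named parameter and may take up to two of them.
- text (required)
-
A reference to the string containing the text to be split.
- regexp
-
A regular expression matching the delimiter sequences that separate units, e.g. qr{[^a-zA-Z]+}.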
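The behaviour described for split_text() can be approximated in plain Perl. The following is a minimal sketch, not the module's actual code: the name sketch_split_text is hypothetical, and the whitespace default used when 'regexp' is omitted is an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for split_text(); not the module's implementation.
sub sketch_split_text {
    my (%parameter) = @_;

    # 'text' is required and must be a reference to a string.
    die "Missing parameter 'text'\n" if !defined $parameter{'text'};

    # ASSUMPTION: fall back to runs of whitespace when no regexp is given.
    my $regexp = $parameter{'regexp'} // qr{\s+};

    # Split on delimiter sequences, then delete empty units.
    my @units = grep { length $_ } split $regexp, ${ $parameter{'text'} };

    return \@units;
}

my $text  = 'of the people, by the people, for the people';
my $units = sketch_split_text(
    'text'   => \$text,
    'regexp' => qr{[^a-zA-Z]+},
);

# One unit per word, in order of appearance.
print join( q{ }, @{$units} ), "\n";
```

Filtering with length rather than plain truthiness keeps a unit such as '0', which would otherwise be discarded along with the empty strings.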
split_tagged_text()
-
Given a Lingua::TreeTagger::TaggedText object, return a reference to the array of units (e.g. wordforms). Optionally, return a second reference to the array of categories (e.g. lemmas).
The subroutine requires two named parameters and may take up to three of them.
- tagged_text (required)
-
The Lingua::TreeTagger::TaggedText object to be split.
- unit (required)
-
The Lingua::TreeTagger::Token attribute (either 'original', 'lemma', or 'tag') that should be used to build the unit array. NB: make sure the requested attribute is available in the Lingua::TreeTagger::TaggedText object!
- category
-
The Lingua::TreeTagger::Token attribute (either 'lemma' or 'tag') that should be used to build the category array. NB: make sure the requested attribute is available in the Lingua::TreeTagger::TaggedText object!
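To illustrate how the 'unit' and 'category' parameters relate, here is a sketch that uses plain hashes as stand-ins for Lingua::TreeTagger::Token objects. The subroutine name sketch_split_tokens, the sample tokens, and their tag values are all hypothetical. It also assumes that the two returned arrays are parallel, i.e. that element i of each array comes from the same token.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for tokens: hashes holding the three attributes
# named above. The lemma and tag values are illustrative only.
my @tokens = (
    { 'original' => 'the',   'lemma' => 'the',   'tag' => 'DT'  },
    { 'original' => 'cats',  'lemma' => 'cat',   'tag' => 'NNS' },
    { 'original' => 'slept', 'lemma' => 'sleep', 'tag' => 'VVD' },
);

# Hypothetical helper: build the unit array and, if a category attribute
# is requested, a second array of the same length.
sub sketch_split_tokens {
    my ( $tokens_ref, $unit, $category ) = @_;

    my @units = map { $_->{$unit} } @{$tokens_ref};
    return \@units if !defined $category;

    my @categories = map { $_->{$category} } @{$tokens_ref};
    return ( \@units, \@categories );
}

my ( $wordforms, $lemmas )
    = sketch_split_tokens( \@tokens, 'original', 'lemma' );

# Print each wordform next to its lemma.
for my $i ( 0 .. $#{$wordforms} ) {
    print "$wordforms->[$i]\t$lemmas->[$i]\n";
}
```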
DIAGNOSTICS
- Missing parameter 'text' in call to subroutine split_text()
-
This exception is raised when subroutine split_text() is called without a parameter named 'text' (whose value should be a reference to a string).
- Missing parameter 'tagged_text' in call to subroutine split_tagged_text()
-
This exception is raised when subroutine split_tagged_text() is called without a parameter named 'tagged_text'.
- Parameter 'tagged_text' in call to subroutine split_tagged_text() must be a Lingua::TreeTagger::TaggedText object
-
This exception is raised when subroutine split_tagged_text() is called with a parameter named 'tagged_text' whose value is not a Lingua::TreeTagger::TaggedText object.
- Missing parameter 'unit' in call to subroutine split_tagged_text()
-
This exception is raised when subroutine split_tagged_text() is called without a parameter named 'unit'.
- Parameter 'unit' in call to subroutine split_tagged_text() must be either 'original', 'lemma', or 'tag'
-
This exception is raised when subroutine split_tagged_text() is called with a parameter named 'unit' whose value is not 'original', 'lemma', or 'tag'.
- Parameter 'category' in call to subroutine split_tagged_text() must be either 'lemma' or 'tag'
-
This exception is raised when subroutine split_tagged_text() is called with a parameter named 'category' whose value is not 'lemma' or 'tag'.
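Callers who prefer to recover from these exceptions can trap them with eval. The sketch below uses a hypothetical stand-in, sketch_checked_split, which dies with a plain string carrying the first message listed above. Whether the module itself throws strings or exception objects is not stated here, so the string form is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in that fails the way split_text() is documented to
# when 'text' is missing. ASSUMPTION: the exception is a plain string.
sub sketch_checked_split {
    my (%parameter) = @_;

    die "Missing parameter 'text' in call to subroutine split_text()\n"
        if !exists $parameter{'text'};

    return [ grep { length $_ } split qr{\s+}, ${ $parameter{'text'} } ];
}

# Trap the exception instead of letting it terminate the program.
my $units = eval { sketch_checked_split( 'regexp' => qr{\s+} ) };
my $error = $@;

if ( !defined $units ) {
    print "Could not split text: $error";
}
```

Copying $@ into a lexical immediately after the eval avoids losing the message if later code resets it.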
DEPENDENCIES
This module is part of the Lingua::Diversity distribution. Some subroutines are designed to operate on Lingua::TreeTagger::TaggedText objects.
BUGS AND LIMITATIONS
There are no known bugs in this module.
Please report problems to Aris Xanthos (aris.xanthos@unil.ch).
Patches are welcome.
AUTHOR
Aris Xanthos (aris.xanthos@unil.ch)
LICENSE AND COPYRIGHT
Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).
This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.