NAME

Lingua::Diversity::Utils - utility subroutines for users of classes derived from Lingua::Diversity

VERSION

This documentation refers to Lingua::Diversity::Utils version 0.02.

SYNOPSIS

use Lingua::Diversity::Utils qw( split_text split_tagged_text );

my $text = 'of the people, by the people, for the people';

# Get a reference to an array of words...
my $word_array_ref = split_text(
    'text'      => \$text,
    'regexp'    => qr{[^a-zA-Z]+},
);

# Alternatively, tag the text using Lingua::TreeTagger...
use Lingua::TreeTagger;
my $tagger = Lingua::TreeTagger->new(
    'language' => 'english',
    'options'  => [ qw( -token -lemma -no-unknown ) ],
);
my $tagged_text = $tagger->tag_text( \$text );

# ... get a reference to an array of words...
$word_array_ref = split_tagged_text(
    'tagged_text'   => $tagged_text,
    'unit'          => 'original',
);

# ... or get a reference to an array of wordforms and an array of lemmas.
my ( $wordform_array_ref, $lemma_array_ref ) = split_tagged_text(
    'tagged_text'   => $tagged_text,
    'unit'          => 'original',
    'category'      => 'lemma',
);

DESCRIPTION

This module provides utility subroutines intended to facilitate the use of a class derived from Lingua::Diversity.

SUBROUTINES

split_text()

Split a text into units (typically words), delete empty units, and return a reference to the array of units.

The subroutine takes up to two named parameters, of which one is required.

text (required)

A reference to the text to be split.

regexp

A reference to a regular expression describing unit delimiter sequences. Default is qr{\s+}.
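The behaviour described above (split on a delimiter pattern, then discard empty units) can be sketched with core Perl alone. The subroutine name split_text_sketch is hypothetical and this is not the module's actual implementation; it only assumes the named parameters and the default qr{\s+} documented here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for split_text(), written with core Perl only.
sub split_text_sketch {
    my (%parameter) = @_;

    die "Missing parameter 'text'\n" if !defined $parameter{'text'};

    # Fall back to the documented default delimiter pattern.
    my $regexp = defined $parameter{'regexp'} ? $parameter{'regexp'}
               :                                qr{\s+};

    # Split the referenced text, then drop empty units (such as the one
    # produced when the text begins with a delimiter).
    my @units = grep { length } split $regexp, ${ $parameter{'text'} };

    return \@units;
}

my $text      = 'of the people, by the people, for the people';
my $units_ref = split_text_sketch(
    'text'   => \$text,
    'regexp' => qr{[^a-zA-Z]+},
);
print join( q{ }, @{$units_ref} ), "\n";
```

Passing the text by reference, as the real subroutine requires, avoids copying a potentially large string into the parameter list.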

split_tagged_text()

Given a Lingua::TreeTagger::TaggedText object, return a reference to the array of units (e.g. wordforms). Optionally, return a second reference to the array of categories (e.g. lemmas).

The subroutine takes up to three named parameters, of which two are required.

tagged_text (required)

The Lingua::TreeTagger::TaggedText object to be split.

unit (required)

The Lingua::TreeTagger::Token attribute (either 'original', 'lemma', or 'tag') that should be used to build the unit array. NB: make sure the requested attribute is available in the Lingua::TreeTagger::TaggedText object!

category

The Lingua::TreeTagger::Token attribute (either 'lemma' or 'tag') that should be used to build the category array. NB: make sure the requested attribute is available in the Lingua::TreeTagger::TaggedText object!
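To illustrate how the unit and category arrays relate, here is a minimal sketch that uses plain hashes in place of Lingua::TreeTagger::Token objects, so that it runs without TreeTagger installed. The subroutine split_tokens_sketch, its 'tokens' parameter, and the token values are all assumptions made for illustration; they are not part of this module and not real tagger output.

```perl
use strict;
use warnings;

# Plain hashes standing in for token objects (illustrative values only).
my @tokens = (
    { 'original' => 'governments', 'lemma' => 'government', 'tag' => 'NNS' },
    { 'original' => 'perished',    'lemma' => 'perish',     'tag' => 'VBD' },
);

# Hypothetical helper: build the unit array and, if requested, a category
# array with one entry per unit, in the same order.
sub split_tokens_sketch {
    my (%parameter) = @_;

    my $unit     = $parameter{'unit'};
    my $category = $parameter{'category'};

    my @units = map { $_->{$unit} } @{ $parameter{'tokens'} };
    return \@units if !defined $category;

    my @categories = map { $_->{$category} } @{ $parameter{'tokens'} };
    return ( \@units, \@categories );
}

my ( $wordforms_ref, $lemmas_ref ) = split_tokens_sketch(
    'tokens'   => \@tokens,
    'unit'     => 'original',
    'category' => 'lemma',
);

# Both arrays are aligned index by index.
for my $index ( 0 .. $#{$wordforms_ref} ) {
    print "$wordforms_ref->[$index]\t$lemmas_ref->[$index]\n";
}
```

The point of the sketch is the alignment: the entry at a given index of the category array describes the unit at the same index.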

DIAGNOSTICS

Missing parameter 'text' in call to subroutine split_text()

This exception is raised when subroutine split_text() is called without a parameter named 'text' (whose value should be a reference to a string).

Missing parameter 'tagged_text' in call to subroutine split_tagged_text()

This exception is raised when subroutine split_tagged_text() is called without a parameter named 'tagged_text'.

Parameter 'tagged_text' in call to subroutine split_tagged_text() must be a Lingua::TreeTagger::TaggedText object

This exception is raised when subroutine split_tagged_text() is called with a parameter named 'tagged_text' whose value is not a Lingua::TreeTagger::TaggedText object.

Missing parameter 'unit' in call to subroutine split_tagged_text()

This exception is raised when subroutine split_tagged_text() is called without a parameter named 'unit'.

Parameter 'unit' in call to subroutine split_tagged_text() must be either 'original', 'lemma', or 'tag'

This exception is raised when subroutine split_tagged_text() is called with a parameter named 'unit' whose value is not 'original', 'lemma', or 'tag'.

Parameter 'category' in call to subroutine split_tagged_text() must be either 'lemma' or 'tag'

This exception is raised when subroutine split_tagged_text() is called with a parameter named 'category' whose value is not 'lemma' or 'tag'.
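Assuming these exceptions are raised in the usual Perl way (with die or Carp's croak), a caller can trap them with eval and inspect the message. The sketch below uses a hypothetical stub, split_text_stub, that raises the first message listed above; it stands in for the real subroutine so that the example runs on its own.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical stub raising the documented exception; the real
# subroutine is not loaded here.
sub split_text_stub {
    my (%parameter) = @_;

    croak q{Missing parameter 'text' in call to subroutine split_text()}
        if !exists $parameter{'text'};

    return [];
}

# Call without the required 'text' parameter and trap the exception.
my $result_ref = eval { split_text_stub( 'regexp' => qr{\s+} ) };
my $error      = $@;

if ( !defined $result_ref ) {
    print "Caught: $error";
}
```

Matching on the start of the message is more robust than comparing the whole string, since file and line information is appended to it.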

DEPENDENCIES

This module is part of the Lingua::Diversity distribution. Some subroutines are designed to operate on Lingua::TreeTagger::TaggedText objects.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity, Lingua::TreeTagger.