NAME

Lingua::Thesaurus - Thesaurus management

SYNOPSIS

Creating a thesaurus

my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);
$thesaurus->load($io_class => @files);
$thesaurus->load($io_class => {$origin1 => $file1, ...});
$thesaurus->load($io_class => {files => \@files,
                               params  => {termClass => ..,
                                           relTypeClass => ..}});

Using a thesaurus

my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);

my @terms = $thesaurus->search_terms('*foo*');
my $term  = $thesaurus->fetch_term('foobar');

my $scope_note = $term->SN; # returns a string
my @synonyms   = $term->UF; # returns a list of other terms

foreach my $pair ($term->related(qw/NT RT/)) {
  my ($rel_type, $item) = @$pair;
  printf "  %s(%s) = %s\n", $rel_type->description, $rel_type->rel_id, $item;
}

# transitive search
foreach my $quadruple ($term->transitively_related(qw/NT/)) {
  my ($rel_type, $related_term, $through_term, $level) = @$quadruple;
  printf "  %s(%s): %s (through %s)\n",
     $rel_type->rel_id,
     $level,
     $related_term->string,
     $through_term->string;
}

DESCRIPTION

This distribution manages thesauri. A thesaurus is a list of terms, with some relations between them (for example "broader term" / "narrower term"). Relations are either "internal" (between two terms), or "external" (between a term and some external data, for example a "Scope Note"). Relations may have a reciprocal; see Lingua::Thesaurus::RelType.

Thesauri are loaded from one or several IO formats; usually this will be the ISO 2788 format, or some derivative of it. See classes under the Lingua::Thesaurus::IO namespace for various implementations.

Once loaded, thesauri are stored via a storage class; this is meant to be an efficient internal structure for supporting searches. Currently, only Lingua::Thesaurus::Storage::SQLite is implemented; but the architecture allows for other storage classes to be defined, as long as they comply with the Lingua::Thesaurus::Storage role.

Terms are retrieved through the "search_terms" and "fetch_term" methods. The results are instances of Lingua::Thesaurus::Term; these objects have navigation methods for retrieving related terms.

This distribution was originally designed for the Swiss thesaurus for justice "Jurivoc" (see Lingua::Thesaurus::IO::Jurivoc). However, the framework should be easily extensible to other needs. Other Perl modules for thesauri are briefly discussed below in the "SEE ALSO" section.

Side note: another motivation for writing this distribution was to experiment with Moose meta-programming possibilities. Subclasses of Lingua::Thesaurus::Term are created dynamically to implement the relation methods NT, BT, etc.; see the Lingua::Thesaurus::Storage source code.

Caveat: at the moment, IO classes only implement loading and searching; methods for editing and dumping a thesaurus will be added in a future version.

METHODS

new

my $thesaurus = Lingua::Thesaurus->new($storage_class => @storage_args);

Instantiates a thesaurus on a given storage. The $storage_class will be automatically prefixed by Lingua::Thesaurus::Storage::, unless the class name starts with '+'. The remaining arguments are passed to the storage class. Since Lingua::Thesaurus::Storage::SQLite is the default storage class supplied with this distribution, thesauri are usually opened as:

my $dbname = '/path/to/some/file.sqlite';
my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);
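The prefixing rule can be sketched in plain Perl as follows. This is only an illustration: the helper name resolve_class is hypothetical and not part of the distribution's API, and the assumption that the leading '+' is stripped from the final class name follows a common Perl convention rather than anything stated here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Thesaurus): expands a short
# class name under a namespace, unless the name starts with '+'.
sub resolve_class {
  my ($name, $namespace) = @_;

  # A leading '+' marks a fully qualified class name; strip the marker
  # (assumed behaviour) and return the rest untouched.
  return $name if $name =~ s/^\+//;

  return $namespace . '::' . $name;
}

print resolve_class('SQLite',       'Lingua::Thesaurus::Storage'), "\n";
print resolve_class('+My::Storage', 'Lingua::Thesaurus::Storage'), "\n";
```

Under these assumptions, the first call yields Lingua::Thesaurus::Storage::SQLite and the second yields My::Storage (a made-up class name used only for illustration).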

load

$thesaurus->load($io_class => @files);
$thesaurus->load($io_class => {$origin1 => $file1, ...});
$thesaurus->load($io_class => {files => \@files,
                               params  => {termClass    => ..,
                                           relTypeClass => ..}});

Populates a thesaurus database with data from thesauri dumpfiles. The job of parsing these files is delegated to some IO subclass, given as first argument. The $io_class will be automatically prefixed by Lingua::Thesaurus::IO::, unless the class name starts with '+'. The remaining arguments are passed to the IO class; the simplest form is just a list of dumpfiles, or a hashref of pairs {$origin1 => $dumpfile1, ...}. Each $origin is a string for tagging terms coming from that dumpfile; when querying the thesaurus, origins can be retrieved from $term->origin. See IO subclasses in the Lingua::Thesaurus::IO namespace for more details.

search_terms

my @terms = $thesaurus->search_terms($pattern, $origin);

Searches the term database according to $pattern, where the pattern may contain '*' to mean word completion.

The interpretation of patterns depends on the storage engine; by default, this is implemented using SQLite's "LIKE" function (see http://www.sqlite.org/lang_expr.html#like). Characters '*' in the pattern are translated into '%' for the LIKE function to work as expected.
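A minimal sketch of that translation in plain Perl. The helper name like_pattern is hypothetical; the actual storage class may handle additional cases (for instance a literal '%' or '_' in the pattern), which this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a user pattern into a LIKE pattern
# by replacing every '*' with '%'.
sub like_pattern {
  my ($pattern) = @_;
  (my $like = $pattern) =~ tr/*/%/;   # copy, then transliterate the copy
  return $like;
}

print like_pattern('*foo*'), "\n";   # prints %foo%
print like_pattern('foo*'),  "\n";   # prints foo%
```

The resulting string would then typically be passed as a bind value to a query of the form "... WHERE some_column LIKE ?" (the column name is left unspecified here on purpose).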

It is also possible to configure the storage to use fulltext searches, so that a pattern such as 'sci*' would also match 'computer science'; see "use_fulltext" in Lingua::Thesaurus::Storage::SQLite.
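The difference between the two modes can be illustrated with a small emulation in plain Perl. Both helpers below are hypothetical and use regular expressions instead of SQLite; they only mimic the matching behaviour described above (whole-term matching for LIKE, word-prefix matching for fulltext) and say nothing about how the storage class is actually implemented.

```perl
use strict;
use warnings;

# Emulates LIKE on the whole term: '*' matches any run of characters,
# everything else matches literally, anchored at both ends.
sub matches_whole_term {
  my ($pattern, $term) = @_;
  my $regex = join '.*', map { quotemeta($_) } split /\*/, $pattern, -1;
  return $term =~ /\A$regex\z/i ? 1 : 0;
}

# Emulates a fulltext prefix query: the pattern (minus its trailing '*')
# may match at the start of any word in the term.
sub matches_any_word {
  my ($pattern, $term) = @_;
  (my $prefix = $pattern) =~ s/\*\z//;
  return $term =~ /\b\Q$prefix\E/i ? 1 : 0;
}

printf "%d %d\n",
  matches_whole_term('sci*', 'computer science'),
  matches_any_word('sci*',   'computer science');   # prints "0 1"
```

In this emulation 'sci*' fails against 'computer science' as a whole term, because the term does not start with 'sci', but succeeds when the pattern is allowed to match the start of any word.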

If $pattern is empty, the method returns the list of all terms in the thesaurus.

The second argument $origin is optional; it may be used to restrict the search to terms loaded from one specific origin.

Results are instances of Lingua::Thesaurus::Term.

fetch_term

my $term = $thesaurus->fetch_term($term_string, $origin);

Retrieves a specific term and returns an instance of Lingua::Thesaurus::Term (or undef if the term is unknown). The second argument $origin is optional.

rel_types

Returns the list of ids of relation types stored in this thesaurus (i.e. 'NT', 'RT', etc.).

fetch_rel_type

my $rel_type = $thesaurus->fetch_rel_type($rel_type_id);

Returns the Lingua::Thesaurus::RelType object corresponding to $rel_type_id.

storage

Returns the internal object playing the Lingua::Thesaurus::Storage role.

FURTHER DOCUMENTATION

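More details can be found in the documentation of the implementation classes named above, such as Lingua::Thesaurus::Term, Lingua::Thesaurus::RelType, Lingua::Thesaurus::Storage::SQLite and the classes under the Lingua::Thesaurus::IO namespace.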

SEE ALSO

Here is a brief review of some other thesaurus modules on CPAN:

  • Thesaurus has several backend implementations (CSV, BerkeleyDB, DBI), but it just handles synonyms (a single relation between terms).

  • Text::Thesaurus::ISO is quite old (1998), uses obsolete technology (dbmopen), and has a fixed number of relations, some of which are apparently targeted to the specific needs of UK electronic libraries.

  • Biblio::Thesaurus has a rich set of features, not only for reading and searching, but also for editing and exporting a thesaurus. Storage is directly in hashes in memory; those can be saved into files in Storable format. The set of relations is flexible; it is read from the ISO dumpfiles. If it fits your needs directly, it's probably a good choice; but adapting or extending it is not straightforward, because all features are mingled into one monolithic module.

  • Biblio::Thesaurus::SQLite has an unclear status: it sits in the same namespace as Biblio::Thesaurus, and actually refers to it in the source code, but doesn't inherit from it or call it. A separate API is provided for storing some thesaurus data into an SQLite database; but the full features of Biblio::Thesaurus are absent.

AUTHOR

Laurent Dami, <dami at cpan.org>

BUGS

Please report any bugs or feature requests to bug-lingua-thesaurus at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Thesaurus. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Lingua::Thesaurus

You can also look for information at:

LICENSE AND COPYRIGHT

Copyright 2013 Laurent Dami.

This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0

The test suite contains a short excerpt from the Swiss Jurivoc thesaurus, copyright 1999-2012 Tribunal fédéral Suisse (see http://www.bger.ch/fr/index/juridiction/jurisdiction-inherit-template/jurisdiction-jurivoc-home.htm).

TODO

Thesaurus

- support for multiple thesauri files (a term belongs to one-to-many
  thesaurus files; a relation belongs to exactly one thesaurus file)

SQLite

- use_unaccent without fulltext ==> use collation sequence or redefine LIKE
- store thesaurus name for each term
   => adapt search_terms($pattern, $thes_name);