NAME
Lingua::Thesaurus - Thesaurus management
SYNOPSIS
Creating a thesaurus
my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);
$thesaurus->load($io_class => @files);
$thesaurus->load($io_class => {$origin1 => $file1, ...});
$thesaurus->load($io_class => {files => \@files,
params => {termClass => ..,
relTypeClass => ..}});
Using a thesaurus
my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);
my @terms = $thesaurus->search_terms('*foo*');
my $term = $thesaurus->fetch_term('foobar');
my $scope_note = $term->SN; # returns a string
my @synonyms = $term->UF; # returns a list of other terms
foreach my $pair ($term->related(qw/NT RT/)) {
my ($rel_type, $item) = @$pair;
printf " %s(%s) = %s\n", $rel_type->description, $rel_type->rel_id, $item;
}
# transitive search
foreach my $quadruple ($term->transitively_related(qw/NT/)) {
my ($rel_type, $related_term, $through_term, $level) = @$quadruple;
printf " %s(%d): %s (through %s)\n",
$rel_type->rel_id,
$level,
$related_term->string,
$through_term->string;
}
DESCRIPTION
This distribution manages thesauri. A thesaurus is a list of terms linked by relations such as "broader term" / "narrower term". Relations are either "internal" (between two terms) or "external" (between a term and some external data, for example a "Scope Note"). Relations may have a reciprocal; see Lingua::Thesaurus::RelType.
Thesauri are loaded from one or several IO formats; usually this will be the ISO 2788 format, or a derivative of it. See classes under the Lingua::Thesaurus::IO namespace for various implementations.
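The distinction between internal and external relations, and the role of reciprocals, can be sketched in plain Perl. This is only an illustration under stated assumptions: the hash layout, the keys is_external and reciprocal, the helper add_relation and the sample terms are all hypothetical and do not reflect how the distribution stores its data.

```perl
use strict;
use warnings;

# Illustrative data only. Internal relation types link two terms and may
# name a reciprocal; external ones attach plain data to a term.
my %rel_types = (
    NT => { is_external => 0, reciprocal => 'BT' },  # narrower term
    BT => { is_external => 0, reciprocal => 'NT' },  # broader term
    RT => { is_external => 0, reciprocal => 'RT' },  # related term
    SN => { is_external => 1 },                      # scope note
);

my %relations;   # $relations{$term}{$rel_id} = [ related items ]

# Hypothetical helper: records a relation and, when applicable, its reciprocal.
sub add_relation {
    my ($term, $rel_id, $item) = @_;
    my $type = $rel_types{$rel_id} or die "unknown relation type: $rel_id";
    push @{ $relations{$term}{$rel_id} }, $item;
    # An internal relation with a reciprocal is also recorded the other way.
    if (!$type->{is_external} && $type->{reciprocal}) {
        push @{ $relations{$item}{ $type->{reciprocal} } }, $term;
    }
}

add_relation('law', NT => 'civil law');
add_relation('law', SN => 'Rules enforced by a state');

print "civil law BT: @{ $relations{'civil law'}{BT} }\n";   # civil law BT: law
```

The scope note is stored as plain data on 'law' only; no entry is created for it as a term, which is what makes the relation "external" in this sketch.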
Once loaded, thesauri are stored via a storage class; this is meant to be an efficient internal structure for supporting searches. Currently, only Lingua::Thesaurus::Storage::SQLite is implemented; but the architecture allows for other storage classes to be defined, as long as they comply with the Lingua::Thesaurus::Storage role.
Terms are retrieved through the "search_terms" and "fetch_term" methods. The results are instances of Lingua::Thesaurus::Term; these objects have navigation methods for retrieving related terms.
This distribution was originally written for the Swiss thesaurus for justice "Jurivoc" (see Lingua::Thesaurus::IO::Jurivoc). However, the framework should be easily extensible to other needs. Other Perl modules for thesauri are briefly discussed below in the "SEE ALSO" section.
Side note: another motivation for writing this distribution was to experiment with Moose meta-programming possibilities. Subclasses of Lingua::Thesaurus::Term are created dynamically to implement relation methods such as NT, BT, etc.; see the Lingua::Thesaurus::Storage source code.
Caveat: at the moment, IO classes only implement loading and searching; methods for editing and dumping a thesaurus will be added in a future version.
METHODS
new
my $thesaurus = Lingua::Thesaurus->new($storage_class => @storage_args);
Instantiates a thesaurus on a given storage. The $storage_class will be automatically prefixed by Lingua::Thesaurus::Storage::, unless the class name starts with '+'. The remaining arguments are passed to the storage class. Since Lingua::Thesaurus::Storage::SQLite is the default storage class supplied with this distribution, thesauri are usually opened as
my $dbname = '/path/to/some/file.sqlite';
my $thesaurus = Lingua::Thesaurus->new(SQLite => $dbname);
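The prefixing rule described above can be sketched as a small standalone function. The helper resolve_class and the class name My::Storage are hypothetical, and the sketch assumes that the leading '+' is removed from a fully qualified name; the distribution's own code may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the distribution: resolves a short
# class name against a default namespace. Assumption: a leading '+'
# marks a fully qualified name and is removed from the result.
sub resolve_class {
    my ($namespace, $name) = @_;
    return $name =~ s/^\+// ? $name : "${namespace}::$name";
}

print resolve_class('Lingua::Thesaurus::Storage', 'SQLite'), "\n";
# Lingua::Thesaurus::Storage::SQLite
print resolve_class('Lingua::Thesaurus::Storage', '+My::Storage'), "\n";
# My::Storage
```

The same rule applies to the $io_class argument of "load", with Lingua::Thesaurus::IO as the default namespace.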
load
$thesaurus->load($io_class => @files);
$thesaurus->load($io_class => {$origin1 => $file1, ...});
$thesaurus->load($io_class => {files => \@files,
params => {termClass => ..,
relTypeClass => ..}});
Populates a thesaurus database with data from thesauri dumpfiles. The job of parsing these files is delegated to some IO subclass, given as first argument. The $io_class will be automatically prefixed by Lingua::Thesaurus::IO::, unless the class name starts with '+'. The remaining arguments are passed to the IO class; the simplest form is just a list of dumpfiles, or a hashref of pairs {$origin1 => $dumpfile1, ...}. Each $origin is a string for tagging terms coming from that dumpfile; while querying the thesaurus, origins can be retrieved from $term->origin. See IO subclasses in the Lingua::Thesaurus::IO namespace for more details.
search_terms
my @terms = $thesaurus->search_terms($pattern, $origin);
Searches the term database according to $pattern, where the pattern may contain '*' to mean word completion.
The interpretation of patterns depends on the storage engine; by default, this is implemented using SQLite's "LIKE" function (see http://www.sqlite.org/lang_expr.html#like). Characters '*' in the pattern are translated into '%' for the LIKE function to work as expected.
It is also possible to configure the storage to use fulltext searches, so that a pattern such as 'sci*' would also match 'computer science'; see "use_fulltext" in Lingua::Thesaurus::Storage::SQLite.
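The difference between the two matching modes can be illustrated without a database. The three helpers below are hypothetical and only emulate the behaviour in plain Perl: the first performs the '*' to '%' translation, the second approximates SQLite's LIKE semantics (whole-string match, ASCII case ignored), and the third approximates a fulltext prefix query by looking for a word that starts with the prefix. The real storage class delegates matching to SQLite and may behave differently in edge cases.

```perl
use strict;
use warnings;

# Hypothetical helpers that emulate the two matching modes in plain Perl.

# '*' in the user pattern becomes the LIKE wildcard '%'.
sub pattern_to_like {
    my ($pattern) = @_;
    (my $like = $pattern) =~ tr/*/%/;
    return $like;
}

# Rough emulation of LIKE: '%' matches any run of characters, '_' a single
# character, the whole string must match, ASCII case is ignored.
sub like_matches {
    my ($like, $string) = @_;
    my $re = join '',
             map { $_ eq '%' ? '.*' : $_ eq '_' ? '.' : quotemeta($_) }
             split //, $like;
    return $string =~ /\A$re\z/is ? 1 : 0;
}

# Rough emulation of a fulltext prefix query: some word starts with the prefix.
sub word_prefix_matches {
    my ($pattern, $string) = @_;
    (my $prefix = $pattern) =~ s/\*\z//;
    return $string =~ /\b\Q$prefix\E/i ? 1 : 0;
}

my $like = pattern_to_like('sci*');                           # 'sci%'
print like_matches($like, 'science'), "\n";                   # 1
print like_matches($like, 'computer science'), "\n";          # 0
print word_prefix_matches('sci*', 'computer science'), "\n";  # 1
```

With LIKE, 'sci*' only matches strings that begin with "sci"; a leading wildcard as in '*sci*' is needed to reach "computer science" unless fulltext search is enabled.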
If $pattern is empty, the method returns the list of all terms in the thesaurus.
The second argument $origin is optional; it may be used to restrict the search to terms loaded from one specific origin.
Results are instances of Lingua::Thesaurus::Term.
fetch_term
my $term = $thesaurus->fetch_term($term_string, $origin);
Retrieves a specific term and returns an instance of Lingua::Thesaurus::Term (or undef if the term is unknown). The second argument $origin is optional.
rel_types
Returns the list of ids of relation types stored in this thesaurus (i.e. 'NT', 'RT', etc.).
fetch_rel_type
my $rel_type = $thesaurus->fetch_rel_type($rel_type_id);
Returns the Lingua::Thesaurus::RelType object corresponding to $rel_type_id.
storage
Returns the internal object playing role Lingua::Thesaurus::Storage.
FURTHER DOCUMENTATION
More details can be found in the various implementation classes:
Lingua::Thesaurus::IO: Role for input/output operations on a thesaurus
Lingua::Thesaurus::IO::ISO2788: IO class for ISO thesauri (not implemented yet)
Lingua::Thesaurus::IO::Jurivoc: IO class for "Jurivoc", the Swiss thesaurus for justice
Lingua::Thesaurus::IO::LivelinkCollectionServer: IO class for Livelink Collection Server thesaurus files
Lingua::Thesaurus::RelType: Relation type in a thesaurus
Lingua::Thesaurus::Storage: Role for thesaurus storage
Lingua::Thesaurus::Storage::SQLite: Thesaurus storage in an SQLite database
Lingua::Thesaurus::Term: Parent class for thesaurus terms; in particular, this class implements methods for navigating through relations
SEE ALSO
Here is a brief review of some other thesaurus modules on CPAN:
Thesaurus has several backend implementations (CSV, BerkeleyDB, DBI), but it just handles synonyms (a single relation between terms).
Text::Thesaurus::ISO is quite old (1998), uses obsolete technology (dbmopen), and has a fixed number of relations, some of which are apparently targeted to the specific needs of UK electronic libraries.
Biblio::Thesaurus has a rich set of features, not only for reading and searching, but also for editing and exporting a thesaurus. Storage is directly in hashes in memory; those can be saved into files in Storable format. The set of relations is flexible; it is read from the ISO dumpfiles. If it fits your needs directly, it's probably a good choice; but adapting or extending it is not straightforward, because all features are mingled into one monolithic module.
Biblio::Thesaurus::SQLite has an unclear status: it sits in the same namespace as Biblio::Thesaurus, and actually references it in the source code, but neither inherits from it nor calls into it. A separate API is provided for storing some thesaurus data into an SQLite database; but the full features of Biblio::Thesaurus are absent.
AUTHOR
Laurent Dami, <dami at cpan.org>
BUGS
Please report any bugs or feature requests to bug-lingua-thesaurus at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Thesaurus. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Lingua::Thesaurus
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search MetaCPAN
LICENSE AND COPYRIGHT
Copyright 2013 Laurent Dami.
This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License (2.0). You may obtain a copy of the full license at:
http://www.perlfoundation.org/artistic_license_2_0
The test suite contains a short excerpt from the Swiss Jurivoc thesaurus, copyright 1999-2012 Tribunal fédéral Suisse (see http://www.bger.ch/fr/index/juridiction/jurisdiction-inherit-template/jurisdiction-jurivoc-home.htm).
TODO
Thesaurus
- support for multiple thesauri files (a term belongs to one-to-many
thesaurus files; a relation belongs to exactly one thesaurus file)
SQLite
- use_unaccent without fulltext ==> use collation sequence or redefine LIKE
- store thesaurus name for each term
=> adapt search_terms($pattern, $thes_name);