NAME
Lingua::Ident - Statistical language identification
SYNOPSIS
use Lingua::Ident;
$classifier = Lingua::Ident->new("filename 1", ..., "filename n");
$lang = $classifier->identify("text to classify");
$probabilities = $classifier->calculate("text to classify");
DESCRIPTION
This module implements a statistical language identifier based on the approach Ted Dunning described in his 1994 report Statistical Identification of Language.
METHODS
Lingua::Ident->new($filename, ...)
Construct a new classifier. The filename arguments to the constructor must refer to files containing tables of n-gram probabilities for languages (language models). These tables can be generated using the trainlid(1) utility program.
$classifier->identify($string)
Identify the language of a text given in $string. The identify() method returns the value specified in the _LANG field of the probabilities table of the language in which the text is most likely written (see "WARNINGS" below).
Internally, the identify() method calls the calculate() method.
$classifier->calculate($string)
Calculate the probabilities for a text to be in the languages known to the classifier. This method returns a reference to an array. The array represents a table of languages and the probability for each language. Each array element is a reference to an array containing two elements: the language name and the associated probability. For example, you may get something like this:
[['de.iso-8859-1', -317.980835274509],
['en.iso-8859-1', -450.804230119916], ...]
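The elements are sorted in descending order by probability, so the most likely language comes first. As the example shows, the values are logarithms and therefore negative; a value closer to zero indicates a better match. You can use this data to assess the reliability of the categorization and make your own decision using application-specific metrics.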
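One simple application-specific metric is the gap between the best and the second-best score. The following is a minimal sketch, not part of Lingua::Ident: the helper best_with_margin() is a hypothetical name, and the table is hard-coded from the example above so that the snippet runs without any language models.

```perl
use strict;
use warnings;

# Table in the shape returned by calculate(): [language, score] pairs,
# best first. The values are copied from the example in this document.
my $scores = [
    ['de.iso-8859-1', -317.980835274509],
    ['en.iso-8859-1', -450.804230119916],
];

# Hypothetical helper (not provided by Lingua::Ident): return the best
# language and its lead over the runner-up.
sub best_with_margin {
    my ($table) = @_;
    return unless @$table;
    my ($lang, $top) = @{ $table->[0] };
    # With a single model there is no runner-up to compare against.
    my $margin = @$table > 1 ? $top - $table->[1][1] : undef;
    return ($lang, $margin);
}

my ($lang, $margin) = best_with_margin($scores);
printf "%s leads by %.2f\n", $lang, $margin;
```

A small margin suggests that the two top candidates are hard to tell apart for this text. What counts as "small" depends on the text length and on the models, so any cutoff has to be chosen empirically.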
When neither a trigram nor a bigram is found, the calculation deviates slightly from the formula given by Dunning (1994). According to Dunning's formula, one would estimate the probability as:
p = log(1/#alph)
where #alph is the size of the alphabet of a particular language. This penalizes different language models with different values because the alphabet sizes of the languages differ.
However, the alphabet is much larger for Asian languages than for European languages. For example, for the sample data in the Lingua::Ident distribution, trainlid(1) reports #alph = 127 for zh.big5 vs. #alph = 31 for de.iso-8859-1. This means that Asian languages are penalized much more heavily than European languages when an estimate must be made.
To use the same penalty for all languages, calculate() now uses the average of all alphabet sizes instead.
NOTE: This has only been lightly tested so far; feedback is welcome.
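To make the size of the effect concrete, the following sketch compares the per-language penalties with a shared one for the two alphabet sizes quoted above. It is an illustration only and makes two assumptions: that log is the natural logarithm (Perl's log), and that the averaged alphabet size is plugged into the same formula. The variable names are invented for this example and do not reflect the module's internals.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Alphabet sizes reported by trainlid(1) for the sample data quoted above.
my %alph = (
    'zh.big5'       => 127,
    'de.iso-8859-1' => 31,
);

# Per-language penalty following Dunning's formula: log(1/#alph).
my %per_language = map { $_ => log(1 / $alph{$_}) } keys %alph;

# Shared penalty (assumption): the same formula applied to the average
# of all alphabet sizes.
my $average = sum(values %alph) / scalar(keys %alph);
my $shared  = log(1 / $average);

printf "%-14s %.3f\n", $_, $per_language{$_} for sort keys %per_language;
printf "%-14s %.3f\n", 'shared', $shared;
```

With only these two models the average alphabet size is 79, so the shared penalty lies between the two per-language values. With more models loaded, the average and therefore the penalty would differ.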
WARNINGS
Since Lingua::Ident is based on statistics, it cannot be 100% accurate. More precisely, Dunning (see "SEE ALSO" below) reports that his implementation achieves 92% accuracy with 50 KB of training text for 20-character strings when discriminating between English and Spanish. This implementation should be as accurate as Dunning's. However, not only the size but also the quality of the training text plays a role.
The current implementation doesn't use a threshold to determine whether the most probable language has a high enough probability. If you try to classify a text in a language for which there is no probability table, identify() still returns one of the known languages, which is necessarily wrong.
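If your application needs to reject such texts, one possible workaround is to apply your own threshold to the table returned by calculate(). The sketch below is not part of Lingua::Ident. It assumes that dividing the best score by the text length gives a roughly length-independent value, and the cutoff of -4.0 is an arbitrary placeholder that would have to be calibrated against your own models. The helper identify_or_reject() and the scores are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical cutoff on the per-character score; it is NOT part of
# Lingua::Ident and would need calibrating against your own models.
my $MIN_SCORE_PER_CHAR = -4.0;

# $table has the shape returned by calculate(): [language, score] pairs,
# best first. Returns the best language, or undef if it scores too low.
sub identify_or_reject {
    my ($table, $text) = @_;
    return undef unless @$table && length($text);
    my ($lang, $score) = @{ $table->[0] };
    return $score / length($text) >= $MIN_SCORE_PER_CHAR ? $lang : undef;
}

# Made-up scores for a 20-character string.
my $text = 'x' x 20;
for my $score (-55.0, -95.0) {
    my $lang  = identify_or_reject([['en.iso-8859-1', $score]], $text);
    my $label = defined $lang ? $lang : 'unknown';
    print "$score => $label\n";
}
```

In real use, $table would come from $classifier->calculate($text). A margin between the two best scores could be combined with this cutoff, since a text in an unknown language may still score well under a closely related model.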
AUTHOR
Lingua::Ident was developed by Michael Piotrowski <mxp@dynalabs.de>.
LICENSE
This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.
SEE ALSO
Dunning, Ted (1994). Statistical Identification of Language. Technical report CRL MCCS-94-273. Computing Research Lab, New Mexico State University.