NAME
Lingua::YALI::Examples - Examples of usage.
VERSION
version 0.010_02
Introduction
Preparation
Download training and test data
# download data
for i in `seq 1 20`; do
id=`printf "%02d" $i`;
echo "Processing document $id";
lynx --dump 'http://en.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > eng.$id.txt;
lynx --dump 'http://cs.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > ces.$id.txt;
lynx --dump 'http://fr.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > fra.$id.txt;
done;
# prepare training data
ls ces.* | head -n15 > list.ces.train;
ls eng.* | head -n15 > list.eng.train;
ls fra.* | head -n15 > list.fra.train;
# prepare testing data
ls ces.* | tail -n5 > list.ces.test;
ls eng.* | tail -n5 > list.eng.test;
ls fra.* | tail -n5 > list.fra.test;
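The split above is clean only when each language has at least 20 files; with fewer, `head -n15` and `tail -n5` would select some of the same names. A minimal sketch of a check for this, run on empty placeholder files in a scratch directory (the directory and the `demo.*` names are illustrative and not part of the original setup):

```shell
# Sanity check of the train/test split on placeholder files (illustrative only)
workdir=$(mktemp -d)

# Create 20 empty placeholder documents: demo.01.txt ... demo.20.txt
for i in $(seq 1 20); do
    id=$(printf "%02d" "$i")
    : > "$workdir/demo.$id.txt"
done

# Same split as above: first 15 names for training, last 5 for testing
ls "$workdir"/demo.*.txt | head -n15 > "$workdir/list.demo.train"
ls "$workdir"/demo.*.txt | tail -n5 > "$workdir/list.demo.test"

# Count names that occur in both lists; 0 means the split is clean
overlap=$(sort "$workdir/list.demo.train" "$workdir/list.demo.test" | uniq -d | wc -l | tr -d ' ')
train_count=$(wc -l < "$workdir/list.demo.train" | tr -d ' ')
test_count=$(wc -l < "$workdir/list.demo.test" | tr -d ' ')
echo "train=$train_count test=$test_count overlap=$overlap"
```

The same overlap count can be computed on the real lists, for example by passing `list.ces.train` and `list.ces.test` to `sort` in place of the demo lists.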
Scripts
This section shows how to use the scripts yali-builder, yali-identifier, and yali-language-identifier.
Language Identification with Pretrained Models
# language identification for Czech files
yali-language-identifier -l="eng ces fra" -filelist=list.ces.test
# language identification for English files with a different output format
yali-language-identifier -l="eng ces fra" -filelist=list.eng.test -f=all_p
# language identification for French files, file list read from STDIN
cat list.fra.test | yali-language-identifier -l="eng ces fra" -filelist=- -f=tabbed
# single file
yali-language-identifier -l="eng ces fra" -i=ces.20.txt -f=all
# single file read from STDIN
cat eng.20.txt | yali-language-identifier -l="eng ces fra" -i=- -f=all_p
# single file read from STDIN, with -i omitted
cat fra.20.txt | yali-language-identifier -l="eng ces fra" -f=all_p
Building Own Models
# Czech bigram model with only the 5 most frequent bigrams stored
yali-builder --filelist=list.ces.train -n=2 -c=5 -o model.2.5.ces.gz
# English bigram model with only the 5 most frequent bigrams stored, file list read from STDIN
cat list.eng.train | yali-builder --filelist=- -n=2 -c=5 -o model.2.5.eng.gz
# French bigram model with only the 5 most frequent bigrams stored, training text read from STDIN
cat list.fra.train | xargs cat | yali-builder -i=- -n=2 -c=5 -o model.2.5.fra.gz
# create list with models
echo -e "ces\tmodel.2.5.ces.gz" > list.models.2
echo -e "eng\tmodel.2.5.eng.gz" >> list.models.2
echo -e "fra\tmodel.2.5.fra.gz" >> list.models.2
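`echo -e` is not portable: some shells print the `-e` literally, which would corrupt the tab-separated list. A sketch of the same list written with `printf`, followed by a check that every line has exactly two tab-separated fields. It writes to a scratch file rather than `list.models.2`; the scratch path and variable names are assumptions of this example.

```shell
# Write the model list with printf, whose handling of \t is defined by POSIX
list=$(mktemp)
printf '%s\t%s\n' ces model.2.5.ces.gz >  "$list"
printf '%s\t%s\n' eng model.2.5.eng.gz >> "$list"
printf '%s\t%s\n' fra model.2.5.fra.gz >> "$list"

# Count lines that do not consist of exactly two tab-separated fields
bad_lines=$(awk -F '\t' 'NF != 2 { n++ } END { print n + 0 }' "$list")

# Collect the class names from the first column on one line
classes=$(cut -f1 "$list" | paste -sd ' ' -)

echo "classes: $classes"
echo "malformed lines: $bad_lines"
```

A non-zero count of malformed lines usually means a tab was written as a literal `\t` or replaced by spaces.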
Language Identification with Own Models
Only two changes are required to the commands presented in the section "Language Identification with Pretrained Models".
Change yali-language-identifier to yali-identifier.
Change -l="eng ces fra" to -c=list.models.2.
# language identification for Czech files
yali-identifier -c=list.models.2 -filelist=list.ces.test
# language identification for English files with a different output format
yali-identifier -c=list.models.2 -filelist=list.eng.test -f=all_p
# language identification for French files, file list read from STDIN
cat list.fra.test | yali-identifier -c=list.models.2 -filelist=- -f=tabbed
# single file
yali-identifier -c=list.models.2 -i=ces.20.txt -f=all
# single file read from STDIN
cat eng.20.txt | yali-identifier -c=list.models.2 -i=- -f=all_p
# single file read from STDIN, with -i omitted
cat fra.20.txt | yali-identifier -c=list.models.2 -f=all_p
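Because the two command sets differ only in the two substitutions named at the start of this section, an existing command line can be converted mechanically. A minimal sketch using `sed` on a command held as text (the variable names are illustrative; the command is only printed, not run):

```shell
# A command from the pretrained-model section, held as plain text
pretrained='yali-language-identifier -l="eng ces fra" -filelist=list.ces.test'

# Apply the two documented changes: the tool name, then -l="..." to -c=list.models.2
own=$(printf '%s\n' "$pretrained" |
    sed -e 's/^yali-language-identifier/yali-identifier/' \
        -e 's/-l="[^"]*"/-c=list.models.2/')

echo "$own"
```

The printed line matches the first `yali-identifier` command shown above.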
Modules
Language Identification
use Lingua::YALI::LanguageIdentifier;
# create identifier and register languages
my $identifier = Lingua::YALI::LanguageIdentifier->new();
$identifier->add_language("ces", "eng");
# identify string
my $result = $identifier->identify_string("CPAN, the Comprehensive Perl Archive Network, is an archive of modules written in Perl.");
print "The most probable language is " . $result->[0]->[0] . ".\n";
# prints out: The most probable language is eng.
Training Models
use Lingua::YALI::Builder;
use Lingua::YALI::Identifier;
# create models
my $builder_a = Lingua::YALI::Builder->new(ngrams=>[2]);
$builder_a->train_string("aaaaa aaaa aaa aaa aaa aaaaa aa");
$builder_a->store("model_a.2_all.gz", 2);
my $builder_b = Lingua::YALI::Builder->new(ngrams=>[2]);
$builder_b->train_string("bbbbbb bbbb bbbb bbb bbbb bbbb bbb");
$builder_b->store("model_b.2_all.gz", 2);
# create identifier and load models
my $identifier = Lingua::YALI::Identifier->new();
$identifier->add_class("a", "model_a.2_all.gz");
$identifier->add_class("b", "model_b.2_all.gz");
# identify strings
my $result1 = $identifier->identify_string("aaaaaaaaaaaaaaaaaaa");
print $result1->[0]->[0] . "\t" . $result1->[0]->[1] . "\n";
# prints out: a 1
my $result2 = $identifier->identify_string("bbbbbbbbbbbbbbbbbbb");
print $result2->[0]->[0] . "\t" . $result2->[0]->[1] . "\n";
# prints out: b 1
AUTHOR
Martin Majlis <martin@majlis.cz>
COPYRIGHT AND LICENSE
This software is Copyright (c) 2012 by Martin Majlis.
This is free software, licensed under:
The (three-clause) BSD License