NAME

Lingua::YALI::Examples - Examples of usages.

VERSION

version 0.010_02

Introduction

Preparation

Download training and test data

# download data
for i in `seq 1 20`; do
    id=`printf "%02d" $i`;
    echo "Processing document $id";
    lynx --dump 'http://en.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > eng.$id.txt;
    lynx --dump 'http://cs.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > ces.$id.txt;
    lynx --dump 'http://fr.wikipedia.org/wiki/Special:Random' -noprint --nolist --nonumbers --nomargins -width=10000 > fra.$id.txt;
done;

# prepare training data
ls ces.* | head -n15 > list.ces.train;
ls eng.* | head -n15 > list.eng.train;
ls fra.* | head -n15 > list.fra.train;

# prepare testing data
ls ces.* | tail -n5 > list.ces.test;
ls eng.* | tail -n5 > list.eng.test;
ls fra.* | tail -n5 > list.fra.test;
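
The 15/5 split above relies on ls sorting the zero-padded file names, so that head and tail select disjoint sets of files. The following is a minimal sketch of a sanity check for that assumption. It runs in a temporary directory with empty placeholder files instead of downloaded pages; the variable names are illustrative and not part of YALI.

```shell
# Sanity check for the train/test split; uses empty placeholder files
# in a temporary directory instead of downloaded Wikipedia pages.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# create 20 placeholder documents named ces.01.txt .. ces.20.txt
for i in $(seq 1 20); do
    id=$(printf "%02d" "$i")
    : > "ces.$id.txt"
done

# same split as above: first 15 files for training, last 5 for testing
ls ces.*.txt | head -n 15 > list.ces.train
ls ces.*.txt | tail -n 5 > list.ces.test

# count the entries in each list and the entries present in both lists
train_count=$(wc -l < list.ces.train | tr -d ' ')
test_count=$(wc -l < list.ces.test | tr -d ' ')
overlap=$(sort list.ces.train list.ces.test | uniq -d | wc -l | tr -d ' ')

echo "train=$train_count test=$test_count overlap=$overlap"
```

An overlap of 0 means no test document was also used for training.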

Scripts

This section shows how to use the scripts yali-builder, yali-identifier, and yali-language-identifier.

Language Identification with Pretrained Models

# language identification for Czech files
yali-language-identifier -l="eng ces fra" -filelist=list.ces.test

# language identification for English files with a different output format
yali-language-identifier -l="eng ces fra" -filelist=list.eng.test -f=all_p

# language identification for French files, with the file list read from STDIN
cat list.fra.test | yali-language-identifier -l="eng ces fra" -filelist=- -f=tabbed

# single file
yali-language-identifier -l="eng ces fra" -i=ces.20.txt -f=all

# single file read from STDIN
cat eng.20.txt | yali-language-identifier -l="eng ces fra" -i=- -f=all_p

# single file read from STDIN (no -i option given)
cat fra.20.txt | yali-language-identifier -l="eng ces fra" -f=all_p

Building Own Models

# Czech bigram model with only the 5 most frequent bigrams stored
yali-builder --filelist=list.ces.train -n=2 -c=5 -o model.2.5.ces.gz

# English bigram model with only the 5 most frequent bigrams stored,
# with the file list read from STDIN
cat list.eng.train | yali-builder --filelist=- -n=2 -c=5 -o model.2.5.eng.gz

# French bigram model with only the 5 most frequent bigrams stored,
# with the training text itself read from STDIN
cat list.fra.train | xargs cat | yali-builder -i=- -n=2 -c=5 -o model.2.5.fra.gz

# create list with models
echo -e "ces\tmodel.2.5.ces.gz" > list.models.2
echo -e "eng\tmodel.2.5.eng.gz" >> list.models.2
echo -e "fra\tmodel.2.5.fra.gz" >> list.models.2
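
echo -e is not portable: some shells (for example dash when used as /bin/sh) print the -e literally instead of expanding \t. The following sketch writes the same list with printf, which expands \t consistently. It assumes the list is tab-separated, as in the echo commands above; the models_list variable is illustrative.

```shell
# Write the tab-separated list of models with printf instead of echo -e.
models_list=list.models.2

for lang in ces eng fra; do
    # one line per class: <label><TAB><model file>
    printf '%s\t%s\n' "$lang" "model.2.5.$lang.gz"
done > "$models_list"

cat "$models_list"
```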

Language Identification with Own Models

Only two changes are required to the commands presented in the section "Language Identification with Pretrained Models".

  • Change yali-language-identifier to yali-identifier.

  • Change -l="eng ces fra" to -c=list.models.2.
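
The two substitutions can be applied mechanically. The following is a minimal sketch that uses sed to rewrite one of the earlier commands; the to_own_models function name is illustrative and not part of YALI.

```shell
# Rewrite a pretrained-model command into its own-model equivalent.
to_own_models() {
    sed -e 's/yali-language-identifier/yali-identifier/' \
        -e 's/-l="eng ces fra"/-c=list.models.2/'
}

pretrained='yali-language-identifier -l="eng ces fra" -filelist=list.ces.test'
converted=$(printf '%s\n' "$pretrained" | to_own_models)

echo "$converted"
# prints: yali-identifier -c=list.models.2 -filelist=list.ces.test
```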

# language identification for Czech files
yali-identifier -c=list.models.2 -filelist=list.ces.test

# language identification for English files with a different output format
yali-identifier -c=list.models.2 -filelist=list.eng.test -f=all_p

# language identification for French files, with the file list read from STDIN
cat list.fra.test | yali-identifier -c=list.models.2 -filelist=- -f=tabbed

# single file
yali-identifier -c=list.models.2 -i=ces.20.txt -f=all

# single file read from STDIN
cat eng.20.txt | yali-identifier -c=list.models.2 -i=- -f=all_p

# single file read from STDIN (no -i option given)
cat fra.20.txt | yali-identifier -c=list.models.2 -f=all_p

Modules

Language Identification

use Lingua::YALI::LanguageIdentifier;

# create identifier and register languages
my $identifier = Lingua::YALI::LanguageIdentifier->new();
$identifier->add_language("ces", "eng");

# identify string
my $result = $identifier->identify_string("CPAN, the Comprehensive Perl Archive Network, is an archive of modules written in Perl.");
print "The most probable language is " . $result->[0]->[0] . ".\n";
# prints out The most probable language is eng.

Training models

use Lingua::YALI::Builder;
use Lingua::YALI::Identifier;

# create models
my $builder_a = Lingua::YALI::Builder->new(ngrams=>[2]);
$builder_a->train_string("aaaaa aaaa aaa aaa aaa aaaaa aa");
$builder_a->store("model_a.2_all.gz", 2);

my $builder_b = Lingua::YALI::Builder->new(ngrams=>[2]);
$builder_b->train_string("bbbbbb bbbb bbbb bbb bbbb bbbb bbb");
$builder_b->store("model_b.2_all.gz", 2);

# create identifier and load models
my $identifier = Lingua::YALI::Identifier->new();
$identifier->add_class("a", "model_a.2_all.gz");
$identifier->add_class("b", "model_b.2_all.gz");

# identify strings
my $result1 = $identifier->identify_string("aaaaaaaaaaaaaaaaaaa");
print $result1->[0]->[0] . "\t" . $result1->[0]->[1];
# prints out a 1

my $result2 = $identifier->identify_string("bbbbbbbbbbbbbbbbbbb");
print $result2->[0]->[0] . "\t" . $result2->[0]->[1];
# prints out b 1

AUTHOR

Martin Majlis <martin@majlis.cz>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2012 by Martin Majlis.

This is free software, licensed under:

The (three-clause) BSD License