NAME

Lingua::Lid - Interface to the language and encoding identifier "lid"

SYNOPSIS

use Lingua::Lid qw/:all/;
 
# Identify the language and character encoding of...
 
# ...a string
$result = lid_fstr("This is a short English sentence.");
 
# ...a plain text file
$result = lid_ffile("/path/to/a/file.txt");
 
print "Lingua::Lid v$Lingua::Lid::VERSION, using lid v",
    lid_version(), "\n";

DESCRIPTION

The Perl extension Lingua::Lid provides a Perl interface to Lingua-Systems' language and character encoding identification library lid, which is required to build and use this extension.

The interface is implemented in XS and makes the functionality of the lid C library available to Perl applications and modules in a simple-to-use way.

This man page covers the usage of the Lingua::Lid Perl extension only. For more information on lid and a list of supported languages and character encodings, have a look at its manual, which is both included in its distribution and freely available at http://www.lingua-systems.com/language-identifier/lid-library/.

Lingua::Lid aims to stick to the C interface as closely as is reasonable, while respecting common Perl conventions. Have a look at "COMPARISON TO THE C INTERFACE" for details.

EXPORTS

No symbols are exported by default.

Each function needed must be requested for import explicitly. Alternatively, the export tag :all imports the symbols of all provided functions:

use Lingua::Lid qw/lid_ffile lid_fstr/; # or
use Lingua::Lid qw/:all/;

FUNCTIONS

lid_fstr( $string )

Mnemonic: "Language and encoding identification... from string"

This function takes a $string as an argument and identifies its language and encoding. It returns a hash reference containing the results. See IDENTIFICATION RESULTS DATA STRUCTURE for details.

If an error occurs, the function returns undef and sets $Lingua::Lid::errstr to an appropriate message describing the error.

lid_ffile( $file )

Mnemonic: "Language and encoding identification... from file"

This function takes the path of a plain text file, $file, as an argument and identifies the file's language and encoding. It returns a hash reference containing the results. See IDENTIFICATION RESULTS DATA STRUCTURE for details.

If an error occurs, the function returns undef and sets $Lingua::Lid::errstr to an appropriate message describing the error.

lid_version( )

This function returns the version of the underlying lid C library.

IDENTIFICATION RESULTS DATA STRUCTURE

The functions lid_fstr() and lid_ffile() return a hash reference containing the results of the language and encoding identification.

The hash reference contains the following keys:

language

The language's name (in English), e.g. "German", "French", "English".

isocode

The language's ISO 639-3 code, e.g. "deu", "fra", "eng".

encoding

The character encoding, e.g. "UTF-8", "ISO-8859-1", "UTF-32BE".

$result = {
              'language'  =>  'English',
              'isocode'   =>  'eng',
              'encoding'  =>  'ASCII'
          };
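As a minimal sketch of how the three documented keys might be consumed, the snippet below formats a result into a one-line summary. The helper name describe_result is hypothetical and not part of Lingua::Lid, and a hand-written hash reference stands in for a real return value so the snippet runs without the lid library installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Lid): turn a result hash
# reference into a one-line summary.
sub describe_result {
    my ($result) = @_;

    # Hash slice: fetch the three documented keys in a fixed order.
    return sprintf "%s (%s), %s",
        @{$result}{qw/language isocode encoding/};
}

# Assumption: this literal mimics what lid_fstr() or lid_ffile() returns
# on success; it is not produced by the library here.
my $result = {
    language => 'English',
    isocode  => 'eng',
    encoding => 'ASCII',
};

print describe_result($result), "\n";    # prints "English (eng), ASCII"
```

In real code, the hash reference would come from lid_fstr() or lid_ffile() and should be checked for undef before it is dereferenced.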

ERROR HANDLING

The functions lid_fstr() and lid_ffile() return undef if an error occurs and set Lingua::Lid's package variable $errstr ($Lingua::Lid::errstr) to an appropriate message describing the error.

Have a look at lid's manual for a list of all error messages.

NOTE:

The $Lingua::Lid::errstr variable is reset to undef whenever lid_fstr() or lid_ffile() is called.
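Because the error message does not survive the next identification call, it may be worth copying it as soon as a call fails. The following is one possible sketch: try_identify is a hypothetical helper, not part of Lingua::Lid, and it takes the identification function as a code reference so that the same wrapper can serve both lid_fstr() and lid_ffile().

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Lid): call an identification
# function and return the list ($result, $error). The error message is
# copied immediately, since the next lid_fstr()/lid_ffile() call resets
# $Lingua::Lid::errstr to undef.
sub try_identify {
    my ($identify, $input) = @_;

    my $result = $identify->($input);
    return ($result, undef) if defined $result;

    # The variable is only assigned inside Lingua::Lid, so silence the
    # "used only once" warning for this single read.
    no warnings 'once';
    return (undef, $Lingua::Lid::errstr);
}
```

With the module loaded, a call might look like my ($r, $err) = try_identify(\&Lingua::Lid::lid_fstr, "Too short."); the copied $err then stays valid even after further identification calls.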

COMPARISON TO THE C INTERFACE

Lingua::Lid's functions lid_fstr() and lid_ffile() behave exactly like their lid counterparts in C.

The C functions lid_fnstr() and lid_fwstr() are not needed. Use the Lingua::Lid function lid_fstr() in any Perl code instead.

The C function lid_strerror() and the global C variable lid_errno are not needed. Rather than returning a NULL pointer, Lingua::Lid's lid_fstr() and lid_ffile() return undef on errors and set $Lingua::Lid::errstr to an appropriate message describing the error.

The C define LID_VERSION is not available in Lingua::Lid. Use lid_version() instead.

Lingua::Lid's results data structure sticks to the C lid_t * structure as closely as possible. See "IDENTIFICATION RESULTS DATA STRUCTURE" above.

EXAMPLES

use strict;
use Lingua::Lid qw/lid_fstr lid_version/;
 
print "Lingua::Lid v$Lingua::Lid::VERSION, using lid v",
  lid_version(), "\n";
 
my @strings =
(
    "This is a short English sentence.",
    "Dies ist ein kurzer deutscher Satz.",
    "Too short."
);
 
foreach my $string (@strings)
{
    if (my $r = lid_fstr($string))
    {
        print join(" - ", $r->{language}, $r->{isocode},
                          $r->{encoding}), "\n";
    }
    else
    {
        print "lid_fstr() failed: $Lingua::Lid::errstr\n";
    }
}

The program above produces the following output:

Lingua::Lid v0.01, using lid v2.0.2
English - eng - ASCII
German - deu - ASCII
lid_fstr() failed: Insufficient input length

BUGS

None known.

Please report bugs either using CPAN's bug tracker or to <perl@lingua-systems.com>.

SEE ALSO

AUTHOR

Alex Linke, <alinke@lingua-systems.com>

COPYRIGHT AND LICENSE

Copyright (C) 2009 Lingua-Systems Software GmbH

This extension is free software. It may be used, redistributed and/or modified under the terms of the zlib license. For details, see the full text of the license in the file LICENSE.