NAME

Encode::Detect::Upload - Attempt to guess user's locale encoding from IP, HTTP_ACCEPT_LANGUAGE and HTTP_USER_AGENT

SYNOPSIS

use Encode::Detect::Upload;
my $detector = Encode::Detect::Upload->new();
# $text holds the raw contents of the uploaded file
# Feelin lucky!
my $charset = $detector->detect( text => $text );
# More sensible
my ( $charset_list, $meta ) = $detector->detect( text => $text );

DESCRIPTION

Dealing with input from globally dispersed users can be a real pain. Although browsers will often do the right thing when web forms are set to UTF-8, in some instances, such as text file uploads, you are stuck trying to figure out the file's charset encoding. Encode::Detect::Detector uses Mozilla's universal charset detector, which works great most of the time. But when it doesn't, you're stuck with asking the user, who all too often these days has very little technical knowledge and likely doesn't know what a charset is.

In my experience of dealing with such user uploads, the charset of the file usually relates to the user's OS, location and language settings. It's true that the file could have any encoding, and that it could have been created on a different machine, with a different locale from the one doing the upload. But by combining this module's techniques with those of Encode::Detect::Detector, more cases can be handled correctly. Methods for helping the user choose an encoding are also provided.

Methods

new()
new(\%params)
new(%params)

Returns a new detection object. Parameters may be passed either as key/value pairs or as a hash reference. The following parameters are recognised:

die_on_missing    Whether missing method parameters cause fatal errors (default: true)

get_os()
get_os($user_agent_string)

Extracts the operating system name from the supplied User-Agent header value, or from $ENV{HTTP_USER_AGENT} if not supplied. Dies if no user agent string is available. Returns Windows, Linux or Macintosh, or undef if no match was made.
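
As a rough illustration of this kind of lookup, the sketch below matches a few well-known User-Agent tokens. The guess_os function and its patterns are hypothetical and are not taken from the module's source; the real get_os() may recognise more variants.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode::Detect::Upload.
# Maps a User-Agent string to one of the three OS names used in this document.
sub guess_os {
    my ($user_agent) = @_;
    $user_agent = $ENV{HTTP_USER_AGENT} unless defined $user_agent;
    die "No user agent string available\n"
        unless defined $user_agent && length $user_agent;

    # The patterns below are assumptions chosen for illustration only
    return 'Windows'   if $user_agent =~ /Windows|Win32|Win64/i;
    return 'Macintosh' if $user_agent =~ /Macintosh|Mac OS X/i;
    return 'Linux'     if $user_agent =~ /Linux/i;
    return;    # no match: undef in scalar context
}

my $os = guess_os('Mozilla/5.0 (X11; Linux x86_64) Gecko/20100101');
print defined $os ? "$os\n" : "unknown\n";
```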

get_country()
get_country($ip_address)
get_country($ip_address,$geo_ip_data_filename)

Looks up the user's country from the supplied IP address, or $ENV{REMOTE_ADDR} by default. Dies if neither IP::Country nor Geo::IP is installed. Returns the two-character ISO country code.

get_country_lang($iso_2code)

Returns the language tag(s) associated with the supplied country code. In scalar context returns the primary language tag; in list context returns all associated language tags. Dies if the supplied country code is undefined. Returns undef if no matching country is found.

Language tags are defined in section 3.10 of RFC 2616, and can be 2 or 3 letters, optionally followed by a series of subtags, separated by dashes.

get_country_name($iso_2code)

Returns the name of the country specified by the supplied 2 letter code. Dies if no country is specified.

get_accept_lang()
get_accept_lang($accept_lang_string)

Returns the accepted language tag(s) described by the supplied Accept-Language header value, or from $ENV{HTTP_ACCEPT_LANGUAGE} if not supplied. Dies if no header value is available. In scalar context, returns the first language tag listed. In list context, returns all tags, in the order they are listed in the header value.
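
To make the scalar/list behaviour concrete, here is a minimal sketch of header parsing that keeps tags in header order and ignores q-values. The parse_accept_lang function is hypothetical and only approximates what is described above; it is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode::Detect::Upload.
# Splits an Accept-Language value into language tags, preserving header
# order and discarding any ";q=..." weights.
sub parse_accept_lang {
    my ($header) = @_;
    die "No Accept-Language value supplied\n"
        unless defined $header && length $header;

    my @tags;
    for my $item ( split /\s*,\s*/, $header ) {
        # A tag is a run of letters followed by optional dash-separated subtags
        my ($tag) = $item =~ /^\s*([A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*)/;
        push @tags, $tag if defined $tag;    # skips "*" and malformed items
    }
    return wantarray ? @tags : $tags[0];
}

my $header = 'en-GB,en;q=0.8,ru;q=0.6';
my @all    = parse_accept_lang($header);    # list context: every tag
my $first  = parse_accept_lang($header);    # scalar context: first tag only
print "first: $first; all: @all\n";
```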

get_lang_name($language_code)

Returns the name of the language specified by the supplied 2 or 3 letter ISO-639 language code. Dies if no language code is supplied.

get_lang_list($language_tag)

Returns the list of language tags which could be used for matching the supplied language tag. This will always include the supplied language tag. If the supplied tag includes a cyrl or latn subtag, or is a primary tag for which cyrl or latn subtags are available, all such subtags will be returned. If the supplied tag contains any subtags, the primary tag will also be returned. Dies if no language tag is supplied.

get_lang_charset($language_tag)
get_lang_charset($language_tag, $os_name)

Returns the charset(s) used by the supplied language. If an operating system name is supplied, treats its character sets preferentially. Dies if no language tag is supplied. In scalar context, returns the best matching charset. In list context, returns a list of all suitable charsets.

get_words($sample_string)
get_words($sample_string, $max_words)

Returns a list of unique words from the supplied sample string which contain non-ASCII characters. Returns no more than the specified maximum number of words, which defaults to 10. Dies if no sample text is supplied.
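
The idea can be sketched as follows, assuming the sample is an undecoded byte string and that words are separated by ASCII whitespace. The non_ascii_words function is hypothetical; the module's own word splitting may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode::Detect::Upload.
# Returns up to $max unique whitespace-separated words that contain at
# least one byte outside the ASCII range.
sub non_ascii_words {
    my ( $sample, $max ) = @_;
    die "No sample text supplied\n" unless defined $sample;
    $max = 10 unless defined $max;

    my ( %seen, @words );
    for my $word ( split /[ \t\r\n]+/, $sample ) {
        next unless $word =~ /[^\x00-\x7F]/;    # pure ASCII tells us nothing
        next if $seen{$word}++;                 # keep each word once
        push @words, $word;
        last if @words >= $max;
    }
    return @words;
}

# Latin-1 bytes for "café latte naïve café"
my @words = non_ascii_words("caf\xE9 latte na\xEFve caf\xE9");
printf "%d words kept\n", scalar @words;
```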

detect(%params)
detect(\%params)

Determines the encoding of the supplied text. In scalar context, returns the most likely charset code. In list context returns an arrayref of charset codes, ordered from most to least likely, and a hashref of metadata. Dies if any required parameters are not supplied. The following parameters are accepted:

text          Text to determine the encoding of (required)
words         Maximum number of words to examine (default=10)
ip            User's IP address (default=$ENV{REMOTE_ADDR})
accept_lang   Accept-Language header value (required, default=$ENV{HTTP_ACCEPT_LANGUAGE})
inc_linux     Include Linux charsets? (default=0)
ranking       TODO document
os            OS name (Windows, Macintosh or Linux)
user_agent    User-Agent header value (required if os not supplied,
                  default=$ENV{HTTP_USER_AGENT})
lang          Language tag or arrayref thereof
country       Country code or arrayref thereof (required if lang not supplied)
country_extra TODO document
lang_extra    TODO document

Requires a sample text string. Can optionally be passed the number of words to try to match (default 10), the user's IP, the user's OS, the User-Agent string, the language code(s), the Accept-Language string, and whether Linux charsets should be included. For advanced use you can also adjust the way languages and charsets are ranked. Returns either a single charset (in scalar context) or a list of charsets ordered from most to least likely, with associated metadata. If Encode::Detect::Detector is available, its guess is used to improve accuracy.

For discussion of ranking heuristics and how to adjust them, see the section below.

# I'm feeling lucky
my $charset = $detector->detect( text => '...' );

# I'm feeling realistic
my ( $charset_list, $charset_meta ) = $detector->detect( text => '...' );

# Data structure example
$charset_list = [ 'x-mac-cyrillic', 'x-mac-ce', 'windows-1251', 'x-mac-ukrainian', ... ];
$charset_meta = {
    charsets => {
        'x-mac-cyrillic' => {
            pos => 1, # Ranking position
            words => [ 'Здравствуй', ... ], # Sample word list
            lang => [ 'ru', ... ], # Language tags that led to this charset
        },
        'x-mac-ce' => {
            pos => 2,
            words => [ 'ášūŗ‚ŮÚ‚ůť', ... ],
            lang => [ 'sr', ... ],
        },
        'windows-1251' => {
            pos => 3,
            words => [ '‡дравствуй', ... ],
            lang => [ 'ru', ... ],
            mozilla => 1, # In this example mozilla guessed wrong
        },
        ...
    },
    lang => {
        ru => {
            name    => 'Russian', # Language name
            both    => 1, # Matched from both country and accept_lang
            country => 1, # Matched from country (IP)
            accept  => 1, # Matched from accept_lang
            pos     => 1, # Ranking position
        },
        ...
    },
    country  => {
        name => 'Russia',
        tag  => 'ru',
    },
    error => [ 'utf-8', ... ], # Text wouldn't parse as utf-8
}
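
One way to use this structure when asking the user to pick an encoding is to show each candidate with its sample words. The sketch below works on a hand-written copy of the data above rather than a live detect() call, and charset_choices is a hypothetical helper, not a module method.

```perl
use strict;
use warnings;
use utf8;

# Hand-written sample shaped like the structure above. A real call would be:
# my ( $charset_list, $charset_meta ) = $detector->detect( text => $text );
my $charset_list = [ 'x-mac-cyrillic', 'windows-1251' ];
my $charset_meta = {
    charsets => {
        'x-mac-cyrillic' => { pos => 1, words => ['Здравствуй'], lang => ['ru'] },
        'windows-1251'   => { pos => 2, words => ['‡дравствуй'], lang => ['ru'], mozilla => 1 },
    },
    lang => { ru => { name => 'Russian', pos => 1 } },
};

# Hypothetical helper: builds one human-readable label per candidate charset,
# in ranking order, so the user can pick the one whose sample looks right.
sub charset_choices {
    my ( $list, $meta ) = @_;
    my @choices;
    for my $charset (@$list) {
        my $info = $meta->{charsets}{$charset} or next;
        my @names = map { $meta->{lang}{$_}{name} // $_ } @{ $info->{lang} || [] };
        my $label = sprintf '%s (%s): %s', $charset,
            join( ', ', @names ), join( ' ', @{ $info->{words} || [] } );
        push @choices, { charset => $charset, label => $label };
    }
    return @choices;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_->{label}\n" for charset_choices( $charset_list, $charset_meta );
```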

RANKING SYSTEM

Unfortunately the heuristics employed by this method aren't straightforward. Several key scenarios are taken into consideration, namely:

The upload charset is:

for the language that matches the browser's language settings and OS.
for the language that matches the official language of the uploader's country and OS.
for the language that matches the browser's language settings, but a different OS.
for the language that matches the official language of the uploader's country, but a different OS.
unrelated, hopefully detected by Mozilla's universal charset detector.

Although the browser's language setting is preferred, it's not unusual for it to be incorrect. For example a surprising number of UK users have en-US rather than en-GB. In such instances the language from the IP would be more accurate. For this reason, if the Mozilla detected charset matches an IP derived charset it is brought to the front.

However, an Englishman uploading a file whilst abroad would not give an accurate language from IP. Likewise, some countries like South Africa have several recognised languages. Some countries have inhabitants that use either Latin or Cyrillic alphabets for the same language. In these instances, the Mozilla detector is used to determine which is more likely, but both options will be returned.

The use of Macintosh computers has been on the rise, as has the appearance of their charsets. In fact that's what led me to write this module, as the Mozilla detector doesn't cover every encoding and was missing Mac-Roman. Generally Windows users are less likely to upload files with Macintosh encoding, although the same cannot be said the other way around. For this reason, when the OS is Macintosh its matching charsets will come first, followed by the likely Windows ones, alternating between the two.

We assume Linux systems are mostly UTF-8 these days, that their pre-UTF-8 ISO charsets were roughly the same as the Windows equivalents, and that Linux users are generally more computer savvy. For these reasons Linux charsets are not included in results by default.

Rather than ranking charsets through some kind of weighting based on appearance, we apply configurable patterns. Weighting would always favour common charsets; hopefully the ranking patterns work better.

This is the first version of this module. I'm open to suggestions regarding improved heuristics, and possibly configurable heuristics.

You can override the default ranking by passing the appropriate data structure to detect(). You need to at least provide the repeat string for lang and all the OSs.

IP country lookup and accept_language parsing are used initially to generate a list of matching languages. The order in which these are then ranked is based on their appearance (accept_lang) or popularity (country), and on the sequence given. A represents accept_lang and C represents country, so a sequence starting with AC and repeating with AC would generate ACACACACAC... until there are no matching languages left. The lang_both option ranks languages that come from both accept_lang and country first.
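
A minimal sketch of how a start/repeat sequence can interleave two source lists is shown below. The rank_by_pattern function is hypothetical, it ignores lang_both, and its handling of duplicates (a language already ranked from one source is skipped in the other) is an assumption rather than documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode::Detect::Upload.
# $queues maps a pattern letter to an arrayref of candidates, best first.
sub rank_by_pattern {
    my ( $start, $repeat, $queues ) = @_;
    my %queue = map { $_ => [ @{ $queues->{$_} } ] } keys %$queues;    # copies
    my ( @ranked, %seen );

    # Take the next unseen item from the queue for $letter; 1 on success
    my $take = sub {
        my ($letter) = @_;
        my $q = $queue{$letter} or return 0;
        while ( defined( my $item = shift @$q ) ) {
            next if $seen{$item}++;    # already ranked via another source
            push @ranked, $item;
            return 1;
        }
        return 0;
    };

    $take->($_) for split //, $start;
    while (1) {
        my $taken = 0;
        $taken += $take->($_) for split //, $repeat;
        last unless $taken;    # every queue named in the pattern is exhausted
    }
    return @ranked;
}

my @langs = rank_by_pattern(
    'AC', 'AC',
    {
        A => [qw(en-us en ru)],    # from accept_lang, in header order
        C => [qw(ru uk)],          # from country, by popularity
    }
);
print "@langs\n";
```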

Next, charsets are matched from the languages by OS, depending on which OS has been passed or detected from user_agent. The char sequences contain W for Windows, M for Macintosh or L for Linux. The Linux charsets are filtered out unless the OS is Linux or the inc_linux config option is enabled. So a Windows OS with a sequence starting WW and repeating WML would generate WWWMWMWMWM... (with the L entries filtered out), matching the 3 most likely Windows charsets first, then the most likely Macintosh, etc. Charsets are tested to see if they can decode the text; invalid ones are filtered out.
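
The final decode test can be approximated with the core Encode module, as in the sketch below. The decodable_charsets function is hypothetical, and the charset names used are ones Encode is known to accept; names such as x-mac-cyrillic may need mapping to Encode's own names first.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper, not part of Encode::Detect::Upload.
# Returns the candidate charsets that can decode $bytes without error,
# preserving the order in which they were supplied.
sub decodable_charsets {
    my ( $bytes, @charsets ) = @_;
    my @ok;
    for my $charset (@charsets) {
        my $enc = find_encoding($charset) or next;    # unknown to Encode
        my $copy = $bytes;    # decode() may modify its input when CHECK is set
        my $text = eval { $enc->decode( $copy, Encode::FB_CROAK ) };
        push @ok, $charset if defined $text;
    }
    return @ok;
}

# "\xE9t\xE9" is "été" in Latin-1 / cp1252, but is not valid UTF-8
my @valid = decodable_charsets( "\xE9t\xE9", 'UTF-8', 'cp1252', 'no-such-charset' );
print "@valid\n";
```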

The string is tested to see whether it looks like UTF-8. If it does, UTF-8 is pushed to the front of the list. If the Mozilla charset detector is available, it's used to see what charset it returns. The mozilla_move option sets how many places to move the matching charset forward in the list. The mozilla_insert option defines at what position to insert the Mozilla match if it's not already in the list.
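
The effect of the two Mozilla options on an already ranked list can be sketched like this. The apply_detector_guess function is hypothetical, and treating mozilla_insert as a 1-based position is an assumption based on the 1-based pos values in the metadata example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Encode::Detect::Upload.
# Moves $guess forward by mozilla_move places if it is already ranked,
# otherwise inserts it at position mozilla_insert (assumed to be 1-based).
sub apply_detector_guess {
    my ( $list, $guess, %opt ) = @_;
    my @ranked = @$list;
    my $move   = $opt{mozilla_move}   // 1;
    my $insert = $opt{mozilla_insert} // 3;

    my ($index) = grep { $ranked[$_] eq $guess } 0 .. $#ranked;
    my $target;
    if ( defined $index ) {
        splice @ranked, $index, 1;    # remove from its current place
        $target = $index - $move;
    }
    else {
        $target = $insert - 1;        # convert position to array index
    }
    $target = 0       if $target < 0;
    $target = @ranked if $target > @ranked;
    splice @ranked, $target, 0, $guess;
    return @ranked;
}

my @charsets = qw(cp1251 x-mac-cyrillic koi8-r cp1252);
my @moved    = apply_detector_guess( \@charsets, 'koi8-r' );
my @inserted = apply_detector_guess( \@charsets, 'iso-8859-5' );
print "@moved\n@inserted\n";
```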

my %ranking = (
    lang => {
        start  => 'AC',
        repeat => 'AC',
    },
    # Rank languages that appear in both country and accept_lang first
    lang_both => 1,
    char => {
        windows => {
            start  => 'WW',
            repeat => 'WML',
        },
        macintosh => {
            start  => 'M',
            repeat => 'MWL',
        },
        linux => {
            start  => 'LWM',
            repeat => 'LWM',
        },
    },
    # Mozilla detected charset options
    mozilla_move => 1, # Number of positions to move the match forward
    mozilla_insert => 3, # At what position to insert if it's not in list
);
my $charset = $detector->detect( text => '...', ranking => \%ranking );

LICENSE

This is released under the Artistic License. See perlartistic.

AUTHORS

Lyle Hopkins - http://www.cosmicperl.com/

Peter Haworth - http://pmh1wheel.org/

Development kindly sponsored by - http://www.greenrope.com/

REFERENCES

I had a hard time finding good data sources; all the information I needed was pretty spread out. These are the main sites I used, but there was lots of googling to fill in the gaps.

http://www.science.co.il/language/locale-codes.asp
http://www.mydigitallife.info/ansi-code-page-for-windows-system-locale-with-identifier-constants-and-strings/
http://webcheatsheet.com/html/character_sets_list.php
http://www.w3.org/International/O-charset-lang.html
http://www.eki.ee/itstandard/docs/draft-alvestrand-lang-char-03.txt
http://tlt.its.psu.edu/suggestions/international/bylanguage/index.html
http://docs.oracle.com/javase/1.5.0/docs/guide/intl/locale.doc.html
http://www-archive.mozilla.org/projects/intl/chardet.html
http://download.geonames.org/export/dump/countryInfo.txt

SEE ALSO

Encode::Detect::Detector, Encode, Geo::IP, IP::Country

TODO

Make default between Latin and Cyrillic based on popularity in language
Write some tests
Rank regions differently?
Generalize environment examination, defaults only at detect() or configurable through the detector object itself