NAME

Search::Tools::UTF8 - UTF8 string wrangling

SYNOPSIS

use Search::Tools::UTF8;

my $str = 'foo bar baz';

print "bad UTF-8 sequence: " . find_bad_utf8($str)
   unless is_valid_utf8($str);

print "bad ascii byte at position " . find_bad_ascii($str)
   unless is_ascii($str);

print "bad latin1 byte at position " . find_bad_latin1($str)
   unless is_latin1($str);

DESCRIPTION

Search::Tools::UTF8 supplies common UTF8-related functions.

FUNCTIONS

byte_length( text )

Returns the number of bytes in text regardless of encoding.
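To illustrate how a byte count can differ from a character count, here is a minimal sketch using only core Perl. The helper name sketch_byte_length is hypothetical and this is not necessarily how byte_length() is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): count the bytes in Perl's
# internal representation of the string.
sub sketch_byte_length {
    my ($text) = @_;
    use bytes;    # lexically scoped: length() now counts bytes, not characters
    return length $text;
}

# Five characters; the character above U+00FF forces UTF-8 storage internally.
my $chars = "caf\x{e9}\x{263a}";
printf "chars=%d bytes=%d\n", length($chars), sketch_byte_length($chars);
```

The two counts diverge as soon as text contains a character that needs more than one byte.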

is_valid_utf8( text )

Returns true if text is a valid sequence of UTF-8 bytes, regardless of how Perl has it flagged (is_utf8 or not).
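One way to approximate this kind of check with core modules is a strict decode that croaks on malformed input. The sketch below assumes text is a byte string. sketch_is_valid_utf8 is a hypothetical name, and the module itself may use a different technique.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the octets in $bytes are well-formed UTF-8.
# Assumes a byte string, not an already-decoded character string.
sub sketch_is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its input when CHECK is set
    my $ok = eval { Encode::decode( 'UTF-8', $copy, Encode::FB_CROAK ); 1 };
    return $ok ? 1 : 0;
}

print sketch_is_valid_utf8("\xd9\xa6") ? "valid\n" : "invalid\n";
print sketch_is_valid_utf8("\xff")     ? "valid\n" : "invalid\n";
```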

is_ascii( text )

Returns true (1) if text contains no bytes above 127, otherwise returns false (0). Used by convert() internally to check text prior to transliterating.
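The rule described above can be expressed as a single character class. This is a sketch with a hypothetical helper name, not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if nothing in $text is above 127, else 0.
sub sketch_is_ascii {
    my ($text) = @_;
    return $text =~ /[^\x00-\x7f]/ ? 0 : 1;
}

print sketch_is_ascii('foo bar baz'), "\n";
print sketch_is_ascii("caf\xe9"),     "\n";
```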

is_latin1( text )

Returns true if text lies within the Latin1 charset.

NOTE: Only Latin1 octets with a valid representable character are checked. Octets in the range \x80 - \x9f are not considered valid Latin1 and if found in text, is_latin1() will return false.

CAUTION: A string of bytes can be both valid Latin1 and valid UTF-8, even though the two interpretations yield different Unicode codepoints. Example:

my $str = "\x{d9}\x{a6}";  # the UTF-8 encoding of U+0666
is_valid_utf8($str);       # returns true
is_latin1($str);           # returns true

Thus is_latin1() (and likewise find_bad_latin1()) is not foolproof. Use it in combination with is_flagged_utf8() to get a better test.
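The ambiguity can be made visible by decoding the same two bytes under each assumption with the core Encode module. The helper sketch_codepoints is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode $bytes as $encoding and list the codepoints.
sub sketch_codepoints {
    my ( $encoding, $bytes ) = @_;
    my $copy  = $bytes;                      # keep the caller's string untouched
    my $chars = decode( $encoding, $copy );
    return join ' ', map { sprintf( 'U+%04X', ord($_) ) } split //, $chars;
}

my $bytes = "\xd9\xa6";
print 'as UTF-8:  ', sketch_codepoints( 'UTF-8',      $bytes ), "\n";
print 'as Latin1: ', sketch_codepoints( 'ISO-8859-1', $bytes ), "\n";
```

Read as UTF-8 the bytes are one character; read as Latin1 they are two. Nothing in the bytes themselves says which reading was intended.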

is_flagged_utf8( text )

Returns true if Perl thinks text is UTF-8. Same as Encode::is_utf8().

is_perl_utf8_string( text )

Wrapper around the native Perl is_utf8_string() function. Called by is_valid_utf8().

is_sane_utf8( text [,warnings] )

Tests for doubly encoded text. Returns true if text looks ok. From the Test::utf8 docs:

Strings that are not utf8 always automatically pass.

Pass a second true param to get diagnostics on stderr.
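For readers unfamiliar with double encoding, the sketch below shows how it typically arises: UTF-8 bytes are mistaken for Latin1 and encoded a second time. It uses only the core Encode module. Both helper names are hypothetical, and the code demonstrates the problem rather than the detection logic in is_sane_utf8().

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub sketch_hex {
    my ($bytes) = @_;
    return join ' ', map { sprintf( '%02x', $_ ) } unpack( 'C*', $bytes );
}

# Hypothetical helper: encode to UTF-8, then wrongly treat the result as
# Latin1 and encode it again.
sub sketch_double_encode {
    my ($chars) = @_;
    my $once = encode_utf8($chars);
    return encode_utf8( decode( 'ISO-8859-1', $once ) );
}

my $e_acute = "\x{e9}";
print 'once:  ', sketch_hex( encode_utf8($e_acute) ),          "\n";
print 'twice: ', sketch_hex( sketch_double_encode($e_acute) ), "\n";
```

The doubly encoded form is still valid UTF-8, which is why a separate sanity check is useful.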

find_bad_utf8( text )

Returns string of bad bytes from text. This of course assumes that text is not valid UTF-8, so use it like:

croak "bad bytes: " . find_bad_utf8($str)
   unless is_valid_utf8($str);

If text is a valid UTF-8 string, returns undef.

find_bad_ascii( text )

Returns the position of the first non-ASCII byte, or -1 if text is all ASCII.
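A position lookup of this kind can be sketched with a regex match and the @- array. The helper name is hypothetical, and the zero-based offset it reports is an assumption of the sketch; the module's own convention may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: zero-based offset of the first non-ASCII element,
# or -1 if there is none. The zero-based convention is an assumption.
sub sketch_find_bad_ascii {
    my ($text) = @_;
    return $text =~ /[^\x00-\x7f]/ ? $-[0] : -1;
}

print sketch_find_bad_ascii("caf\xe9"), "\n";
print sketch_find_bad_ascii('cafe'),    "\n";
```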

find_bad_latin1( text )

Returns the position of the first non-Latin1 byte, or -1 if text is valid Latin1.

find_bad_latin1_report( text )

Returns the position of the first non-Latin1 byte (like find_bad_latin1()) and also carps with the decimal and hex values of the bad byte.

to_utf8( text, charset )

Shorthand for running text through appropriate is_*() checks and then converting to UTF-8 if necessary. Returns text encoded and flagged as UTF-8.

Returns undef if the encoding fails or the result does not pass is_sane_utf8().
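The general shape of such a conversion can be sketched with the core Encode module. The helper below is hypothetical, requires an explicit charset, and omits the is_*() and is_sane_utf8() checks described above, so treat it as an outline rather than the module's behaviour.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode bytes in $charset into a Perl character
# string, or return undef if decoding fails.
sub sketch_to_utf8 {
    my ( $text, $charset ) = @_;
    return $text if utf8::is_utf8($text);    # already a character string
    my $copy    = $text;                     # decode() may modify its input
    my $decoded = eval { Encode::decode( $charset, $copy, Encode::FB_CROAK ) };
    return $decoded;                         # undef when decode() croaked
}

my $str = sketch_to_utf8( "caf\xe9", 'iso-8859-1' );
print defined $str ? "converted\n" : "failed\n";
```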

looks_like_cp1252( text )

Tests whether text contains bytes between 0x80 and 0x9f inclusive. The Windows-1252 character set uses those bytes for printable characters, including troublesome ones such as curly quotes.

See also fix_cp1252_codepoints_in_utf8() and the Search::Tools::Transliterate convert1252() method.
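The byte-range test described for looks_like_cp1252() can be sketched as a one-line regex. The helper name is hypothetical and the real function may apply additional checks.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text contains anything in 0x80-0x9f, else 0.
sub sketch_has_cp1252_range {
    my ($text) = @_;
    return $text =~ /[\x80-\x9f]/ ? 1 : 0;
}

print sketch_has_cp1252_range("\x93quoted\x94"), "\n";    # 0x93 and 0x94 are curly quotes in Windows-1252
print sketch_has_cp1252_range('plain'),          "\n";
```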

fix_cp1252_codepoints_in_utf8( text )

The Windows-1252 codepoints between 0x80 and 0x9f may be encoded validly as UTF-8, but Unicode assigns that range to C1 control characters rather than to the printable characters (such as curly quotes) that Windows-1252 places there. fix_cp1252_codepoints_in_utf8() takes a UTF-8 encoded string text and maps the suspect 1252 codepoints to their correct Unicode representations.

Note that fix_cp1252_codepoints_in_utf8() is different from the fix_latin() function used in Transliterate, which does not differentiate between a Windows-1252 encoded string and a UTF-8 encoded string.

This function will croak if text does not pass is_valid_utf8().
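The remapping idea can be sketched with the core Encode module: each character in the U+0080 to U+009F range is reinterpreted as a single Windows-1252 byte and decoded. The helper name is hypothetical, it operates on an already-decoded character string, and it is not the module's implementation.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace characters U+0080..U+009F with the
# characters Windows-1252 assigns to the same byte values.
sub sketch_fix_cp1252 {
    my ($chars) = @_;
    $chars =~ s/([\x{80}-\x{9f}])/decode( 'cp1252', chr( ord($1) ) )/ge;
    return $chars;
}

my $fixed = sketch_fix_cp1252("\x{93}hi\x{94}");
printf "U+%04X\n", ord($fixed);    # codepoint of the first character after remapping
```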

debug_bytes( text )

Iterates over each byte in text, printing byte, hex and decimal values to stderr.
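A byte dump of this kind can be sketched with unpack. The helper name and the tab-separated output layout are assumptions of this sketch; the format printed by debug_bytes() itself is not specified here.

```perl
use strict;
use warnings;

# Hypothetical helper: one line per byte, with the byte (or '.' if it is
# not printable ASCII), its hex value and its decimal value.
sub sketch_byte_report {
    my ($text) = @_;
    my $bytes = $text;
    utf8::encode($bytes) if utf8::is_utf8($bytes);    # inspect the UTF-8 bytes of character strings
    return map {
        my $shown = ( $_ >= 0x20 && $_ < 0x7f ) ? chr($_) : '.';
        sprintf( "%s\t0x%02x\t%d", $shown, $_, $_ );
    } unpack( 'C*', $bytes );
}

print STDERR "$_\n" for sketch_byte_report("caf\xe9");
```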

AUTHOR

Peter Karman <karman@cpan.org>

Originally based on the HTML::HiLiter regular expression building code, by the same author, copyright 2004 by Cray Inc.

Thanks to Atomic Learning www.atomiclearning.com for sponsoring the development of some of these modules.

Many of the UTF-8 tests come directly from Test::utf8.

BUGS

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Search::Tools

COPYRIGHT

Copyright 2006-2009 by Peter Karman.

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

HTML::HiLiter, SWISH::HiLiter, Class::XSAccessor, Text::Aspell