NAME
Unicode::UTF8 - Encoding and decoding of UTF-8 encoding form.
SYNOPSIS
use Unicode::UTF8 qw[decode_utf8 encode_utf8];
$string = decode_utf8($octets);
$octets = encode_utf8($string);
DESCRIPTION
This module provides functions to encode and decode UTF-8 encoding form as specified by Unicode and ISO/IEC 10646:2011.
Here is a summary of features for comparison with Encode:
simple API which integrates with warnings category
utf8
recognizes all noncharacters regardless of perl version
implements Unicode's recommended practice for using U+FFFD
helpful diagnostics in warnings messages
detects and reports inconsistency in perl's internal encoding (UTF-X)
preserves taintedness of decoded
$octets
or encoded$string
better performance ~ 600% - 1200% (JA: 600%, AR: 700%, SV: 900%, EN: 1200%, see benchmarks directory in git repository)
FUNCTIONS
decode_utf8($octets [, $fallback])
Returns an decoded representation of $octets
in UTF-8 encoding as a character string.
Issues a warning using warnings category utf8
if $octets
contains ill-formed UTF-8 sequences or encoded code points which can't be interchanged.
$fallback
is a CODE
reference which provides a error-handling mechanism, allowing customization of error handling. The default error-handling mechanism is to replace any ill-formed UTF-8 sequences or encoded code points which can't be interchanged with REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($octets, $is_usv);
$fallback
is invoked with two arguments, $octets
and $is_usv
. $octets
constains a sequence of one or more octets containing the maximal subpart of the ill-formed subsequence or encoded code point which can't be interchanged. $is_usv
is a boolean indicating whether or not $octets
represent a encoded Unicode scalar value. $fallback
must return a character string consisting of zero or more Unicode characters.
encode_utf8($string [, $fallback])
Returns an encoded representation of $string
in UTF-8 encoding as an octet string.
Issues a warning using warnings category utf8
if $string
contains code points which can't be interchanged or represented in UTF-8 encoding form.
$fallback
is a CODE
reference which provides a error-handling mechanism, allowing customization of error handling. The default error-handling mechanism is to replace any code points which can't be interchanged or represented in UTF-8 encoding form with REPLACEMENT CHARACTER (U+FFFD).
$string = $fallback->($codepoint, $is_usv);
$fallback
is invoked with two arguments, $codepoint
and $is_usv
. $codepoint
is a unsigned integer containing the code point which can't be interchanged or represented in UTF-8 encoding form. $is_usv
is a boolean indicating whether or not $codepoint
is a Unicode scalar value. $fallback
must return a character string consisting of zero or more Unicode characters.
EXPORTS
None by default. All functions can be exported using the :all
tag or individually.
DIAGNOSTICS
- Can't decode a wide character string
-
(F) Wide character in octets.
- Can't decode ill-formed UTF-8 octet sequence <%s> in position %u
-
(W utf8) Encountered an ill-formed UTF-8 octet sequence. <%s> contains a hexadecimal representation of the maximal subpart of the ill-formed subsequence.
- Can't interchange noncharacter code point U+%.4X at position %u
-
(W utf8) Noncharacters is permanently reserved for internal use and that should never be interchanged. Noncharacters consist of the values U+nFFFE and U+nFFFF (where n is from 0 to 10^16) and the values U+FDD0..U+FDEF.
- Can't represent surrogate code point U+%.4X at position %u in UTF-8 encoding form
-
(W utf8) Surrogate code points are designated only for surrogate code units in the UTF-16 character encoding form. Surrogates consist of code points in the range U+D800 to U+DFFF.
- Can't represent super code point \x{%X} at position %u in UTF-8 encoding form
-
(W utf8) Code points greater than U+10FFFF. Perl's extended codespace.
- Can't decode ill-formed UTF-X octet sequence <%s> in position %u
-
(F) Encountered an ill-formed octet sequence in Perl's internal representation of wide characters.
SEE ALSO
SUPPORT
BUGS
Please report any bugs by email to bug-unicode-utf8 at rt.cpan.org
, or through the web interface at http://rt.cpan.org/Public/Dist/Display.html?Name=Unicode-UTF8. You will be automatically notified of any progress on the request by the system.
SOURCE CODE
This is open source software. The code repository is available for public review and contribution under the terms of the license.
http://github.com/chansen/p5-unicode-utf8
git clone http://github.com/chansen/p5-unicode-utf8
AUTHOR
Christian Hansen chansen@cpan.org
COPYRIGHT
Copyright 2011 by Christian Hansen.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.