NAME

Encode - character encodings

TERMINOLOGY

  • char: a character in the range 0..maxint (at least 2**32-1)

  • byte: a character in the range 0..255

The marker [INTERNAL] flags internal implementation details. These are generally meant only for those who think they know what they are doing, and they may change in future releases.
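To make the char/byte distinction above concrete, the following sketch counts both for the same text. It assumes the encode() function of the released Encode module, which is not one of the functions listed in this document; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A string of two chars: U+00E9 and U+263A (the latter is above 255).
my $chars = "\x{e9}\x{263a}";
my $char_count = length $chars;            # counts chars

# Encoding to UTF-8 yields a string of bytes, each in the range 0..255.
my $bytes = encode('UTF-8', $chars);
my $byte_count = length $bytes;            # counts bytes

printf "%d chars, %d bytes\n", $char_count, $byte_count;
```

Here U+00E9 takes two bytes in UTF-8 and U+263A takes three, so the same text is 2 chars but 5 bytes.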

bytes

  • bytes_to_utf8(STRING [, FROM])

    The bytes in STRING are recoded in-place into UTF-8. If no FROM is specified the bytes are expected to be encoded in US-ASCII or ISO 8859-1 (Latin 1). Returns the new size of STRING, or undef if there's a failure.

    [INTERNAL] Also the UTF-8 flag of STRING is turned on.

  • utf8_to_bytes(STRING [, TO [, CHECK]])

    The UTF-8 in STRING is decoded in-place into bytes. If no TO encoding is specified the bytes are expected to be encoded in US-ASCII or ISO 8859-1 (Latin 1). Returns the new size of STRING, or undef if there's a failure.

    If STRING contains characters greater than 255, or if the UTF-8 in STRING is malformed, see "Handling Malformed Data".

    [INTERNAL] The UTF-8 flag of STRING is not checked.
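As a rough illustration of the in-place recoding described in this section, the sketch below converts Latin 1 bytes to UTF-8 and back. It assumes from_to() as shipped in the released Encode module as a stand-in for bytes_to_utf8() and utf8_to_bytes(); whether those two behave identically is an assumption, not something this document states.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# "cafe" with an e-acute, as four ISO 8859-1 bytes (\xe9 is e-acute).
my $string = "caf\xe9";

# Recode in place to UTF-8: e-acute becomes the two bytes \xc3\xa9.
my $utf8_size = from_to($string, 'ISO-8859-1', 'UTF-8');
my $utf8_copy = $string;

# Recode in place back to ISO 8859-1.
my $latin1_size = from_to($string, 'UTF-8', 'ISO-8859-1');

print "UTF-8 size: $utf8_size, Latin 1 size: $latin1_size\n";
```

The return value in both directions is the new size of the string, which grows from 4 to 5 and shrinks back to 4.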

chars

  • chars_to_utf8(STRING)

    The chars in STRING are encoded in-place into UTF-8. Returns the new size of STRING, or undef if there's a failure.

    No assumptions are made on the encoding of the chars. If you want to assume that the chars are Unicode and to trap illegal Unicode characters, you must use from_to('Unicode', ...).

    [INTERNAL] Also the UTF-8 flag of STRING is turned on.

  • utf8_to_chars(STRING)

    The UTF-8 in STRING is decoded in-place into chars. Returns the new size of STRING, or undef if there's a failure.

    If the UTF-8 in STRING is malformed, undef is returned and an optional lexical warning (category utf8) is given.

    [INTERNAL] The UTF-8 flag of STRING is not checked.

  • utf8_to_chars_check(STRING [, CHECK])

    (Note the special naming of this interface: a two-argument utf8_to_chars() has different semantics.)

    The UTF-8 in STRING is decoded in-place into chars. Returns the new size of STRING, or undef if there is a failure.

    If the UTF-8 in STRING is malformed, see "Handling Malformed Data".

    [INTERNAL] The UTF-8 flag of STRING is not checked.
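The in-place conversion between chars and UTF-8 can be tried with the core functions utf8::encode() and utf8::decode(). This is only a sketch: those core functions are assumed here to be close analogues of chars_to_utf8() and utf8_to_chars(), and they report success as true or false rather than returning a size.

```perl
use strict;
use warnings;

# One char above 255: U+263A.
my $string = "\x{263a}";

# Encode the chars in place into UTF-8 bytes.
utf8::encode($string);
my $encoded_size = length $string;         # 3 bytes: e2 98 ba

# Decode the UTF-8 bytes in place back into chars.
my $ok = utf8::decode($string);
my $decoded_size = length $string;         # 1 char

# A truncated sequence is malformed: utf8::decode() returns false
# and leaves the string as it was.
my $bad = "\xe2\x98";
my $bad_ok = utf8::decode($bad);

printf "encoded=%d decoded=%d ok=%d bad_ok=%d\n",
    $encoded_size, $decoded_size, $ok ? 1 : 0, $bad_ok ? 1 : 0;
```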

chars With Encoding

  • chars_to_utf8(STRING, FROM [, CHECK])

    The chars in STRING encoded in FROM are recoded in-place into UTF-8. Returns the new size of STRING, or undef if there's a failure.

    No assumptions are made on the encoding of the chars. If you want to assume that the chars are Unicode and to trap illegal Unicode characters, you must use from_to('Unicode', ...).

    [INTERNAL] Also the UTF-8 flag of STRING is turned on.

  • utf8_to_chars(STRING, TO [, CHECK])

    The UTF-8 in STRING is decoded in-place into chars encoded in TO. Returns the new size of STRING, or undef if there's a failure.

    If the UTF-8 in STRING is malformed, see "Handling Malformed Data".

    [INTERNAL] The UTF-8 flag of STRING is not checked.

  • bytes_to_chars(STRING, FROM [, CHECK])

    The bytes in STRING encoded in FROM are recoded in-place into chars. Returns the new size of STRING in bytes, or undef if there's a failure.

    If the mapping is impossible, see "Handling Malformed Data".

  • chars_to_bytes(STRING, TO [, CHECK])

    The chars in STRING are recoded in-place to bytes encoded in TO. Returns the new size of STRING in bytes, or undef if there's a failure.

    If the mapping is impossible, see "Handling Malformed Data".

  • from_to(STRING, FROM, TO [, CHECK])

    The chars in STRING encoded in FROM are recoded in-place into TO. Returns the new size of STRING, or undef if there's a failure.

    If mapping between the encodings is impossible, see "Handling Malformed Data".

    [INTERNAL] If TO is UTF-8, also the UTF-8 flag of STRING is turned on.
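The relationship between the functions in this section can be sketched as follows: bytes in FROM become chars, chars become bytes in TO, and from_to() does both at once. The example assumes decode(), encode(), and from_to() from the released Encode module and the encoding names 'koi8-r' and 'cp1251'; decode() and encode() return new strings rather than working in place, so they only approximate bytes_to_chars() and chars_to_bytes().

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

# Cyrillic capital A and small a as KOI8-R bytes.
my $koi8 = "\xe1\xc1";

# Bytes encoded in FROM become chars ...
my $chars = decode('koi8-r', $koi8);       # U+0410 U+0430

# ... and chars become bytes encoded in TO.
my $cp1251 = encode('cp1251', $chars);     # \xc0 \xe0

# from_to() performs both steps in place and returns the new size in bytes.
my $string = $koi8;
my $size = from_to($string, 'koi8-r', 'cp1251');

printf "%d bytes after recoding\n", $size;
```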

Testing For UTF-8

  • is_utf8(STRING [, CHECK])

    [INTERNAL] Test whether the UTF-8 flag is turned on in the STRING. If CHECK is true, also checks the data in STRING for being well-formed UTF-8. Returns true if successful, false otherwise.
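A short sketch of the flag test, assuming the is_utf8() and decode() functions of the released Encode module behave as described above. A plain byte string has the flag off, while a string decoded from non-ASCII UTF-8 has it on.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

my $bytes = "\xe2\x98\xba";                # UTF-8 bytes for U+263A
my $chars = decode('UTF-8', $bytes);       # one char

my $bytes_flag = is_utf8($bytes) ? 1 : 0;  # flag is off on raw bytes
my $chars_flag = is_utf8($chars) ? 1 : 0;  # flag is on after decoding

# With a true CHECK the data is also verified to be well-formed UTF-8.
my $chars_checked = is_utf8($chars, 1) ? 1 : 0;

print "bytes=$bytes_flag chars=$chars_flag checked=$chars_checked\n";
```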

Toggling UTF-8-ness

  • on_utf8(STRING)

    [INTERNAL] Turn on the UTF-8 flag in STRING. The data in STRING is not checked for being well-formed UTF-8. Do not use unless you know that the STRING is well-formed UTF-8. Returns the previous state of the UTF-8 flag (so please do not treat the return value as indicating success or failure), or undef if STRING is not a string.

  • off_utf8(STRING)

    [INTERNAL] Turn off the UTF-8 flag in STRING. Do not use frivolously. Returns the previous state of the UTF-8 flag (so please do not treat the return value as indicating success or failure), or undef if STRING is not a string.
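The effect of toggling the flag, and the meaning of the return value, can be sketched like this. The released Encode module names these functions _utf8_on() and _utf8_off(); treating them as equivalents of on_utf8() and off_utf8() is an assumption made for this example.

```perl
use strict;
use warnings;
use Encode ();

my $string = "\xe2\x98\xba";                   # well-formed UTF-8 for U+263A

# The return value is the previous flag state, not a success indicator.
my $before_on = Encode::_utf8_on($string);     # false: the flag was off
my $as_chars = length $string;                 # now seen as 1 char

my $before_off = Encode::_utf8_off($string);   # true: the flag was on
my $as_bytes = length $string;                 # seen as 3 bytes again

print "chars=$as_chars bytes=$as_bytes\n";
```

Note that the first call returns a false value even though it succeeded, which is why the return value should not be read as success or failure.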

UTF-16 and UTF-32 Encodings

  • utf_to_utf(STRING, FROM, TO [, CHECK])

    The data in STRING is converted from Unicode Transfer Encoding FROM to Unicode Transfer Encoding TO. Both FROM and TO may be any of the following tags (case-insensitive, with or without 'utf' or 'utf-' prefix):

    tag             meaning
    
    '7'             UTF-7
    '8'             UTF-8
    '16be'          UTF-16 big-endian
    '16le'          UTF-16 little-endian
    '16'            UTF-16 native-endian
    '32be'          UTF-32 big-endian
    '32le'          UTF-32 little-endian
    '32'            UTF-32 native-endian

    UTF-16 is also loosely known as UCS-2 (16-bit, or 2-byte, chunks), and UTF-32 as UCS-4 (32-bit, or 4-byte, chunks). Returns the new size of STRING, or undef if there's a failure.

    If FROM is UTF-8 and the UTF-8 in STRING is malformed, see "Handling Malformed Data".

    [INTERNAL] Even if CHECK is true and FROM is UTF-8, the UTF-8 flag of STRING is not checked. If TO is UTF-8, also the UTF-8 flag of STRING is turned on. Identical FROM and TO are fine.
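Conversions between the Unicode transfer encodings can be sketched with from_to() from the released Encode module, used here as an assumed stand-in for utf_to_utf(). The encoding names 'UTF-16BE' and 'UTF-32LE' are the released module's spellings, not the short tags from the table above. The second character lies outside the 16-bit range, so its UTF-16 form takes two 2-byte chunks.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# U+263A as UTF-8 (3 bytes) becomes one 2-byte chunk in UTF-16.
my $bmp = "\xe2\x98\xba";
my $bmp_size = from_to($bmp, 'UTF-8', 'UTF-16BE');        # \x26\x3a

# U+1F600 as UTF-8 (4 bytes) needs a surrogate pair in UTF-16.
my $astral = "\xf0\x9f\x98\x80";
my $astral_size = from_to($astral, 'UTF-8', 'UTF-16BE');  # \xd8\x3d\xde\x00

# The same character as one 4-byte chunk in little-endian UTF-32.
my $utf32 = "\xf0\x9f\x98\x80";
my $utf32_size = from_to($utf32, 'UTF-8', 'UTF-32LE');    # \x00\xf6\x01\x00

print "sizes: $bmp_size $astral_size $utf32_size\n";
```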

Handling Malformed Data

If CHECK is not set, undef is returned; if the data is supposed to be UTF-8, an optional lexical warning (category utf8) is also given. If CHECK is true but not a code reference, the function dies. If CHECK is a code reference, it is called with the arguments:

(MALFORMED_STRING, STRING_FROM_SO_FAR, STRING_TO_SO_FAR)

Two return values are expected from the call: the string to be used in the result string in place of the malformed section, and the length of the malformed section in bytes.
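The two true-CHECK modes can be tried with encode() from the released Encode module, with the caveat that its interface differs from the one described here: its callback receives the code point of the offending character (not the three arguments above) and returns only the replacement string. The sketch below is therefore an approximation under that assumption.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

my $chars = "caf\x{e9}";    # e-acute has no US-ASCII mapping

# CHECK true but not a code reference: the call dies.
my $copy = $chars;
my $died = eval { encode('ascii', $copy, FB_CROAK); 1 } ? 0 : 1;

# CHECK as a code reference: its return value replaces the bad section.
my $patched = encode('ascii', $chars, sub { sprintf '<U+%04X>', shift });

print "died=$died patched=$patched\n";
```

Wrapping the dying call in eval keeps the failure local, while the callback form lets the caller choose a replacement for each unmappable character.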
