The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

ODF::lpOD_Helper::Unicode - Unicode and ODF::lpOD (and XML::Twig)

SYNOPSIS

use ODF::lpOD;
use ODF::lpOD_Helper;
use feature 'unicode_strings';

INTRODUCTION

We once thought Unicode forced us to fiddle with bytes to handle "international" characters. That thinking came from low-level languages like C.

Perl saved us but it took years before everyone believed. And more years before Perl's Unicode paradigm was widely understood.

Meanwhile lots of pod and code was written which in hindsight is confused or misleading.

THE PERL UNICODE PARADIGM

1. "Decode" input from binary into Perl characters as soon as possible 
   after getting it from the outside world (e.g. from a disk file).

2. As much of the application as possible works with Perl characters, 
   paying absolutely no attention to encoding.

3. "Encode" Perl characters into binary data as late as possible, 
   just before sending the data out.

See "Not always so tidy" below for more discussion.

ODF::lpOD

For historical reasons ODF::lpOD by default is incompatible with the above paradigm because every method encodes result strings into UTF-8 before returning them to you, and attempts to decode strings you pass in before using them. Therefore you must work with binary octets rather than abstract characters; Regex match, substr(), length(), and comparison with string literals do not work with non-ASCII/latin1 characters. Also, you can't print such already-encoded binary to STDOUT if that file handle auto-encodes because the data will be encoded twice.

use ODF::lpOD_Helper disables ODF::lpOD's internal encoding and decoding. Methods then speak and listen in characters, not octets.

You should also use feature 'unicode_strings'; to avoid problematic aspects of legacy Perl behavior.

IF LEGACY BEHAVIOR IS NEEDED

The import tag :bytes will tell ODF::lpOD_Helper to not disable internal encoding & decoding, i.e. retain the original ODF::lpOD behavior.

It is also possible to toggle between the old and new behavior at run-time: lpod->Huse_octet_strings() will will re-enable implicit decoding/encoding of method arguments (using UTF-8 encoding by default) and lpod->Huse_character_strings() will disable the old behavior and restore transparent Unicode support.

Prior to version 6.000 character mode was not the default, but required a :chars import tag. That tag is now deprecated and produces a warning.

XML::Twig & XML::Parser

XML::Twig's docs wrongly say that it stores strings in memory encoded as UTF-8, implying that a decode operation would be needed to obtain characters; in fact XML::Twig stores strings as characters (the result of the decode done by parse()), and most methods return those Perl characters, not encoded binary.

An exception is XML::Twig's print() and sprint which encode their results. The encoding used is the same as was found when parsing the input file unless you change it, and the generated XML header specifies the encoding used. This makes sense when printing directly to a disk file (via a non-encoding file handle), but will corrupt non-ASCII characters printed to STDOUT if STDOUT is set to auto-encode, as is usually the case.

Going the other way, XML::Twig's parse() method expects binary octets as input and uses the "encoding=..." specification in the XML header to know how to decode the remainder of the file.

Not always so tidy

Perl's data model now has two kinds of strings: binary octets, which might or might not be some representation of Unicode characters, and abstract characters which are opaque atoms. In most cases these two kinds should never be combined, see man perlunifaq.

Decoding means converting binary octets into abstract characters, and encoding does the reverse.

Obviously Perl somehow represents abstract characters internally, and it often (not always) uses a variation of UTF-8; as a result, "decoding" or "encoding" between abstract characters and UTF-8 octet streams can be very fast.

Non-Unicode-aware scripts usually "just work" as long as they never see characters outside the ASCII range. This is because, without a decode operation or literal strings with "wide" characters, Perl treats all strings as "binary"; the behaviour of such strings is usually correct when all characters have single-byte representations.

The "wide character" warning

If a non-ASCII/latin1 character does get into a Perl string which is written to a non-encoding file handle, Perl issues a warning because that should never happen. It means the program is both Unicode-unaware and uses characters which require Unicode processing to work correctly. See man perlunifaq.

Unicode-aware programs must always encode character strings before writing to disk either explicitly or by using an encoding file handle.

Non-unicode-aware programs (which neither decode or encode), will "just work" with ASCII data as described earlier.

(END)