NAME

docs/pdds/pdd28_character_sets.pod - Strings and character sets

ABSTRACT

This PDD describes the conventions expected of users of Parrot strings, including but not limited to support for multiple character sets, encodings and languages.

VERSION

$Revision: 25243 $

DESCRIPTION

Here is a summary of the design decisions described in this PDD.

  • Parrot supports multiple string formats, and so users of Parrot strings must be aware at all times of string encoding issues and how these relate to the string interface.

  • The native Parrot string format is an array of 32-bit Unicode codepoints in grapheme normalization form (NFG).

  • NFG is defined as a normalization which allocates at most one codepoint to each visible character.

  • An interface is defined for interacting with Parrot strings and converting between character sets and encodings.

Encoding awareness

Parrot was designed from the outset to support multiple string formats. Unlike other such projects, we don't standardize on Unicode internally. This is because for the majority of use cases, it's still far more efficient to deal with whatever input data the user sends us, which, equally in the majority of use cases, is something like ASCII - or at least, some kind of byte-based rather than character-based encoding.

So internally, consumers of Parrot strings have to be aware that there is a plurality of string encodings going on inside Parrot. (Producers of Parrot strings can do whatever is most efficient for them.) The implications of this for the internal API will be detailed in the implementation section below, but to put it in simple terms: if you find yourself writing *s++ or any other C string idioms, you need to stop and think if that's what you really mean. Not everything is byte-based any more.

However, we're going to try to make it as easy for *s++-minded people as possible, and part of that is the declaration of a Parrot native string format. You don't have to use it, but if you do all your dreams will come true.

Native string format

Dealing with variable-byte encodings is not fun; for instance, you need to do a bunch of computations every time you traverse a string. In order to make programming a lot easier, we define a Parrot native string format to be an array of unsigned 32-bit Unicode codepoints. This is equivalent to UCS-4 except for the normalization form semantics described below.

This means that if you've done the necessary checks, and hence you know you're dealing with a Parrot native string, then you can continue to program in the usual C idioms - for the most part. Of course you'll need to be careful with your comparisons, since what you'll be getting back will be a Parrot_UInt4 instead of a char.
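As a minimal sketch of what "the usual C idioms" look like on a native string, the function below walks a buffer of 32-bit codepoints with *s++. The codepoint_t typedef is only a stand-in for Parrot_UInt4 (assumed here to be an unsigned 32-bit integer), and count_codepoint is a hypothetical helper, not part of any Parrot API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for Parrot_UInt4; assumed to be an unsigned 32-bit integer. */
typedef uint32_t codepoint_t;

/* Count how often `wanted` occurs in a buffer of `len` codepoints.
 * The length is passed explicitly rather than relying on a terminator. */
size_t count_codepoint(const codepoint_t *s, size_t len, codepoint_t wanted)
{
    size_t n = 0;

    assert(s != NULL);
    while (len-- > 0) {
        /* Each step advances by one whole codepoint, not by one byte. */
        if (*s++ == wanted)
            n++;
    }
    return n;
}
```

The comparison is made between two unsigned 32-bit values. Comparing against a plain char is where care is needed: on platforms where char is signed, a byte above 0x7F is negative and will not compare equal to the codepoint you expect.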

Grapheme normalization form

Unicode characters can be expressed in a number of different ways according to the Unicode Standard. This is partly to do with maintaining compatibility with existing character encodings. For instance, in Serbo-Croatian and Slovenian, there's a letter which looks like an i without the dot but with two grave (`) accents. If you have an especially good POD renderer, you can see it here: ȉ.

There are two ways you can represent this in Unicode. You can use character 0x209, also known as LATIN SMALL LETTER I WITH DOUBLE GRAVE, which does the job all in one go. This is called a "composed" character, as opposed to its equivalent decomposed sequence: LATIN SMALL LETTER I (0x69) followed by COMBINING DOUBLE GRAVE ACCENT (0x30F).

Unicode standardizes, in a number of "normalization forms", which representation you should use. We're using an extension of Normalization Form C (NFC), which basically says: decompose everything, then re-compose as much as you can. So if you see the integer stream 0x69 0x30F, it needs to be replaced by 0x209. This means that Parrot string data structures need to keep track of what normalization form a given string is in, and Parrot must provide functions to convert between normalization forms.
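The re-composition step can be sketched as follows. This is an illustration only: the table holds the single pair discussed above, whereas real NFC first decomposes and reorders the input and uses the full Unicode composition data. The names compose_entry, compose_pair and compose_in_place are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t base;      /* base character */
    uint32_t mark;      /* combining mark that follows it */
    uint32_t composed;  /* single composed codepoint */
} compose_entry;

/* Illustrative one-entry table; not the real Unicode data. */
static const compose_entry compose_table[] = {
    { 0x0069, 0x030F, 0x0209 }  /* i + COMBINING DOUBLE GRAVE ACCENT */
};

/* Return the composed codepoint for (base, mark), or 0 if none is known. */
static uint32_t compose_pair(uint32_t base, uint32_t mark)
{
    size_t i;

    for (i = 0; i < sizeof compose_table / sizeof compose_table[0]; i++) {
        if (compose_table[i].base == base && compose_table[i].mark == mark)
            return compose_table[i].composed;
    }
    return 0;
}

/* Replace adjacent composable pairs in place; returns the new length. */
size_t compose_in_place(uint32_t *s, size_t len)
{
    size_t in = 0, out = 0;

    assert(s != NULL);
    while (in < len) {
        uint32_t c = s[in++];

        if (in < len) {
            uint32_t composed = compose_pair(c, s[in]);
            if (composed != 0) {
                c = composed;
                in++;           /* the combining mark has been consumed */
            }
        }
        s[out++] = c;
    }
    return out;
}
```

Given the stream 0x68 0x69 0x30F 0x21, the function leaves 0x68 0x209 0x21 and returns 3. A pair with no table entry, such as 0x438 0x30F, is left as two codepoints.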

Now, Serbo-Croatian is sometimes also written with Cyrillic letters rather than Latin letters. The Cyrillic equivalent of the above character is not part of Unicode, but would be specified as a decomposed pair CYRILLIC SMALL LETTER I (0x438) COMBINING DOUBLE GRAVE ACCENT (0x30F). (This PDD does not require Parrot to convert strings between differing political sensibilities.) However, although it is expressed as two characters even in NFC, it is displayed as one and is still a single character as far as a human reader is concerned.

Hence we introduce the distinction between a "character" (a single Unicode codepoint) and a "grapheme" (what a human reader sees as one character). This is a Parrot distinction - it does not exist in the Unicode Standard.

When a regular expression engine from one of Parrot's target languages wishes to match a grapheme, then NFC is clearly not normalized enough. This is why we have defined a further normalization stage, NFG - Normalization Form for Graphemes.

NFG uses out-of-band signalling in the string to refer the conforming implementation to a decomposition table. UCS-4 specifies an encoding for Unicode codepoints from 0 to 0x7FFFFFFF. In other words, any 32-bit value with the most significant bit set (0x80000000 and above) is undefined. We define these out-of-band codepoints as indexes into a lookup table, which maps between a temporary ID and its associated decomposition.

In practice, this goes as follows: Assuming our Russified Serbo-Croatian string is the first string that Parrot sees, when it is converted to Parrot's default format, it would be normalized to a single character having the codepoint 0x80000000. At the same time, Parrot would insert an entry into a temporary array at array index 0, consisting of the codepoint sequence 0x00000438 0x0000030F - that is, the Unicode decomposition of the grapheme.
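One way to picture the lookup table is the sketch below. It assumes a small fixed-size table and a linear search, both chosen purely to keep the example short. The names nfg_table, nfg_intern and nfg_lookup and the size limits are hypothetical and do not come from Parrot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NFG_FLAG        0x80000000u  /* most significant bit marks a table index */
#define NFG_MAX_ENTRIES 16           /* arbitrary limit for this sketch */
#define NFG_MAX_SEQ     4            /* arbitrary limit for this sketch */

typedef struct {
    uint32_t seq[NFG_MAX_SEQ];  /* the Unicode decomposition */
    size_t   len;               /* number of codepoints in seq */
} nfg_entry;

typedef struct {
    nfg_entry entries[NFG_MAX_ENTRIES];
    size_t    count;
} nfg_table;

/* True if c is an out-of-band (table-index) codepoint. */
int nfg_is_synthetic(uint32_t c)
{
    return (c & NFG_FLAG) != 0;
}

/* Return the out-of-band codepoint for a decomposition, adding a table
 * entry if the sequence has not been seen before. Returns 0 on failure. */
uint32_t nfg_intern(nfg_table *t, const uint32_t *seq, size_t len)
{
    size_t i;

    assert(t != NULL && seq != NULL);
    if (len == 0 || len > NFG_MAX_SEQ)
        return 0;
    for (i = 0; i < t->count; i++) {
        if (t->entries[i].len == len &&
            memcmp(t->entries[i].seq, seq, len * sizeof *seq) == 0)
            return NFG_FLAG | (uint32_t)i;
    }
    if (t->count == NFG_MAX_ENTRIES)
        return 0;
    memcpy(t->entries[t->count].seq, seq, len * sizeof *seq);
    t->entries[t->count].len = len;
    return NFG_FLAG | (uint32_t)t->count++;
}

/* Return the decomposition for an out-of-band codepoint, or NULL if c is
 * an ordinary codepoint or an index the table does not contain. */
const nfg_entry *nfg_lookup(const nfg_table *t, uint32_t c)
{
    size_t i = (size_t)(c & ~NFG_FLAG);

    assert(t != NULL);
    if (!nfg_is_synthetic(c) || i >= t->count)
        return NULL;
    return &t->entries[i];
}
```

With a zero-initialized table, interning the sequence 0x438 0x30F yields 0x80000000, and interning it a second time returns the same value rather than adding a new entry. Code that does not care about graphemes never needs to call nfg_lookup at all.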

This has one big advantage: applications which don't care about graphemes can just pass the codepoint around as if it's any other number - uh, character. Only applications which care about the specific properties of Unicode characters need to pay the overhead of peeking inside the array and reading the decomposition.

Individual languages may need to think carefully about their concept of, for instance, "the length of a string" to determine whether or not they need to visit the lookup table for these strings. At any rate, Parrot should provide both grapheme-aware and codepoint-aware iterators for string traversal.
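To make the two notions of length concrete, here is a hedged sketch. It assumes the string is already in NFG, and it represents the lookup table only by an array of decomposition lengths (decomp_len), which is a simplification invented for this example. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NFG_FLAG 0x80000000u  /* most significant bit marks a table index */

/* Grapheme-aware length: in NFG every array element is one grapheme,
 * so the answer is the array length and the table is never consulted. */
size_t grapheme_length(const uint32_t *s, size_t len)
{
    (void)s;
    return len;
}

/* Codepoint-aware length: ordinary codepoints count as one, while each
 * out-of-band codepoint counts as the length of its decomposition.
 * decomp_len[i] is the number of codepoints stored at table index i. */
size_t codepoint_length(const uint32_t *s, size_t len,
                        const size_t *decomp_len, size_t table_size)
{
    size_t i, total = 0;

    assert(s != NULL);
    for (i = 0; i < len; i++) {
        if (s[i] & NFG_FLAG) {
            size_t idx = (size_t)(s[i] & ~NFG_FLAG);

            assert(decomp_len != NULL && idx < table_size);
            total += decomp_len[idx];
        } else {
            total += 1;
        }
    }
    return total;
}
```

For the NFG string 0x68 0x80000000 0x209, where table index 0 holds a two-codepoint decomposition, the grapheme length is 3 and the codepoint length is 4. Only the second function has to visit the table.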

IMPLEMENTATION

Changes required to current string implementation

String access API

Normalization form

String encoding API

String programming checklist

REFERENCES

http://plan9.bell-labs.com/sys/doc/utf.html - Plan 9's Runes are not dissimilar to Parrot's integer codepoints, and this is a good introduction to the Unicode world.

http://www.unicode.org/reports/tr15/ - The Unicode Consortium's explanation of different normalization forms.

"Unicode: A Primer", Tony Graham - Arguably the most readable book on how Unicode works.