TITLE

Parrot Strings

The Parrot String API

This document describes how Parrot abstracts the programmer's interface to string types. All strings used in the Parrot core should use the Parrot STRING structure; Parrot programmers should not deal with char * or other string-like types outside of this abstraction without very good reason.

Interface functions on STRINGs

In fact, programmers should hardly ever even access members of the STRING structure directly. The reason for this is that the interpretation of the data inside the structure will be a function of the data's encoding. The idea is that Parrot's strings are encoding-aware so your functions don't need to be; if you break the abstraction, you suddenly have to start worrying about what the data actually means.

String Constructors

The most basic way of creating a string is through the function string_make:

STRING* string_make(struct Parrot_Interp *, const void *buffer, INTVAL buflen, INTVAL encoding, INTVAL flags, INTVAL type)

Here you pass a pointer to a buffer in a given encoding, the number of bytes in that buffer to examine, the encoding (see below for the enum which defines the different encodings), and the initial values of the flags and type fields. The last two should usually be zero. In return, you'll get a brand new Parrot string. This string will have its own private copy of the buffer, so you don't need to keep the original around.

  • Hint: Nothing stops you doing

    string_make(interpreter, NULL, 0, ...

If you already have a string, you can make a copy of it by calling

STRING* string_copy(struct Parrot_Interp *, STRING* s)

This is itself implemented in terms of string_make.
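To make the "private copy" point concrete, here is a minimal sketch of a constructor and a copy built on top of it. Everything named demo_* is hypothetical and is not the Parrot API: the structure is a cut-down stand-in for STRING, the interpreter, encoding, flags and type arguments are left out, and the sketch assumes that a NULL buffer with a zero length yields an empty string.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the STRING structure. */
typedef struct demo_string {
    void  *bufstart;  /* private copy of the caller's bytes */
    size_t buflen;    /* bytes allocated */
    size_t bufused;   /* bytes actually in use */
} demo_string;

/* Copy buflen bytes out of buffer into a freshly allocated string.
 * Assumption: a NULL buffer with a zero length gives an empty string. */
demo_string *demo_make(const void *buffer, size_t buflen)
{
    demo_string *s;

    assert(buffer != NULL || buflen == 0);
    s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    /* Allocate at least one byte so bufstart is never NULL. */
    s->buflen   = buflen ? buflen : 1;
    s->bufstart = malloc(s->buflen);
    if (s->bufstart == NULL) {
        free(s);
        return NULL;
    }
    if (buflen)
        memcpy(s->bufstart, buffer, buflen);
    s->bufused = buflen;
    return s;
}

/* Copying is just "make" applied to the source string's own buffer. */
demo_string *demo_copy(const demo_string *src)
{
    assert(src != NULL);
    return demo_make(src->bufstart, src->bufused);
}

void demo_free(demo_string *s)
{
    if (s != NULL) {
        free(s->bufstart);
        free(s);
    }
}
```

Because the bytes are copied, the caller can scribble on or free the original buffer straight after the call without affecting the new string.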

String Manipulation Functions

Unless otherwise stated, all lengths, offsets, and so on, are given in characters; you are not allowed to care about the byte representation of a string, so it doesn't make sense to give the values in bytes.

To find out the length of a string, use

INTVAL string_length(const STRING *s)

You may explicitly use s->strlen for this since it is such a useful operation.

To concatenate two strings - that is, to add the contents of string b to the end of string a, use:

STRING* string_concat(struct Parrot_Interp *, STRING* a, STRING *b, INTVAL flag)

a is updated, and is also returned as a convenience. If the flag is set to a non-zero value, then b will be transcoded to a's encoding before concatenation if the strings are of different encodings. You almost certainly don't want to stick, say, a UTF-32 string on the end of a Big-5 string.

To repeat a string (i.e., turn 'xyz' into 'xyzxyzxyz'), use:

STRING* string_repeat(struct Parrot_Interp *, const STRING* s, UINTVAL n, STRING** d)

This will repeat string s n times and store the result into *d, which it also returns. If d or *d is NULL, a new string will be allocated to hold the result. s is not modified by this operation. If *d is not of the same type as s, it will be upgraded appropriately.
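The destination-pointer convention is easier to see in code. The following is only a sketch under stated assumptions: demo_string and demo_repeat are hypothetical names, the model uses one byte per character, and the "upgrade" of a destination of a different type is not modelled at all.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical one-byte-per-character stand-in for STRING. */
typedef struct demo_string {
    char  *buf;
    size_t len;
} demo_string;

/* Repeat s n times. If d or *d is NULL a new string is allocated,
 * otherwise the string *d points at is reused. Returns the result,
 * or NULL if memory runs out. s itself is never modified. */
demo_string *demo_repeat(const demo_string *s, size_t n, demo_string **d)
{
    demo_string *out;
    char *buf;
    size_t total, i;

    assert(s != NULL);
    total = s->len * n;
    buf = malloc(total ? total : 1);
    if (buf == NULL)
        return NULL;
    for (i = 0; i < n && s->len > 0; i++)
        memcpy(buf + i * s->len, s->buf, s->len);

    if (d != NULL && *d != NULL) {
        out = *d;               /* reuse the caller's string */
        free(out->buf);
    } else {
        out = malloc(sizeof *out);
        if (out == NULL) {
            free(buf);
            return NULL;
        }
        if (d != NULL)
            *d = out;           /* hand the new string back through d */
    }
    out->buf = buf;
    out->len = total;
    return out;
}
```

Reusing *d saves allocating a new string structure on every call, which is the same motivation the text gives for the dest argument of string_substr.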

Chopping n characters off the end of a string is achieved with the unlikely-sounding

STRING* string_chopn(STRING* s, INTVAL n)

To retrieve a substring of the string, call

STRING* string_substr(struct Parrot_Interp *, STRING* src, INTVAL offset, INTVAL length, STRING** dest)

The result will be placed in dest. (Passing in dest avoids allocating a new string at runtime. If *dest is a null pointer, a new string structure is created with the same encoding as src.)

To retrieve a single character of the string, call

INTVAL string_ord(const STRING* s, INTVAL n)

The result will be returned from the function. It checks for the existence of s, and tests for n being out of range. Currently it applies the method that Perl uses on arrays to handle negative indices: negative values count backwards from the end of the string. For example, index -1 is the last character in the string, -2 is the next-to-last, and so on.

If s is null or s is zero-length, it throws an exception. If n is out of range, it also throws an exception.
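The index handling described above can be sketched as follows. This is an illustration, not the real string_ord: the demo_* names are hypothetical, the model uses one byte per character, and a return value of -1 stands in for the exception that the real function throws.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: map a possibly negative index onto 0..len-1.
 * Returns -1 where the text says an exception is thrown. */
long demo_normalize_index(long n, long len)
{
    assert(len >= 0);
    if (len == 0)
        return -1;          /* empty string: nothing to index */
    if (n < 0)
        n += len;           /* -1 becomes len-1, -2 becomes len-2, ... */
    if (n < 0 || n >= len)
        return -1;          /* still out of range */
    return n;
}

/* One-byte-per-character model of string_ord; -1 signals an error. */
long demo_ord(const char *s, long len, long n)
{
    long i;

    if (s == NULL)
        return -1;
    i = demo_normalize_index(n, len);
    return i < 0 ? -1 : (long)(unsigned char)s[i];
}
```

Note that an index of -len is still valid (it names the first character), while -len-1 is out of range.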

To compare two strings, use:

INTVAL string_compare(struct Parrot_Interp *, STRING* s1, STRING* s2)

The value returned will be less than, equal to, or greater than zero depending on whether s1 is less than, equal to, or greater than s2.

Strings whose encodings are not the same can be compared - in this case a UTF-32 copy will be made of each string and these copies will be compared.

To test a string for truth, use:

BOOLVAL string_bool(STRING* s);

A string is false if it

o  is not yet allocated
o  has zero length
o  consists of one digit character whose numeric value (as
   decided by its character type) is zero.

Otherwise the string will be true.
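The three rules translate almost directly into code. The sketch below is a model only: demo_string and demo_bool are hypothetical, and the "digit whose numeric value is zero" test is reduced to a comparison with the ASCII character '0', where the real function would ask the string's character type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical one-byte-per-character stand-in for STRING. */
typedef struct demo_string {
    const char *buf;
    size_t      len;   /* length in characters */
} demo_string;

/* Returns 0 for false, 1 for true, following the three rules above. */
int demo_bool(const demo_string *s)
{
    if (s == NULL)
        return 0;                       /* not yet allocated */
    if (s->len == 0)
        return 0;                       /* zero length */
    assert(s->buf != NULL);
    if (s->len == 1 && s->buf[0] == '0')
        return 0;                       /* a single zero digit */
    return 1;
}
```

Under these rules "00" is true, since it consists of two characters rather than one.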

To format output into a string, use

STRING* string_nprintf(struct Parrot_Interp *, STRING* dest, INTVAL len, char* format, ...)

dest may be a null pointer, in which case a new native string will be created. If len is zero, the behaviour becomes more sprintfish than snprintf-like.
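One way to read the len argument is sketched below, using the standard C vsnprintf. This is an interpretation, not the real implementation: demo_nprintf is a hypothetical name, it returns a plain C buffer rather than a STRING, it always allocates (as if dest were a null pointer), and it assumes that len is an upper limit on the number of characters kept, with zero meaning "no limit".

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Format into a freshly allocated buffer. A len of zero means "no
 * limit"; otherwise at most len characters of output are kept.
 * Returns NULL on failure. The caller frees the result. */
char *demo_nprintf(size_t len, const char *format, ...)
{
    va_list ap;
    int needed;
    size_t size;
    char *out;

    assert(format != NULL);

    /* First pass: measure the full output without writing anything. */
    va_start(ap, format);
    needed = vsnprintf(NULL, 0, format, ap);
    va_end(ap);
    if (needed < 0)
        return NULL;

    size = (size_t)needed;
    if (len != 0 && size > len)
        size = len;                 /* snprintf-like truncation */

    out = malloc(size + 1);         /* room for the C terminator */
    if (out == NULL)
        return NULL;

    /* Second pass: actually format, cutting off after size chars. */
    va_start(ap, format);
    vsnprintf(out, size + 1, format, ap);
    va_end(ap);
    return out;
}
```

Calling demo_nprintf(0, "%s-%d", "parrot", 42) keeps the whole result, while a len of 4 keeps only the first four characters.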

Notes for Implementors

Termination

The character buffer pointed to by bufstart is not expected to be terminated by a nul byte, and functions which provide the string API will not add one. Any functions which access the buffer directly and which require a terminating nul byte must place one there themselves, and must also be very careful about nul bytes within the used portion of the character buffer. In particular, if bufused == buflen, more space must be allocated to hold the terminating byte.
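A cautious way to get a nul-terminated view of a buffer is to copy it, as in the sketch below. The names demo_string and demo_to_cstring are hypothetical, and refusing strings with embedded nul bytes is just one possible policy, chosen here because a C string cannot represent them.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, cut-down stand-in for the STRING structure. */
typedef struct demo_string {
    char  *bufstart;
    size_t buflen;    /* bytes allocated */
    size_t bufused;   /* bytes in use */
} demo_string;

/* Return a freshly allocated, nul-terminated copy of the used part
 * of the buffer. Returns NULL if memory runs out or if the data
 * already contains a nul byte. The caller frees the result. */
char *demo_to_cstring(const demo_string *s)
{
    char *out;

    assert(s != NULL && s->bufused <= s->buflen);
    if (s->bufused > 0 && memchr(s->bufstart, '\0', s->bufused) != NULL)
        return NULL;                    /* embedded nul: refuse */

    /* One extra byte, so this works even when bufused == buflen. */
    out = malloc(s->bufused + 1);
    if (out == NULL)
        return NULL;
    if (s->bufused > 0)
        memcpy(out, s->bufstart, s->bufused);
    out[s->bufused] = '\0';
    return out;
}
```

Copying sidesteps the bufused == buflen problem entirely, at the cost of an allocation.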

Elements of the STRING structure

Those implementing the STRING API will obviously need to know about how the STRING structure works. You can find the definition of this structure in string.h:

struct parrot_string_t {
    void *bufstart;
    UINTVAL buflen;
    UINTVAL flags;
    UINTVAL bufused;
    void *strstart;
    UINTVAL strlen;
    const ENCODING *encoding;
    const CHARTYPE *type;
    INTVAL language;
};

Let's look at each element of this structure in turn.

bufstart

This pointer points to the buffer which holds the string, encoded in whatever is the string's specified encoding. Because of this, you should not make any assumptions about what's in the buffer, and hence you shouldn't try and access it directly.

buflen

This is used for memory allocation; it tells you the currently allocated size of the buffer in bytes.

flags

This is a general holding area for string flags. The exact flags required have not yet been determined.

bufused

bufused, on the other hand, contains the number of bytes of the allocated buffer which are actually in use. This, together with buflen, is used by the buffer-growing algorithm to determine when and by how much to grow the buffer.

strstart

This stores the actual start of the string. In the case of COW (copy-on-write) strings holding references to portions of a larger string (for example, in regex match variables), this points into the larger string's buffer, at the place where this string's own data begins.

strlen

This is the length of the string in characters, as you would expect to find from length $string in Perl. Again, because string buffers may be in one of a number of encodings, this must be computed by the appropriate encoding function. string_compute_strlen(STRING) updates this value, calling the compute_strlen function in the STRING's vtable.

encoding

This is a vtable of functions; the vtable should normally be taken from the array Parrot_string_vtable. Entries in this array specify the encoding of the string, from the following enum:

enum {
    enc_native,
    enc_utf8,
    enc_utf16,
    enc_utf32,
    enc_foreign,
    enc_max
};

The "native" string type is whatever happens when you set LANG=C in your shell; on English-speaking machines it's usually ISO-8859-1. A character equals a byte equals eight bits. No shifts, no wide characters, nothing.

UTF8, UTF16, and UTF32 are what they sound like. UTF16 and UTF32 should use the native endianness of the machine.

enc_foreign is there to allow for expansion; foreign strings will call functions from a user-defined string vtable instead of the Parrot built-in ones.

enc_max isn't an encoding. These aren't the droids you're looking for. It's just there to help know how big to make arrays.

type

XXX I don't know what this is for.

language

This field is currently unused; however, it can be used to hold a pointer to the correct vtable for foreign strings.

String Vtable Functions

The "String Manipulation Functions" above are implemented in terms of string vtables to create encoding abstraction; here's an example of one:

STRING*
string_concat(struct Parrot_Interp *interpreter, STRING* a, STRING* b, INTVAL flags) {
    return (ENC_VTABLE(a)->concat)(a, b, flags);
}

ENC_VTABLE(a) is shorthand for:

a->encoding

Vtables are taken from the Parrot_string_vtable array, defined in string.c. Each encoding has its own vtable; to call the concatenation function for a, we look up its vtable and retrieve the concat entry from that vtable. This produces a function pointer we can throw the arguments at.

To get the actual position in the array from the vtable, use the which entry, which returns an INTVAL index into Parrot_string_vtable.

Most of the string vtable functions are self-explanatory as they are thin wrappers around the functions given above. Some of them, however, are for internal use only, to help implement other functions. You'll find them under "Non-user-visible String Manipulation Functions" below.

How to add new vtable functions

The first thing to note is that if what you're doing isn't remotely encoding-specific, you don't need to add a vtable function; you can just add a function in string.c (don't forget to add the function prototype to string.h) and you don't need any more of this section. However, most things that people do with strings depend on the encoding of the string data, so if you need to add anything slightly complex, read on.

Currently, the construction of the vtables is not automated; it's hoped that soon someone will automate this and fix this section. However, for the time being, this is what you need to do when you implement a new vtable function:

  1. Check to see whether or not the function's type has a typedef in string.h: for instance, if you have a function that takes a string and an INTVAL and returns a string, use string_iv_to_string_t; otherwise, add your own type.

  2. Add the unqualified name of the function (frobnicate), together with your type, to string_vtable in string.h.

  3. Create a function string_frobnicate in string.c which is a wrapper around frobnicate. This function must take a STRING* parameter, so that the encoding can be extracted, the relevant encoding vtable found, and the call dispatched. It should look something like this:

    yadda
    string_frobnicate(STRING *s, ...) {
        return (ENC_VTABLE(s)->frobnicate)(s, ...);
    }

  4. Create functions string_XXX_frobnicate for all values of XXX in the encoding table (or better still, get other people to write them for you). string_native_frobnicate should go in strnative.c, string_utf8_frobnicate should go in strutf8.c, and so on.

  5. Add string_XXX_frobnicate to the end of each vtable returned by string_XXX_vtable.
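The five steps can be sketched end to end in one file. This is a toy under stated assumptions, not the real layout of string.h and string.c: every demo_* name is hypothetical, only two encodings are modelled, and "frobnicate" is arbitrarily taken to mean "count the characters of a fixed-width string".

```c
#include <assert.h>
#include <stddef.h>

typedef struct demo_string demo_string;

/* Step 1: a typedef for the new function's type. */
typedef size_t (*demo_string_to_size_t)(const demo_string *);

/* Step 2: add the unqualified name, with that type, to the vtable. */
typedef struct demo_vtable {
    int                   which;       /* index into demo_vtables */
    demo_string_to_size_t frobnicate;  /* the new entry */
} demo_vtable;

struct demo_string {
    const unsigned char *bufstart;
    size_t               bufused;      /* bytes in use */
    const demo_vtable   *encoding;
};

enum { demo_enc_native, demo_enc_utf32, demo_enc_max };

/* Step 4: one implementation per encoding. */
static size_t demo_native_frobnicate(const demo_string *s)
{
    return s->bufused;          /* one byte per character */
}

static size_t demo_utf32_frobnicate(const demo_string *s)
{
    return s->bufused / 4;      /* four bytes per character */
}

/* Step 5: add each implementation to the end of its vtable. */
static const demo_vtable demo_vtables[demo_enc_max] = {
    { demo_enc_native, demo_native_frobnicate },
    { demo_enc_utf32,  demo_utf32_frobnicate  }
};

/* Step 3: the wrapper, which dispatches on the string's encoding. */
size_t demo_string_frobnicate(const demo_string *s)
{
    assert(s != NULL && s->encoding != NULL);
    return (s->encoding->frobnicate)(s);
}
```

The which member plays the role described earlier: given only a string, it recovers the position of that string's vtable in the array.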

Non-user-visible String Manipulation Functions

If you've read this far, I hope you're a Parrot implementor. If you're not helping construct the Parrot core itself, you probably want to look away now.

The first two functions to note are

INTVAL string_compute_strlen(STRING* s)

and

INTVAL string_max_bytes(STRING *s, INTVAL iv)

The first updates the contents of s->strlen by contemplating the buffer bufstart and working out how many characters it contains. The second is given a number of characters which we assume are going to be added into the string at some point; it returns the maximum number of bytes that need to be allocated to admit that number of characters. For fixed-width encodings, this is trivial - the "native" encoding, for instance, uses one byte per character, so string_native_max_bytes simply returns the INTVAL it is passed; string_utf8_max_bytes, on the other hand, returns three times the value that it is passed, on the assumption that a UTF8 character occupies at most three bytes. (That holds for characters in the Basic Multilingual Plane; characters beyond it need four bytes in UTF8.)
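For a variable-width encoding, both jobs look roughly like this. The functions below are hypothetical stand-ins operating on raw byte buffers rather than STRINGs, the counting routine assumes well-formed UTF8, and the factor of three is simply taken from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Worst-case bytes per character, as assumed by the text above.
 * Characters outside the Basic Multilingual Plane would need 4. */
#define DEMO_UTF8_MAX_WIDTH 3

/* Count the characters in a well-formed UTF8 buffer by counting
 * every byte that is not a continuation byte (10xxxxxx). */
size_t demo_utf8_compute_strlen(const unsigned char *buf, size_t bufused)
{
    size_t i, chars = 0;

    assert(buf != NULL || bufused == 0);
    for (i = 0; i < bufused; i++)
        if ((buf[i] & 0xC0) != 0x80)
            chars++;
    return chars;
}

/* Fixed width: one byte per character, so bytes == characters. */
size_t demo_native_max_bytes(size_t nchars)
{
    return nchars;
}

/* Variable width: reserve the worst case for every character. */
size_t demo_utf8_max_bytes(size_t nchars)
{
    return nchars * DEMO_UTF8_MAX_WIDTH;
}
```

For example, the six bytes 61 C3 A9 E2 82 AC hold three characters ("a", U+00E9 and U+20AC), and demo_utf8_max_bytes(3) reserves nine bytes, which is more than enough for them.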

To grow a string to a specified size, use

void string_grow(struct Parrot_Interp *, STRING *s, INTVAL newsize)

The size is given in characters; string_max_bytes is called to turn this into a size in bytes, and then the buffer is grown to accommodate (at least) that many bytes.
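Put together, growing might look like the sketch below. All names are hypothetical, the max_bytes function is stored directly in the structure instead of in a vtable, and the sketch assumes that growing never shrinks a buffer that is already big enough.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-in for the STRING structure. */
typedef struct demo_string {
    void  *bufstart;
    size_t buflen;                        /* bytes allocated */
    size_t bufused;                       /* bytes in use */
    size_t (*max_bytes)(size_t nchars);   /* stands in for the vtable */
} demo_string;

size_t demo_native_max_bytes(size_t nchars)
{
    return nchars;                        /* one byte per character */
}

/* Make room for newsize characters. Returns 0 on success, or -1 if
 * allocation fails, in which case the string is left unchanged. */
int demo_grow(demo_string *s, size_t newsize)
{
    size_t need;
    void *p;

    assert(s != NULL && s->max_bytes != NULL);
    need = s->max_bytes(newsize);         /* characters to bytes */
    if (need <= s->buflen)
        return 0;                         /* already big enough */
    p = realloc(s->bufstart, need);       /* keeps the old contents */
    if (p == NULL)
        return -1;
    s->bufstart = p;
    s->buflen   = need;                   /* bufused is untouched */
    return 0;
}
```

Only buflen changes; bufused still records how much of the buffer holds real data.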

Transcoding

The fact that Parrot strings are encoding-abstracted really has to bottom out at some point, and it's usually when two strings of different encodings interact. When we try to append one type of string to another, we have the option of turning the second string into a string that matches the first string's encoding. This process, translating a string from one encoding into another, is called "transcoding".

In Parrot, transcoding is implemented by the two-dimensional array

Parrot_transcode_table[enc_from][enc_to]

Each entry in this table is a function pointer which takes two parameters:

string_utf32_to_utf8(STRING* from, STRING* to)

(If to is a null pointer, a new STRING* will be allocated. As before, it's all about avoiding memory allocation at runtime.)

A null pointer in the table should signify that no transcoding is necessary; Parrot_transcode_table[x][x] should always be NULL.

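Note that Parrot_transcode_table[enc_native][enc_utf8] is not NULL. Don't be tempted to treat native strings as if they were already UTF8: "native" doesn't necessarily mean ISO-8859-1, so the bytes have to go through a real conversion.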
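A lookup through such a table can be sketched as follows. This is a toy: the demo_* names are hypothetical, the "string" carries nothing but an encoding tag, and only the native-to-UTF8 slot is filled in to keep the example short. Every other slot is NULL, which this model treats as "nothing to do"; a real table would fill in every pair that needs work.

```c
#include <assert.h>
#include <stdlib.h>

enum { enc_native, enc_utf8, enc_utf16, enc_utf32, enc_foreign, enc_max };

/* Hypothetical stand-in for STRING: just an encoding tag. */
typedef struct demo_string {
    int encoding;
} demo_string;

typedef demo_string *(*demo_transcoder)(demo_string *from, demo_string *to);

/* Toy converter: a real one would rewrite the buffer as well. */
static demo_string *demo_native_to_utf8(demo_string *from, demo_string *to)
{
    (void)from;
    if (to == NULL) {                   /* no destination: allocate */
        to = malloc(sizeof *to);
        if (to == NULL)
            return NULL;
    }
    to->encoding = enc_utf8;
    return to;
}

/* enc_max sizes both dimensions. Unlisted entries, including the
 * whole diagonal, are NULL. */
static demo_transcoder demo_transcode_table[enc_max][enc_max] = {
    [enc_native] = { [enc_utf8] = demo_native_to_utf8 }
};

/* Convert src to enc_to, reusing dest if it is not NULL. */
demo_string *demo_transcode(demo_string *src, int enc_to, demo_string *dest)
{
    demo_transcoder fn;

    assert(src != NULL);
    assert(src->encoding >= 0 && src->encoding < enc_max);
    assert(enc_to >= 0 && enc_to < enc_max);
    fn = demo_transcode_table[src->encoding][enc_to];
    if (fn == NULL)
        return src;                     /* no transcoding necessary */
    return fn(src, dest);
}
```

This is also where enc_max earns its keep: it is never used as an encoding, only as the size of each dimension of the table.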

Foreign Encodings

Fill this in later; if anyone wants to implement new encodings at this stage they must be mad.

Work In Progress

The transcoding section is out of sync with the code.

Should the following functions be mentioned? string_append, string_from_cstring, string_from_int, string_from_num, string_index, string_replace, string_set, string_str_index, string_to_cstring, string_to_int, string_to_num, string_transcode.

string_bool is here said to return BOOLVAL, but the code (as of December 2002) returns INTVAL. Which is the right thing?