NAME

docs/strings.pod - Parrot Strings

ABSTRACT

This document describes how Parrot abstracts the programmer's interface to string types.

OVERVIEW

For various reasons, some of which relate to the sequence-of-integer abstraction, and some of which relate to "infinite" strings and arrays, Parrot Strings are represented by a list of chunks, where each chunk is a sequence of integers of the same size or representation, but different chunks can have different integer sizes or representations. The Parrot String API hides this from any module that wishes to work at the abstract string level. In particular, it must hide this from the regex engine, which works on pure sequences in the abstract.

So Parrot Strings are a wizzy internationalized equivalent of the old standard C library's string.h functions.

The Parrot String API

All strings used in the Parrot core should use the Parrot STRING structure; Parrot programmers should not deal with char * or other string-like types outside of this abstraction without very good reason.

Interface functions on STRINGs

In fact, programmers should hardly ever even access members of the STRING structure directly. The reason for this is that the interpretation of the data inside the structure will be a function of the data's encoding. The idea is that Parrot's strings are encoding-aware so your functions don't need to be; if you break the abstraction, you suddenly have to start worrying about what the data actually means.

String Constructors

The most basic way of creating a string is through the function string_make:

STRING* string_make(Interp *, const void *buffer, INTVAL buflen, INTVAL encoding, INTVAL flags, INTVAL type)

Pass a pointer to a buffer in a given encoding, the number of bytes in that buffer to examine, the encoding (see below for the enum which defines the different encodings), and the initial values of the flags and type fields, which should usually be zero. In return, you'll get a brand new Parrot string. This string has its own private copy of the buffer, so you don't need to keep the original.

  • Hint: Nothing stops you doing

    string_make(interpreter, NULL, 0, ...

If you already have a string, you can make a copy of it by calling

STRING* string_copy(Interp *, STRING* s)

This is itself implemented in terms of string_make.
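The ownership rule described above (a new string keeps its own private copy of the caller's buffer) can be illustrated with a small standalone sketch. It does not use the Parrot headers: sketch_string, sketch_make, sketch_copy and sketch_free are hypothetical stand-ins, and the encoding, flags and type arguments are left out.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for STRING; not the real Parrot structure. */
typedef struct {
    char  *buf;     /* private copy of the caller's bytes */
    size_t buflen;  /* number of bytes held */
} sketch_string;

/* Copy buflen bytes from buffer into a freshly allocated string.
 * A NULL buffer with buflen 0 yields an empty string. */
sketch_string *sketch_make(const void *buffer, size_t buflen)
{
    sketch_string *s;
    assert(buffer != NULL || buflen == 0);
    s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->buf = malloc(buflen ? buflen : 1);
    if (s->buf == NULL) {
        free(s);
        return NULL;
    }
    if (buflen)
        memcpy(s->buf, buffer, buflen);
    s->buflen = buflen;
    return s;
}

/* Copying is just making a new string from the old one's bytes. */
sketch_string *sketch_copy(const sketch_string *s)
{
    assert(s != NULL);
    return sketch_make(s->buf, s->buflen);
}

/* Release a string created by sketch_make or sketch_copy. */
void sketch_free(sketch_string *s)
{
    if (s != NULL) {
        free(s->buf);
        free(s);
    }
}
```

Because the bytes are copied, the caller may overwrite or free the original buffer immediately after the call.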

String Manipulation Functions

Unless otherwise stated, all lengths, offsets, and so on, are given in characters; you are not allowed to care about the byte representation of a string, so it doesn't make sense to give the values in bytes.

To find out the length of a string, use

INTVAL string_length(const STRING *s)

You may explicitly use s->strlen for this since it is such a useful operation.

To concatenate two strings (that is, to add the contents of string b to the end of string a), use:

STRING* string_concat(Interp *, STRING* a, STRING *b, INTVAL flag)

a is updated, and is also returned as a convenience. If the flag is set to a non-zero value, then b will be transcoded to a's encoding before concatenation if the strings are of different encodings. You almost certainly don't want to stick, say, a UTF-32 string on the end of a Big-5 string.

To repeat a string (i.e., turn 'xyz' into 'xyzxyzxyz'), use:

STRING* string_repeat(Interp *, const STRING* s, UINTVAL n, STRING** d)

This repeats string s n times and stores the result in *d, which it also returns. If d or *d is NULL, a new string will be allocated to hold the result. s is not modified by this operation. If *d is not of the same type as s, it will be upgraded appropriately.
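As a rough illustration of the repeat operation on raw bytes, here is a standalone sketch. sketch_repeat is a hypothetical helper, not part of the Parrot API; it always allocates its result and ignores encodings and the destination parameter.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated buffer holding n copies of the len bytes at s.
 * The total size is written to *out_len. The caller frees the result.
 * Returns NULL on allocation failure or if len * n would overflow. */
char *sketch_repeat(const char *s, size_t len, size_t n, size_t *out_len)
{
    char *d;
    size_t i;
    assert(s != NULL && out_len != NULL);
    if (len != 0 && n > (size_t)-1 / len)
        return NULL;                    /* len * n would overflow */
    d = malloc(len * n ? len * n : 1);
    if (d == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        memcpy(d + i * len, s, len);    /* s itself is never modified */
    *out_len = len * n;
    return d;
}
```

Calling sketch_repeat("xyz", 3, 3, &n) yields the nine bytes of 'xyzxyzxyz' and sets n to 9.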

Chopping n characters off the end of a string is achieved with the unlikely-sounding

STRING* string_chopn(STRING* s, INTVAL n)

To retrieve a substring of the string, call

STRING* string_substr(Interp *, STRING* src, INTVAL offset, INTVAL length, STRING** dest)

The result will be placed in dest. (Passing in dest avoids allocating a new string at runtime. If *dest is a null pointer, a new string structure is created with the same encoding as src.)

To retrieve a single character of the string, call

INTVAL string_ord(const STRING* s, INTVAL n)

The result will be returned from the function. It checks for the existence of s, and tests for n being out of range. Currently it applies the method that Perl uses on arrays to handle negative indices: negative values count backwards from the end of the string. For example, index -1 is the last character in the string, -2 is the next-to-last, and so on.

If s is null or s is zero-length, it throws an exception. If n is out of range, it also throws an exception.
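The index handling described above can be sketched in a few lines. sketch_normalize_index is a hypothetical helper, and returning -1 stands in for the exception that string_ord throws; the real function's internals may differ.

```c
#include <assert.h>

/* Map a possibly negative index n onto the range 0..len-1.
 * Returns the normalized index, or -1 when the string is empty
 * or n is out of range (where string_ord would throw). */
long sketch_normalize_index(long len, long n)
{
    assert(len >= 0);
    if (len == 0)
        return -1;
    if (n < 0)
        n += len;          /* -1 becomes len-1, -2 becomes len-2, ... */
    if (n < 0 || n >= len)
        return -1;
    return n;
}
```

For a five-character string, -1 maps to 4 and -5 maps to 0, while both 5 and -6 are out of range.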

To compare two strings, use:

INTVAL string_compare(Interp *, STRING* s1, STRING* s2)

The value returned will be less than, equal to, or greater than zero depending on whether s1 is less than, equal to, or greater than s2.

Strings whose encodings are not the same can be compared - in this case a UTF-32 copy will be made of each string and these copies will be compared.
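Once both strings are available as UTF-32 copies, the comparison reduces to comparing arrays of code points. The following sketch assumes plain code point order and assumes that a string which is a prefix of another sorts first; neither detail is stated above, and sketch_compare_utf32 is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Compare two code point arrays. Returns a value less than, equal to,
 * or greater than zero, in the manner of string_compare. */
int sketch_compare_utf32(const uint32_t *a, size_t alen,
                         const uint32_t *b, size_t blen)
{
    size_t i;
    size_t n = alen < blen ? alen : blen;
    assert((a != NULL || alen == 0) && (b != NULL || blen == 0));
    for (i = 0; i < n; i++) {
        if (a[i] != b[i])
            return a[i] < b[i] ? -1 : 1;   /* first difference decides */
    }
    if (alen == blen)
        return 0;
    return alen < blen ? -1 : 1;           /* the shorter prefix sorts first */
}
```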

To test a string for truth, use:

INTVAL string_bool(STRING* s);

A string is false if it

o  is not yet allocated
o  has zero length
o  consists of one digit character whose numeric value (as
   decided by its character type) is zero.

Otherwise the string will be true.
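The three rules can be sketched for the simple case of an ASCII byte buffer. sketch_is_true is a hypothetical helper; the real function consults the string's character type to decide what counts as a zero digit, whereas this sketch only recognizes the ASCII character '0'.

```c
#include <assert.h>
#include <stddef.h>

/* Truth test for an ASCII byte buffer, following the three rules above.
 * s may be NULL to represent a string that is not yet allocated. */
int sketch_is_true(const char *s, size_t len)
{
    assert(s != NULL || len == 0);
    if (s == NULL)
        return 0;                 /* not yet allocated */
    if (len == 0)
        return 0;                 /* zero length */
    if (len == 1 && s[0] == '0')
        return 0;                 /* a single digit whose value is zero */
    return 1;
}
```

Note that under these rules a string such as '00' is true, because it is more than one character long.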

To format output into a string, use

STRING* string_nprintf(Interp *, STRING* dest, INTVAL len, char* format, ...)

dest may be a null pointer, in which case a new string will be created. If len is zero, the function behaves more like sprintf than snprintf.

Notes for Implementors

Termination

The character buffer pointed to by *strstart is not expected to be terminated by a nul byte, and functions which provide the string API will not add one. Any functions which access the buffer directly and which require a terminating nul byte must place one there themselves and also be very careful about nul bytes within the used portion of the character buffer. In particular, if bufused == buflen, more space must be allocated to hold a terminating byte.
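One cautious way to obtain a nul-terminated view is to copy the used bytes into a separate buffer instead of writing into the string's own allocation. The sketch below takes that route; sketch_to_cstring is a hypothetical helper, and rejecting embedded nul bytes is one possible policy, not something the text above prescribes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated, nul-terminated copy of the first bufused bytes
 * of strstart, or NULL if allocation fails or the data contains an embedded
 * nul byte (which a C string cannot represent). The caller frees the result. */
char *sketch_to_cstring(const void *strstart, size_t bufused)
{
    char *out;
    assert(strstart != NULL || bufused == 0);
    if (bufused == (size_t)-1)
        return NULL;                    /* bufused + 1 would overflow */
    if (bufused != 0 && memchr(strstart, '\0', bufused) != NULL)
        return NULL;                    /* embedded nul byte */
    out = malloc(bufused + 1);          /* one extra byte for the terminator */
    if (out == NULL)
        return NULL;
    if (bufused != 0)
        memcpy(out, strstart, bufused);
    out[bufused] = '\0';
    return out;
}
```

Copying costs an allocation, but it sidesteps the bufused == buflen case entirely because the original buffer is never modified.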

Elements of the STRING structure

Those implementing the STRING API will obviously need to know about how the STRING structure works. You can find the definition of this structure in pobj.h:

struct parrot_string_t {
    pobj_t obj;
    UINTVAL bufused;
    void *strstart;
    UINTVAL strlen;
    const ENCODING *encoding;
    const CHARTYPE *type;
    INTVAL language;
};

Let's look at each element of this structure in turn.

obj.u.b.bufstart

This pointer points to the buffer which holds the string, encoded in whatever is the string's specified encoding. Because of this, you should not make any assumptions about what's in the buffer, and hence you shouldn't try to access it directly.

obj.u.b.buflen

This is used for memory allocation; it tells you the currently allocated size of the buffer in bytes.

obj.flags

This is a general holding area for string flags. The exact flags required have not yet been determined.

bufused

bufused, on the other hand, contains the number of bytes out of the allocated buffer which are actually in use. This, together with buflen, is used by the buffer-growing algorithm to determine when and by how much to grow the allocation buffer.

strstart

This stores the actual start of the string. In the case of COW strings holding references to portions of a larger string (for example, in regex match variables), this points into the larger string's buffer, at the position where this string starts.

strlen

This is the length of the string in characters, as you would expect to find from length $string in Perl. Again, because string buffers may be in one of a number of encodings, this must be computed by the appropriate encoding. string_compute_strlen(STRING) updates this value, calling the encoding's characters() function.

encoding

This specifies the encoding used to encode the characters in the data. There are currently four character encodings used in Parrot: singlebyte, UTF-8, UTF-16 and UTF-32. UTF-16 and UTF-32 should use the native endianness of the machine.

type

This specifies the character set for the string. There are currently two character sets in Parrot: US ASCII and Unicode. Each character set has a default encoding. The default character set is US ASCII.

language

This field is currently unused; however, it can be used to hold a pointer to the correct vtable for foreign strings.

Non-user-visible String Manipulation Functions

If you've read this far, I hope you're a Parrot implementor. If you're not helping construct the Parrot core itself, you probably want to look away now.

The first two functions to note are

INTVAL string_compute_strlen(STRING* s)

and

INTVAL string_max_bytes(STRING *s, INTVAL iv)

The first updates the contents of s->strlen by contemplating the buffer strstart and working out how many characters it contains. The second is given a number of characters which we assume are going to be added into the string at some point; it returns the maximum number of bytes that need to be allocated to admit that number of characters. For fixed-width encodings, this is trivial - the singlebyte encoding, for instance, encodes one byte per character, so string_max_bytes() simply returns the INTVAL it is passed; calling string_max_bytes() on a UTF-8 string, on the other hand, returns three times the value that it is passed because a UTF-8 character may occupy up to three bytes.
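Both ideas can be sketched without the Parrot headers. The first function counts characters in a UTF-8 buffer by counting every byte that is not a continuation byte, which is a common technique and not necessarily how the encoding's characters() function is written. The second takes the per-character worst case as a parameter. Both names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count characters in a well-formed UTF-8 buffer by counting every byte
 * that is not a continuation byte (continuation bytes look like 10xxxxxx). */
size_t sketch_utf8_characters(const unsigned char *buf, size_t nbytes)
{
    size_t i, count = 0;
    assert(buf != NULL || nbytes == 0);
    for (i = 0; i < nbytes; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Worst-case byte count for nchars characters, given the widest character
 * the encoding can produce. The text above gives the factor as 1 for the
 * singlebyte encoding and 3 for UTF-8. */
size_t sketch_max_bytes(size_t nchars, size_t max_bytes_per_char)
{
    assert(max_bytes_per_char > 0);
    return nchars * max_bytes_per_char;
}
```

For the six bytes 'a', 0xC3 0xA9, 0xE2 0x82 0xAC, the character count is 3 even though the byte count is 6, which is why s->strlen and bufused are tracked separately.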

To grow a string to a specified size, use

void string_grow(Interp *, STRING *s, INTVAL newsize)

The size is given in characters; string_max_bytes() is called to turn this into a size in bytes, and then the buffer is grown to accommodate (at least) that many bytes.
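The growth step might look roughly like the following standalone sketch. sketch_buffer and sketch_grow are hypothetical; the real function works on a STRING through Parrot's own memory management, whereas this sketch uses realloc and takes the per-character byte factor as a parameter.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the buffer-related fields of STRING. */
typedef struct {
    unsigned char *bufstart;  /* allocated buffer */
    size_t buflen;            /* allocated size in bytes */
    size_t bufused;           /* bytes currently in use */
} sketch_buffer;

/* Ensure the buffer can hold newsize characters of an encoding whose widest
 * character takes max_bytes_per_char bytes. Returns 0 on success and -1 on
 * failure, in which case the buffer is left unchanged. */
int sketch_grow(sketch_buffer *b, size_t newsize, size_t max_bytes_per_char)
{
    size_t needed;
    unsigned char *p;
    assert(b != NULL && max_bytes_per_char > 0);
    if (newsize > (size_t)-1 / max_bytes_per_char)
        return -1;                      /* size in bytes would overflow */
    needed = newsize * max_bytes_per_char;
    if (needed <= b->buflen)
        return 0;                       /* already large enough */
    p = realloc(b->bufstart, needed);
    if (p == NULL)
        return -1;
    b->bufstart = p;
    b->buflen = needed;                 /* bufused is unchanged */
    return 0;
}
```

Growing only changes the allocated size; the number of bytes in use stays the same until data is actually written.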

Transcoding

The fact that Parrot strings are encoding-abstracted really has to bottom out at some point, and it's usually when two strings of different encodings interact. When we try to append one type of string to another, we have the option of turning the latter string into a string that matches the first string's encoding. This process, translating a string from one encoding into another, is called "transcoding".

In Parrot, transcoding is implemented by Parrot_CharType_Transcode functions, which take two character sets (CHARTYPE) and a character (Parrot_UInt) and return the character converted from the first character set to the second.

Each CHARTYPE has a number of transcoders associated with it, of which those to and from Unicode are explicitly singled out because of their expected frequent use. The transcoders array is currently not used.
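To make the idea of transcoding concrete, here is a standalone sketch that converts a single-byte buffer to UTF-8 one character at a time. It assumes the single-byte data is Latin-1, which this document does not state, and sketch_latin1_to_utf8 is a hypothetical name unrelated to the Parrot_CharType_Transcode functions.

```c
#include <assert.h>
#include <stddef.h>

/* Transcode n Latin-1 bytes from src into UTF-8 at dst.
 * dst must have room for 2 * n bytes, the worst case.
 * Returns the number of bytes written. */
size_t sketch_latin1_to_utf8(const unsigned char *src, size_t n,
                             unsigned char *dst)
{
    size_t i, out = 0;
    assert((src != NULL && dst != NULL) || n == 0);
    for (i = 0; i < n; i++) {
        unsigned char c = src[i];
        if (c < 0x80) {
            dst[out++] = c;                                   /* ASCII maps to itself */
        } else {
            dst[out++] = (unsigned char)(0xC0 | (c >> 6));    /* lead byte */
            dst[out++] = (unsigned char)(0x80 | (c & 0x3F));  /* continuation byte */
        }
    }
    return out;
}
```

The output can be up to twice as long as the input in bytes while holding the same number of characters, which is the kind of worst case that string_max_bytes() exists to account for.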

Foreign Encodings

Fill this in later; if anyone wants to implement new encodings at this stage they must be mad.

SEE ALSO

src/string.c, include/parrot/string.h, include/parrot/string_funcs.h.

HISTORY

4 October 2003

Revised to reflect changes since Buffer/PMC unification.