NAME

Prima::Drawable::Glyphs - helper routines for bi-directional text input and complex scripts output

SYNOPSIS

use Prima;
$::application-> begin_paint;
‭$::application-> text_shape_out('אפס123', 0,0);

‭123ספא

DESCRIPTION

The class implements an abstraction layer by organizing arrays filled with information about glyphs as a structure that can be used to render text strings. Objects of the class are created and returned by the Prima::Drawable::text_shape method, see more in "text_shape" in Prima::Drawable. A Prima::Drawable::Glyphs object is a blessed array reference that can contain either two, four, or five packed arrays with 16-bit integers, representing, correspondingly, a set of glyph indexes, a set of character indexes, a set of glyph advances, a set of glyph position offsets per glyph, and a font index. Additionally, the class implements several sets of helper routines that aim to address common tasks when displaying glyph-based text.

Structure

Each sub-array is an instance of the Prima::array class, an effective plain memory structure that provides a standard perl interface over a string scalar filled with fixed-width integers.

The following methods provide read-only access to these arrays:

glyphs

Contains a set of unsigned 16-bit integers where each is a glyph number corresponding to the font that was used for shaping the text. The glyph numbers are only applicable to the font used in the shaping process. Zero is usually treated as a default glyph in vector fonts when shaping cannot map a character; in bitmap fonts, this number is usually the same as defaultChar.

The glyphs array is recognized as a special case when it is sent to the text_out or get_text_width methods, which can process it natively. In this case, no special advances and glyph positions are taken into account.

Each glyph is not necessarily mapped to a text character, and quite often is not, even in English left-to-right texts. For example, character combinations like "ff", "fi", and "fl" may be mapped to single ligature glyphs. When the right-to-left (RTL) text direction is taken into account, the glyph positions may change, too. See indexes below, which addresses the mapping of glyphs to characters.

indexes

Contains a set of unsigned 16-bit integers where each is a text offset corresponding to the text used in the shaping process. Each glyph position points to the first character in the text that maps to the glyph.

There can be more than one character per glyph, such as in the above example with the "ff" ligature. There can also be cases with more than one character per more than one glyph, for example in Indic scripts. In these cases it is easier to operate neither by character offsets nor by glyph offsets, but rather by clusters, where each cluster is an individual syntax unit that contains one or more characters per one or more glyphs.

In addition to the text offset, each index value can be flagged with the to::RTL bit, signifying that the character in question has RTL direction. It is not only Semitic characters from RTL languages that have this attribute set; spaces in these languages are normally attributed with the RTL bit too, and sometimes numbers as well. The use of explicit direction control characters from the U+20XX block can result in any character being assigned or not assigned the RTL bit.

The array has an extra item added to its end, the length of the text that was used for the shaping. This helps calculate the cluster length in characters, especially of the last one, where the difference between indexes is, basically, the cluster length.

The array is not used for text drawing or calculation, but only for conversion between character, glyph, and cluster coordinates (see Coordinates below).
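
The cluster length arithmetic described above can be sketched in plain Perl, without Prima. This is only an illustration: it assumes the indexes are available as an ordinary list whose last item is the text length, and it uses a hypothetical RTL_BIT constant (0x8000) as a stand-in for to::RTL, whose actual value should be taken from Prima itself.

```perl
use strict;
use warnings;

# Assumption: stand-in for to::RTL; check the actual constant in Prima.
use constant RTL_BIT => 0x8000;

# Given the indexes list (with the text length appended as the last item),
# returns, for each glyph, the number of characters in its cluster.
sub cluster_char_lengths {
    my @indexes     = @_;
    my $text_length = pop @indexes;
    my @offsets     = map { $_ & ~RTL_BIT } @indexes;    # drop the direction flag

    # Sorted unique offsets, closed by the text length
    my %seen;
    my @sorted = sort { $a <=> $b } grep { !$seen{$_}++ } @offsets;
    push @sorted, $text_length;

    # A cluster ends where the next one (in text order) begins
    my %length;
    $length{ $sorted[$_] } = $sorted[ $_ + 1 ] - $sorted[$_] for 0 .. $#sorted - 1;

    return map { $length{$_} } @offsets;
}

# "office" shaped as o, ffi, c, e: the ligature covers three characters
print join(',', cluster_char_lengths(0, 1, 4, 5, 6)), "\n";    # 1,3,1,1
```

Sorting the offsets first keeps the calculation valid for RTL runs, where the indexes decrease along the glyph string.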

advances

Contains a set of unsigned 16-bit integers where each is the pixel distance that the corresponding glyph occupies. Where the advances array is not present, or was force-filled through the advances options of the text_shape method, a glyph advance value is basically the sum of the a, b, and c widths of the corresponding glyph. However, there are cases when, depending on the shaping input, these values can differ.

One of those cases is combining graphemes, where a text consisting of two characters, "A" and the combining grave accent U+0300, should be drawn as a single "À" symbol, and where the font doesn't have that single glyph but rather two individual glyphs "A" and "`". Even though the grave glyph has its own advance for standalone usage, in this case it should be ignored; this is achieved by the shaper setting the advance of the "`" to zero.

The array content is respected by the text_out and get_text_width methods, and its content can be changed at will to produce gaps in the text quite easily. For example, Prima::Edit uses that to display tab characters as spaces with the 8x advance.
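
As a simplified model of the above, the advances can be pictured as a plain list of pixel widths. The sketch below does not call Prima; the numbers and the helper names string_width and widen_glyph are hypothetical and only show how a zero advance and a widened advance affect the total width.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Hypothetical advances for the glyphs "A", combining grave, tab, "B".
# The combining grave has a zero advance, so it does not move the pen.
my @advances = (9, 0, 4, 8);

# In this simplified model the string width is the sum of the advances
sub string_width { return sum0(@_) }

# Widens one glyph, e.g. to show a tab as $n spaces of $space_width pixels each
sub widen_glyph {
    my ($advances, $glyph_index, $space_width, $n) = @_;
    $advances->[$glyph_index] = $space_width * $n;
    return $advances;
}

print string_width(@advances), "\n";    # 21
widen_glyph(\@advances, 2, 4, 8);
print string_width(@advances), "\n";    # 49
```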

positions

Contains a set of pairs of signed 16-bit integers where each is an X and Y pixel offset for each glyph. Like in the previous example with the "À" symbol, the grave glyph "`" may be positioned differently on the vertical axis in "À" and "à" graphemes, for example.

The array is respected by text_out (but not by get_text_width).

fonts

Contains a set of unsigned 16-bit integers where each is an index in the font substitution list (see "font_mapper" in Prima::Drawable). Zero means the current font.

The font substitution is applied by the text_shape method when the polyfont option is set (it is by default), and when the shaper cannot match all characters in the text to the glyphs using the current font. If the current font contains all the needed glyphs, this entry is not present at all.

The array is respected by the text_out and get_text_width methods.

Coordinates

In addition to the natural character coordinates, where each index is a text offset that can be directly used in the substr perl function, the Prima::Drawable::Glyphs class offers two additional coordinate systems that help abstract the object data for the display and navigation.

The glyph coordinate system is a rather straightforward copy of the character coordinate system, where each number is an offset in the glyphs array. Similarly, these offsets can be used to address individual glyphs, indexes, advances, and positions. However, these are not easy to use when one needs, for example, to select a grapheme with a mouse, or break a set of glyphs in such a way that a grapheme is not broken. These use cases can be managed more easily in the cluster coordinate system.

The cluster coordinates represent a virtually superimposed set of offsets where each corresponds to a set of one or more characters displayed by one or more glyphs. The most useful functions below operate in this system.
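
The relation between glyph and cluster coordinates can be illustrated with a short plain-Perl sketch. It assumes that glyphs belonging to the same cluster carry the same character offset in the indexes array, and it uses a hypothetical RTL_BIT constant (0x8000) in place of to::RTL; the helper group_clusters is not part of the class.

```perl
use strict;
use warnings;

# Assumption: stand-in for to::RTL; check the actual constant in Prima.
use constant RTL_BIT => 0x8000;

# Groups consecutive glyphs that share a character offset into clusters.
# Takes the indexes without the trailing text length.
# Returns a list of [ first_character_offset, number_of_glyphs ] pairs.
sub group_clusters {
    my @clusters;
    for my $index (@_) {
        my $offset = $index & ~RTL_BIT;
        if (@clusters && $clusters[-1][0] == $offset) {
            $clusters[-1][1]++;                 # one more glyph in the same cluster
        } else {
            push @clusters, [ $offset, 1 ];     # a new cluster starts here
        }
    }
    return @clusters;
}

# Hypothetical shaping result: glyphs 2 and 3 both render the cluster at offset 2
my @clusters = group_clusters(0, 1, 2, 2, 4);
printf "cluster %d: offset %d, %d glyph(s)\n", $_, @{ $clusters[$_] } for 0 .. $#clusters;
```

Here five glyphs collapse into four clusters, so a cluster offset of 3 addresses the last glyph rather than the fourth one.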

Visual selection

The coordinates that are best used for implementing the visual selection are either characters or clusters, but not glyphs. The character-based selection makes it trivial to extract or replace the selected text, while the cluster-based makes it easier to manipulate (for example with Shift-arrow keys) the selection itself.

The class supports both, by operating on selection maps or selection chunks, where each represents the same information but in different ways. For example, consider an embedded number in a bidi text. For the sake of clarity, I'll use Latin characters here. Let's imagine a text scalar containing these characters:

ABC123

where ABC is a right-to-left text that, if rendered on the screen, should be displayed as

123CBA

(and the indexes, i.e. the offsets of the first characters for each glyph, are (3,4,5,2,1,0) ).

Next, the user clicks the mouse between the glyphs A and B (in the text offset of 1), drags the mouse to the left, and finally stops between the characters 2 and 3 (in the text offset of 4). The resulting selection then should not be, as one might naively expect, this:

123CBA
__^^^_

but this instead:

123CBA
^^_^^_

because the next character after C is 1, and the range of the selected sub-text is from characters 1 to 4.

The class offers means to encode such information in a map, i.e. an array of integers 1,1,0,1,1,0, where each entry is either 1 or 0 depending on whether the cluster is selected or not. Alternatively, the same information can be encoded in chunks, or RLE sets, as an array 0,2,1,2,1, where the first integer is the number of non-selected clusters to display, the second is the number of selected clusters, the third is the number of non-selected clusters again, etc. If the first cluster belongs to the selected chunk, the first integer in the result is set to 0.
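
The two encodings can be reproduced for the ABC123 example with a plain-Perl sketch. It does not use the class itself; the helpers sketch_selection_map and sketch_map_to_chunks are hypothetical, and the sketch assumes one character per cluster and an inclusive range of character offsets.

```perl
use strict;
use warnings;

# Marks every cluster whose first character lies in the selected text range.
# @indexes holds plain character offsets, one per cluster, in visual order.
sub sketch_selection_map {
    my ($start, $end, @indexes) = @_;
    return map { ($_ >= $start && $_ <= $end) ? 1 : 0 } @indexes;
}

# Run-length encodes a map; the first chunk always counts non-selected items,
# so it is 0 when the map begins with a selected one.
sub sketch_map_to_chunks {
    my @chunks = (0);
    my $state  = 0;
    for my $bit (@_) {
        if ($bit == $state) {
            $chunks[-1]++;
        } else {
            push @chunks, 1;
            $state = $bit;
        }
    }
    return @chunks;
}

# Characters 1 to 4 of ABC123, displayed as 123CBA with indexes (3,4,5,2,1,0)
my @map    = sketch_selection_map(1, 4, 3, 4, 5, 2, 1, 0);
my @chunks = sketch_map_to_chunks(@map);
print "map:    @map\n";       # map:    1 1 0 1 1 0
print "chunks: @chunks\n";    # chunks: 0 2 1 2 1
```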

Bidi input

When sending an input to a widget to type some text, the otherwise trivial case of figuring out at which position the text should be inserted (or removed, for that matter), becomes interesting when there are characters with mixed input direction.

For example, it is indeed trivial, when the Latin text is AB, and the cursor is positioned between A and B, to figure out that whenever the user types C, the result should become ACB. Likewise, when the text is RTL and both text and input are Arabic, the result is the same. However, when, for example, the text is A1, which is displayed as 1A because of the RTL shaping, and the cursor is positioned between the 1 (LTR) and A (RTL) glyphs, it is not clear whether that means the new input should be appended after 1 and become A1C, or after A, and become, correspondingly, AC1.

There is no easy solution for this problem, and different programs approach this differently, where some go as far as to provide two cursors for both input directions. The class offers its own solution that uses some primitive heuristics to detect whether the cursor belongs to the left or the right glyph. This is an area that can be enhanced, and any help from native users of the languages that use the right-to-left writing system would be greatly appreciated.

API

abc $CANVAS, $INDEX

Returns the a, b, c metrics from the glyph $INDEX.

advances

A read-only accessor to the advances array, see Structure above.

clone

Clones the object.

cluster2glyph $FROM, $LENGTH

Maps the range of clusters starting with $FROM with size $LENGTH into the corresponding range of glyphs. Undefined $LENGTH calculates the range from $FROM to the object's end.

cluster2index $CLUSTER

Returns the character offset of the first character in the cluster $CLUSTER.

Note: the result may contain the to::RTL flag.

cluster2range $CLUSTER

Returns the character offset of the first character in the cluster $CLUSTER and the number of characters in the cluster.

clusters

Returns an array of integers where each is the offset of the first character in each cluster.

cursor2offset $AT_CLUSTER, $PREFERRED_RTL

Given the cursor is positioned next to the cluster $AT_CLUSTER, runs simple heuristics to calculate what character offset it corresponds to. The $PREFERRED_RTL flag is used when object data does not have enough information to decide the text direction.

See "Bidi input" above.

def $CANVAS, $INDEX

Returns the d, e, f metrics from the glyph $INDEX.

fonts

A read-only accessor to the font indexes, see Structure above.

get_box $CANVAS

Returns the box metrics of the glyph object.

See "get_text_box" in Prima::Drawable.

get_sub $FROM, $LENGTH

Extracts and clones a new object that contains data from cluster offset $FROM with cluster length $LENGTH.

get_sub_box $CANVAS, $FROM, $LENGTH

Calculates the box metrics of the glyph string from the cluster $FROM with size $LENGTH.

get_sub_width $CANVAS, $FROM, $LENGTH

Calculates the pixel width of the glyph string from the cluster $FROM with size $LENGTH.

get_width $CANVAS, $WITH_OVERHANGS

Returns the width of the glyph object, with overhangs if requested.

glyph2cluster $GLYPH

Returns the cluster that contains $GLYPH.

glyphs

A read-only accessor to the glyph indexes array, see Structure above.

glyph_lengths

Returns an array where each glyph position holds the number of glyphs that the corresponding cluster occupies.

index2cluster $INDEX, $ADVANCE = 0

Returns the cluster that contains the character offset $INDEX.

Set $ADVANCE to 1 to add the RTL-dependent advance to the resulting cluster.

indexes

A read-only accessor to the indexes, see Structure above.

index_lengths

Returns an array where each glyph position holds the number of characters that the corresponding cluster occupies.

justify CANVAS, TEXT, WIDTH, %OPTIONS

An umbrella call for justify_interspace if $OPTIONS{letter} or $OPTIONS{word} is set; for justify_arabic if $OPTIONS{kashida} is set; and for justify_tabs if $OPTIONS{tabs} is set.

Returns a boolean flag whether the glyph object was changed or not.

justify_arabic CANVAS, TEXT, WIDTH, %OPTIONS

Performs justification of Arabic TEXT with kashida to the given WIDTH. Returns either a success flag or a new text with explicit tatweel characters inserted.

my $text = "\x{6a9}\x{634}\x{6cc}\x{62f}\x{647}";
my $g = $canvas->text_shape($text) or return;
$canvas->text_out($g, 10, 50);
$g->justify_arabic($canvas, $text, 200) or return;
$canvas->text_out($g, 10, 10);

Inserts tatweels only between Arabic letters that did not form any ligatures in the glyph object, at most one tatweel set per word (if any). Does not apply the justification if the letters in the word are rendered as LTR due to embedding or explicit shaping options; only justifies RTL letters. If for some reason the newly inserted tatweels do not form a monotonically increasing series after shaping, skips the justification in that word.

Note: does not use the JSTF font table; on Windows, the results may differ from the native rendering.

Options:

If justification is found to be needed, eventual ligatures with newly inserted tatweel glyphs are resolved via a call to text_shape(%OPTIONS) - so any needed shaping options, such as language, may be passed there.

as_text BOOL = 0

If set, returns the new text with inserted tatweels, or undef if no justification is possible.

If unset, runs in-place justification on the caller glyph object, and returns the boolean success flag.

min_kashida INTEGER = 0

Specifies the minimal width of a kashida strike to be inserted.

kashida_width INTEGER

During the calculation, the width of the tatweel glyph is needed; unless supplied by this option, it is calculated dynamically. Also, when the method is called in list context and succeeds, it returns a (1, kashida_width) tuple that can be reused in subsequent calls.

justify_interspace CANVAS, TEXT, WIDTH, %OPTIONS

Performs an in-place inter-letter and/or inter-word justification of TEXT to the given WIDTH. Returns either a boolean flag indicating whether any changes were made, or the new text with explicit space characters inserted.

Options:

as_text BOOL = 0

If set, returns new text with inserted spaces, or undef if no justification is possible.

If unset, runs in-place justification on the caller glyph object, and returns the boolean success flag.

letter BOOL = 1

If set, runs an inter-letter spacing on all glyphs.

max_interletter FLOAT = 1.05

When the inter-letter spacing is applied, it is applied first, so that the width of the resulting text line can take up to $OPTIONS{max_interletter} * glyph_width pixels.

The inter-word spacing does not have such a limit, and in the worst case can produce two words moved to the left and the right edges of the enclosing 0 - WIDTH-1 rectangle.

space_width INTEGER

as_text mode: during the calculation, the width of the space glyph may be needed. Unless supplied by $OPTIONS{space_width}, it is calculated dynamically. Also, when the method is called in list context and succeeds, it returns a (1, space_width) tuple that can be reused in subsequent calls.

word BOOL = 1

If set, runs an inter-word spacing by extending advances on all space glyphs.

min_text_to_space_ratio FLOAT = 0.75

If the word option is set, does not run the inter-word justification if the text-to-space ratio is too small (so as not to spread the text too thin).
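
The interplay of the max_interletter limit and the unlimited inter-word spacing can be sketched as simple arithmetic. The helper split_extra_space below is hypothetical and is not how the class is necessarily implemented; it only shows how the missing pixels might be divided, and it ignores min_text_to_space_ratio.

```perl
use strict;
use warnings;

# Splits the pixels missing to reach $width between inter-letter and
# inter-word spacing. $glyph_width is the width of the unjustified line.
sub split_extra_space {
    my ($glyph_width, $width, $max_interletter) = @_;
    my $missing = $width - $glyph_width;
    return (0, 0) if $missing <= 0;

    # Inter-letter spacing goes first and may stretch the line
    # up to $max_interletter * $glyph_width pixels
    my $letter_limit = int($max_interletter * $glyph_width) - $glyph_width;
    $letter_limit = 0 if $letter_limit < 0;
    my $for_letters = $missing < $letter_limit ? $missing : $letter_limit;

    # Whatever is left has to be absorbed by the spaces between words
    return ($for_letters, $missing - $for_letters);
}

my ($letters, $words) = split_extra_space(200, 320, 1.5);
print "letters: $letters, words: $words\n";    # letters: 100, words: 20
```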

justify_tabs CANVAS, TEXT, %OPTIONS

Expands the tab characters as $OPTIONS{tabs} (default: 8) spaces.

Needs the advance of the space glyph to replace the tab glyph. If $OPTIONS{glyph} and $OPTIONS{width} are not specified, calculates them.

Returns a boolean flag indicating whether any changes were made. On success, if called in list context, also returns the space glyph ID and the space glyph width for eventual use in later calls.

left_overhang

The first integer from the overhangs result.

log2vis

Returns a map of integers where each character position corresponds to the glyph position. The name is a rudiment from pure fribidi shaping, where log2vis and vis2log were mapper functions with the same functionality.
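
The kind of mapping involved can be illustrated in plain Perl as an inversion of the indexes array. This is a sketch under assumptions, not the class code: sketch_log2vis is a hypothetical helper, RTL_BIT (0x8000) stands in for to::RTL, and characters inside a multi-character cluster are assumed to map to the glyph of the cluster's first character.

```perl
use strict;
use warnings;

# Assumption: stand-in for to::RTL; check the actual constant in Prima.
use constant RTL_BIT => 0x8000;

# Takes the indexes with the text length appended as the last item and
# returns, for each character offset, the glyph position that displays it.
sub sketch_log2vis {
    my @indexes     = @_;
    my $text_length = pop @indexes;
    my @map         = (undef) x $text_length;

    # The first character of a cluster maps to the first glyph referring to it
    for my $glyph (0 .. $#indexes) {
        my $char = $indexes[$glyph] & ~RTL_BIT;
        $map[$char] //= $glyph;
    }

    # Assumption: the remaining characters of a cluster share that glyph
    for my $char (1 .. $#map) {
        $map[$char] //= $map[ $char - 1 ];
    }
    return @map;
}

# ABC123 displayed as 123CBA
print join(',', sketch_log2vis(3, 4, 5, 2, 1, 0, 6)), "\n";    # 5,4,3,0,1,2
```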

n_clusters

Calculates how many clusters there are in the object.

new @ARRAYS

Creates a new object. Not meant to be called directly; objects are created automatically inside the text_shape method.

new_array NAME

Creates an array suitable for direct insertion into the object, if manual construction of the object is needed. For example, one may set the missing fonts array like this:

$obj->[ Prima::Drawable::Glyphs::FONTS() ] = $obj->new_array('fonts');
$obj->fonts->[0] = 1;

The newly created array is filled with zeros.

new_empty

Creates a new empty object.

overhangs

Calculates two widths for overhangs at the beginning and at the end of the glyph string. This is used in the emulation of the get_text_width method with the to::AddOverhangs flag.

positions

A read-only accessor to the positions array, see Structure above.

reorder_text TEXT

Returns a visual representation of TEXT assuming it was the input of the text_shape call that created the object.

reverse

Creates a new object that has all arrays reversed. Used for calculation of the pixel offset from the right end of a glyph string.

right_overhang

The second integer from the overhangs result.

selection2range $CLUSTER_START, $CLUSTER_END

Converts a cluster selection range into a text selection range.

selection_chunks_clusters, selection_chunks_glyphs $START, $END

Converts a text selection given as the visual range between $START and $END into a set of integers (chunks), where each is the number of selected or non-selected clusters or glyphs. The first chunk is the number of non-selected items and is 0 if the first cluster or glyph is selected.

selection_diff $OLD, $NEW

Given two chunk sets in the format as returned by selection_chunks_clusters or selection_chunks_glyphs, calculates the new set of chunks where each integer value corresponds to the number of the clusters or glyphs affected by the transition from the $OLD to $NEW visual selection. The first chunk is the number of non-affected items and is 0 if the first cluster or glyph is affected by the selection change.

Can be used for efficient repaints when the user interactively changes text selection, to redraw only the changed regions.
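
The idea of a chunk difference can be sketched in plain Perl by expanding both chunk sets into maps, comparing them item by item, and encoding the result again. The helpers below are hypothetical and are not taken from the class; they only illustrate the format described above.

```perl
use strict;
use warnings;

# Expands RLE chunks into a 0/1 map; the first chunk counts non-selected items
sub chunks_to_map {
    my @map;
    my $selected = 0;
    for my $count (@_) {
        push @map, ($selected) x $count;
        $selected ^= 1;
    }
    return @map;
}

# Run-length encodes a 0/1 map, starting with the count of zero entries
sub map_to_chunks {
    my @chunks = (0);
    my $state  = 0;
    for my $bit (@_) {
        if ($bit == $state) {
            $chunks[-1]++;
        } else {
            push @chunks, 1;
            $state = $bit;
        }
    }
    return @chunks;
}

# An item is affected when its selection state differs between $old and $new
sub sketch_selection_diff {
    my ($old, $new) = @_;
    my @old = chunks_to_map(@$old);
    my @new = chunks_to_map(@$new);
    my $n   = @old > @new ? @old : @new;
    return map_to_chunks(map { ($old[$_] // 0) ^ ($new[$_] // 0) } 0 .. $n - 1);
}

# Only the fifth item changes its state, so only it needs a repaint
print join(',', sketch_selection_diff([ 0, 2, 1, 2, 1 ], [ 0, 2, 1, 1, 2 ])), "\n";    # 4,1,1
```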

selection_map_clusters, selection_map_glyphs $START, $END

Same as selection_chunks_XXX, but instead of RLE chunks returns a full array for each cluster/glyph, where each entry is a boolean value corresponding to whether that cluster/glyph is to be displayed as selected or not.

selection_walk $CHUNKS, $FROM, $TO = length, $SUB

Walks the selection chunks array, as returned by selection_chunks_XXX, between the $FROM and $TO clusters/glyphs. Calls the provided $SUB->($offset, $length, $selected) for each chunk, where each call receives two integers, the chunk offset and its length, and a boolean flag indicating whether the chunk is selected or not.

Can also be used on the result of selection_diff, in which case the $selected flag shows whether the chunk is affected by the selection change or not.
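
The callback protocol can be illustrated with a plain-Perl sketch that walks a whole chunk array. The helper sketch_selection_walk is hypothetical: it takes no $FROM and $TO limits, and skipping empty chunks is an assumption made here for brevity.

```perl
use strict;
use warnings;

# Calls $sub->($offset, $length, $selected) for every non-empty chunk.
# Chunks alternate between non-selected and selected, starting with non-selected.
sub sketch_selection_walk {
    my ($chunks, $sub) = @_;
    my ($offset, $selected) = (0, 0);
    for my $length (@$chunks) {
        $sub->($offset, $length, $selected) if $length > 0;
        $offset  += $length;
        $selected = $selected ? 0 : 1;
    }
}

# Chunks for the 123CBA example: items 0-1 and 3-4 are selected
sketch_selection_walk([ 0, 2, 1, 2, 1 ], sub {
    my ($offset, $length, $selected) = @_;
    printf "%d..%d %s\n", $offset, $offset + $length - 1,
        $selected ? 'selected' : 'plain';
});
```

A paint routine could use such a callback to draw each run with the normal or the highlighted colors in a single pass.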

sub_text_out $CANVAS, $FROM, $LENGTH, $X, $Y

An optimized version of $CANVAS->text_out( $self->get_sub($FROM, $LENGTH), $X, $Y ).

sub_text_wrap $CANVAS, $FROM, $LENGTH, $WIDTH, $OPT, $TABS

An optimized version of $CANVAS->text_wrap( $self->get_sub($FROM, $LENGTH), $WIDTH, $OPT, $TABS ). The result is also converted to chunks.

text_length

Returns the length of the text that was shaped and that produced the object.

x2cluster $CANVAS, $X, $FROM, $LENGTH

Given the sub-cluster from $FROM with size $LENGTH, calculates how many clusters would fit in $X pixels.

_debug

Dumps the glyph object content in a readable format.

EXAMPLES

This section is only here to test proper rendering.

Latin

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

Latin combining

D̍üi̔s͙ a̸u̵t͏eͬ ịr͡u̍r͜e̥ d͎ǒl̋o̻rͫ i̮n̓ r͐e̔p͊rͨe̾h̍e͐n̔ḋe͠r̕i̾t̅ ịn̷ vͅo̖lͦuͦpͧt̪ątͅe̪

v̰e̷l̳i̯t̽ e̵s̼s̈e̮ ċi̵l͟l͙u͆m͂ d̿o̙lͭo͕r̀e̯ ḛu̅ fͩuͧg̦iͩa̓ť n̜u̼lͩl͠a̒ p̏a̽r̗i͆a͆t̳űr̀

Cyrillic

Lorem Ipsum используют потому, что тот обеспечивает более или менее стандартное заполнение шаблона.

а также реальное распределение букв и пробелов в абзацах

Hebrew

זוהי עובדה מבוססת שדעתו של הקורא תהיה מוסחת על ידי טקטס קריא כאשר הוא יביט בפריסתו.

המטרה בשימוש ב-Lorem Ipsum הוא שיש לו פחות או יותר תפוצה של אותיות, בניגוד למלל

Arabic

العديد من برامح النشر المكتبي وبرامح تحرير صفحات الويب تستخدم لوريم إيبسوم بشكل إفتراضي

كنموذج عن النص، وإذا قمت بإدخال "lorem ipsum" في أي محرك بحث ستظهر العديد من

Hindi

Lorem Ipsum के अंश कई रूप में उपलब्ध हैं, लेकिन बहुमत को किसी अन्य रूप में परिवर्तन का सामना करना पड़ा है, हास्य डालना या क्रमरहित शब्द ,

जो तनिक भी विश्वसनीय नहीं लग रहे हो. यदि आप Lorem Ipsum के एक अनुच्छेद का उपयोग करने जा रहे हैं, तो आप को यकीन दिला दें कि पाठ के मध्य में वहाँ कुछ भी शर्मनाक छिपा हुआ नहीं है.

Chinese

无可否认,当读者在浏览一个页面的排版时,难免会被可阅读的内容所分散注意力。

Lorem Ipsum的目的就是为了保持字母多多少少标准及平

Thai

มีหลักฐานที่เป็นข้อเท็จจริงยืนยันมานานแล้ว ว่าเนื้อหาที่อ่านรู้เรื่องนั้นจะไปกวนสมาธิของคนอ่านให้เขวไปจากส่วนที้เป็น Layout เรานำ Lorem Ipsum มาใช้เพราะความที่มันมีการกระจายของตัวอักษรธรรมดาๆ แบบพอประมาณ ซึ่งเอามาใช้แทนการเขียนว่า ‘ตรงนี้เป็นเนื้อหา, ตรงนี้เป็นเนื้อหา' ได้ และยังทำให้มองดูเหมือนกับภาษาอังกฤษที่อ่านได้ปกติ ปัจจุบันมีแพ็กเกจของซอฟท์แวร์การทำสื่อสิ่งพิมพ์ และซอฟท์แวร์การสร้างเว็บเพจ

กวนสมาธิของคนอ่านให้เขวไปจากส่วนที้เป็น Layout เรานำ Lorem Ipsum

(Note: libthai is required for text wrapping by the word boundary)

Largest well-known grapheme cluster in Unicode

ཧྐྵྨླྺྼྻྂ

http://archives.miloush.net/michkap/archive/2010/04/28/10002896.html.

AUTHOR

Dmitry Karasik, <dmitry@karasik.eu.org>.

SEE ALSO

examples/bidi.pl