NAME
Unicode::Map - maps charsets from and to UCS2 unicode
ALPHA release of $Date: 1998/02/12 15:01:18 $
SYNOPSIS
use Unicode::Map();
1. Standard case:
$Map = new Unicode::Map({ ID => "ISO-8859-1" });
$_16bit = $Map -> to_unicode ("Hello world!");
=> $_16bit == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"
$_8bit = $Map -> from_unicode ($_16bit);
=> $_8bit == "Hello world!"
2. If you need different charsets:
$Map = new Unicode::Map;
$_16bit = $Map -> to_unicode ("ISO-8859-1", "Hello world!");
=> $_16bit == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"
$_8bit = $Map -> from_unicode ("ISO-8859-7", $_16bit);
=> $_8bit == "Hello world!"
More methods and a more detailed description below.
DESCRIPTION
This module converts strings from and to 2-byte Unicode UCS2 format. Available character sets, their names and their aliases are defined in the file REGISTRY in the Unicode::Map hierarchy.
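To make the 2-byte layout concrete, here is a minimal sketch for the ISO-8859-1 case only, using nothing but core pack and unpack. The helper names latin1_to_ucs2 and ucs2_to_latin1 are hypothetical and are not part of Unicode::Map; the sketch assumes network byte order and relies on the fact that ISO-8859-1 byte values coincide with the first 256 Unicode code points.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Map): widen each ISO-8859-1
# byte to one 16-bit code unit in network (big-endian) byte order.
sub latin1_to_ucs2 {
    my ($bytes) = @_;
    return pack("n*", unpack("C*", $bytes));
}

# Hypothetical helper: the reverse direction. Code units above 0xFF have
# no ISO-8859-1 equivalent; "?" is an arbitrary stand-in chosen here.
sub ucs2_to_latin1 {
    my ($ucs2) = @_;
    return join "", map { $_ <= 0xFF ? chr($_) : "?" } unpack("n*", $ucs2);
}

my $wide = latin1_to_ucs2("Hello world!");
print length($wide), "\n";    # 24 bytes for 12 characters
```

Other character sets need a real mapping table, which is what the binary mapfiles of this module provide.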
Characters are mapped according to the data in the binary mapfiles of the Unicode::Map hierarchy. Binary mapfiles can also be created with this module, so you can install your own character sets.
Normally it is sufficient to map 1 character to 1 unicode character and vice versa. Apple defines some mappings from 1 character to n unicode characters, so this case is implemented as well.
Have a look at the utility map that comes with this module.
If you need neither n chars -> m chars mappings nor 16 bit -> 16 bit mappings, I recommend using the high performance 8 bit <-> 16 bit module Unicode::Map8 by Gisle Aas instead.
CONVERSION METHODS
- from_unicode
-
1||0 = $Map -> from_unicode (($csid,) $src||\$src, \$dest)
$dest = $Map -> from_unicode (($csid,) $src||\$src)
Converts a UTF16 Unicode encoded string into $csid character set representation. The string is taken from $src. If \$dest is given, the converted string is stored in variable $dest; otherwise it is simply returned.
Parameter $csid has to be given if it was omitted at constructor new. You can use to8 as a synonym for from_unicode.
- new
-
$Map = new Unicode::Map()
Returns a new Map object. Method new can be initialized via an anonymous hash, optionally with an instance $Startup of OLE::Storage::Startup:
$Map = new Unicode::Map({ ID => $csid, STARTUP => $Startup })
The module would then send comments and error messages to $Startup. You can change the verbosity of comments with method noise. Module Startup is in very early development and is packaged with the OLE::Storage distribution; it is not published separately.
- noise
-
$Map -> noise ($n)
Defines the verbosity of messages to user sent via $Startup. Can be no messages at all (n=0), some information (n=1) or some more information (n=3). Default is n=1.
- reverse_unicode
-
$string = $Map -> reverse_unicode ($string)
One Unicode character, more precisely one UCS2 (UTF16) character, consists of two bytes. Therefore the order in which these bytes are stored matters. As far as I could figure out, Unicode characters are assumed to be in "Network order" (0x1234 => 0x12, 0x34). Alas, many PC Windows documents store Unicode characters internally in "Vax order" (0x1234 => 0x34, 0x12). With this method you can convert "Vax order" -> "Network order" and vice versa.
If possible, reverse_unicode changes the original variable!
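The byte swap itself can be sketched with core pack and unpack. The helper name swap_ucs2_bytes is hypothetical and not part of this module; unlike reverse_unicode it returns a new string and never changes its argument, and an odd trailing byte is silently dropped.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Map): read the string as
# little-endian ("Vax order") 16-bit units and write them back in
# big-endian ("Network order"). Swapping is its own inverse, so the same
# call also converts Network order to Vax order.
sub swap_ucs2_bytes {
    my ($string) = @_;
    return pack("n*", unpack("v*", $string));
}

my $vax     = "\x34\x12";               # 0x1234 stored in Vax order
my $network = swap_ucs2_bytes($vax);    # now "\x12\x34"
```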
- to_unicode
-
1||0 = $Map -> to_unicode (($csid,) $src||\$src, \$dest)
$dest = $Map -> to_unicode (($csid,) $src||\$src)
Converts a $csid encoded string into UTF16 Unicode character set representation. The string is taken from $src. If \$dest is given, the converted string is stored in variable $dest; otherwise it is simply returned.
Parameter $csid has to be given if it was omitted at constructor new. You can use to16 as a synonym for to_unicode.
MAINTENANCE METHODS
- alias
-
@list = $Map -> alias ($csid)
Returns a list of alias names of character set $csid.
- dest
-
$path = $Map -> dest ($csid)
Returns the relative path of binary character mapping for character set $csid according to REGISTRY file of Unicode::Map.
- id
-
$real_id||"" = $Map -> id ($test_id)
Returns a valid character set identifier $real_id, if $test_id is a valid character set name or alias name according to REGISTRY file of Unicode::Map.
- ids
-
@ids = $Map -> ids()
Returns a list of all character set names defined in REGISTRY file.
- read_text_mapping
-
1||0 = $Map -> read_text_mapping ($csid, $path, $style)
Reads a text mapping of style $style named $csid from filename $path. The mapping can then be saved to a file with method write_binary_mapping. $style can be:
  style      description
  "unicode"  A text mapping as of ftp://ftp.unicode.org/MAPPINGS/
  ""         Same as "unicode"
  "reverse"  Similar to unicode, but both columns are switched
  "keld"     A text mapping as of ftp://dkuug.dk/i18n/charmaps/
- src
-
$path = $Map -> src ($csid)
Returns the path of textual character mapping for character set $csid according to REGISTRY file of Unicode::Map.
- style
-
$style = $Map -> style ($csid)
Returns the style of textual character mapping for character set $csid according to REGISTRY file of Unicode::Map.
- write_binary_mapping
-
1||0 = $Map -> write_binary_mapping ($csid, $path)
Stores a mapping that has been loaded via method read_text_mapping in file $path.
BINARY MAPPINGS
Structure of binary Mapfiles
Unicode character mapping tables contain runs of sequential key codes and sequential value codes. This property is used to crunch the maps easily: n (0<n<256) sequential characters are represented as a byte count n and the first character code key_start. The corresponding value sequences for these runs are crunched in the same way. The value 0 is used to start an extended information block (which is only partially implemented, though).
One could think of two ways to make a binary mapfile. The first would be to write a list of all key codes and then a list of all value codes. The second, used here, appends to each partial key code list the corresponding crunched value code lists. This keeps value codes a little closer to their key codes.
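The crunching idea can be sketched as follows. This is an illustration of the run grouping only, under the assumption that keys and values are plain integers; the helper name crunch_runs and its return structure are hypothetical, and the code does not read or write the actual file format of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: group a key => value mapping (hash reference with
# integer keys and values) into runs. Each result element is
#   [ key_start, n, [ [ val_start, m ], ... ] ]
# where the m of one key run sum up to its n, and n stays below 256.
sub crunch_runs {
    my ($map) = @_;
    my @key_runs;
    for my $key (sort { $a <=> $b } keys %$map) {
        my $run = $key_runs[-1];
        if ($run && $key == $run->[0] + $run->[1] && $run->[1] < 255) {
            $run->[1]++;                    # key continues the current run
        } else {
            $run = [ $key, 1, [] ];         # key starts a new run
            push @key_runs, $run;
        }
        my $val     = $map->{$key};
        my $val_run = $run->[2][-1];
        if ($val_run && $val == $val_run->[0] + $val_run->[1]) {
            $val_run->[1]++;                # value continues the value run
        } else {
            push @{ $run->[2] }, [ $val, 1 ];
        }
    }
    return @key_runs;
}

# Demo data, invented for this example.
my %demo = (0x41 => 0x0041, 0x42 => 0x0042, 0x43 => 0x0043,
            0x44 => 0x0100, 0x80 => 0x20AC);
for my $run (crunch_runs(\%demo)) {
    my ($key_start, $n, $val_runs) = @$run;
    printf "keys 0x%02x +%d:", $key_start, $n;
    printf " values 0x%04x +%d", @$_ for @$val_runs;
    print "\n";
}
```

In the demo the four sequential keys 0x41..0x44 form one key run whose values split into two value runs, which mirrors a key sequence followed by value sequences until sum(m) equals n.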
Note: the file format is still in a very liquid state. Do not rely on it staying as it is, on the description being free of bugs, or on all features being implemented.
STRUCTURE:
- <main>:
-
  offset  structure  value
  0x00    word       0x27b8 (magic)
  0x02    @(<extended> || <submapping>)
The mapfile ends with extended mode <end> in main stream.
- <submapping>:
-
  0x00  byte != 0  charsize1 (bits)
  0x01  byte       n1, number of chars for one entry
  0x02  byte       charsize2 (bits)
  0x03  byte       n2, number of chars for one entry
  0x04  @(<extended> || <key_seq> || <key_val_seq>)
bs1=int((charsize1+7)/8), bs2=int((charsize2+7)/8)
One submapping ends when <mapend> entry occurs.
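The byte sizes bs1 and bs2 are simply the character sizes in bits rounded up to whole bytes. A minimal sketch of that arithmetic, with the hypothetical helper name byte_size:

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes needed to hold a character of
# $bits bits, following bs = int((charsize + 7) / 8) from the text.
sub byte_size {
    my ($bits) = @_;
    return int(($bits + 7) / 8);
}

# An 8 bit charset mapped to 16 bit Unicode gives bs1 = 1 and bs2 = 2.
my ($bs1, $bs2) = (byte_size(8), byte_size(16));
```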
- <key_val_seq>:
-
  0x00  size=0|1|2|4  n, number of sequential characters
  size  bs1           key1
  +bs1  bs2           value1
  +bs2  bs1           key2
  +bs1  bs2           value2
  ...
key_val_seq ends, if either file ends (n = infinite mode) or n pairs are read.
- <key_seq>:
-
  0x00   byte  n, number of sequential characters
  0x01   bs1   key_start, first character of sequence
  1+bs1  @(<extended> || <val_seq>)
A key sequence starts with a byte count telling how long the sequence is. It is followed by the key start code. After this comes a list of value sequences. The list of value sequences ends, if sum(m) equals n.
- <val_seq>:
-
  0x00  byte  m, number of sequential characters
  0x01  bs2   val_start, first character of sequence
- <extended>:
-
  0x00  byte         0
  0x01  byte         ftype
  0x02  byte         fsize, size of following structure
  0x03  fsize bytes  something
For future extensions or private use one can insert streams of 1..255 bytes here. ftype can have values 30..255; values 0..29 are reserved. The modes are not fully defined yet and could change. They will be explained later.
TO BE DONE
- -
-
Something clever, when a character has no translation.
- -
-
Direct charset -> charset mapping.
- -
-
Velocity.
- -
-
Support for mappings according to RFC 1345.
- -
-
Something clever to include partial character sets in character sets. This is for those charset definitions that, for whatever reason, do not include mappings for control codes.
- -
-
The "REGISTRY" concept is somehow weird...
SEE ALSO
- -
-
File REGISTRY and binary mappings in directory Unicode/Map of your perl library path
- -
-
recode(1), map(1), mkmapfile(1), Unicode::Map(3), Unicode::Map8(3), Unicode::String(3), Unicode::CharName(3)
- -
-
RFC 1345
- -
-
Mappings at Unicode consortium ftp://ftp.unicode.org/MAPPINGS/
- -
-
Registered Internet character sets ftp://dkuug.dk/i18n/charmaps/
AUTHOR
Martin Schwartz <schwartz@cs.tu-berlin.de>