NAME

Unicode::Japanese - Japanese character set conversion

SYNOPSIS

use Unicode::Japanese;

# convert utf8 -> sjis

print Unicode::Japanese->new($str)->sjis;

# convert sjis -> utf8

print Unicode::Japanese->new($str,'sjis')->get;

# convert sjis(imode_EMOJI) -> utf8

print Unicode::Japanese->new($str,'sjis-imode')->get;

# convert FULL WIDTH CHARACTER(utf8) -> HALF WIDTH CHARACTER(utf8)

print Unicode::Japanese->new($str)->z2h->get;

DESCRIPTION

Unicode::Japanese is a module for converting between Japanese character sets. Its special features are as follows.

  • The instance holds its string as UTF-8.

  • It can be used just by copying Japanese.pm, because it does not use XS.

  • It supports conversion from FULL WIDTH CHARACTER to HALF WIDTH CHARACTER and from HALF WIDTH CHARACTER to FULL WIDTH CHARACTER.

  • It safely handles "EMOJI" of cellular phones (DoCoMo i-mode, ASTEL dot-i, and J-PHONE J-Sky) in a DB by mapping them into the Unicode private use area.

  • It lets the same "EMOJI" be converted between the different cellular phone systems.

  • It treats SJIS as MS-CP932. (An SJIS string is generally interpreted in the MS-CP932 CHARSET on Windows and in the Shift_JIS CHARSET on other systems. Unicode::Japanese interprets SJIS strings in the MS-CP932 CHARSET, the same as Windows.)

  • When converting Unicode -> SJIS (EUC/JIS), a character that cannot be converted to SJIS is converted to &#xxxx; (except "EMOJI").

METHODS

$s = Unicode::Japanese->new($str [, $icode [, $encode]])

Create a new instance of Unicode::Japanese.

If parameters are specified, they are passed on to the set method.

$s->set($str [, $icode [, $encode]])

$str: a string

$icode: a character set; optional, defaults to 'utf8' when omitted

$encode: a character encoding; optional

Sets a string in the instance. If the charset is omitted, the string is read as UTF-8.

If you specify a charset, choose one of the following: 'jis', 'sjis', 'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'utf16-be', 'utf16-le', 'utf32', 'utf32-be', 'utf32-le', 'ascii', 'binary', 'sjis-imode', 'sjis-doti', 'sjis-jsky'.

To detect the charset automatically, specify 'auto'.

The only character encoding that can be specified is 'base64'. If 'base64' is specified, the string is base64-decoded before it is stored in the instance.

To decode binary data, specify 'binary' as the charset.

When 'auto' is specified, the charset is detected by the getcode method.

With 'sjis-imode' or 'sjis-doti', &#xxxx; is converted to "EMOJI".

$str = $s->get

$str: a string(UTF-8)

Returns the string as UTF-8.

$code = $s->getcode($str)

$str: a string

$code: the name of the detected charset

Detects the charset of a string. Caution: this examines the string passed as $str, not the string held by the instance.

The charset is detected automatically by the following algorithm.

  1. If a UTF-32 BOM is found, the charset is utf32.

  2. If a UTF-16 BOM is found, the charset is utf16.

  3. If the string is valid as UTF-32BE, the charset is utf32-be.

  4. If the string is valid as UTF-32LE, the charset is utf32-le.

  5. If the string contains no non-ASCII characters, the charset is ascii. (Control codes other than escape sequences are not counted as non-ASCII characters.)

  6. If the string contains JIS escape sequences, the charset is jis.

  7. If the string contains "J-PHONE_EMOJI", the charset is sjis-sky.

  8. If the string is valid as EUC, the charset is euc.

  9. If the string is valid as SJIS, the charset is sjis.

  10. If the string is valid as SJIS with "EMOJI" of i-mode, the charset is sjis-imode.

  11. If the string is valid as SJIS with "EMOJI" of dot-i, the charset is sjis-doti.

  12. If the string is valid as UTF-8, the charset is utf8.

  13. If none of the above applies, the charset is unknown.

Because of this algorithm, please note the following.

  • A UTF-8 string may be detected as SJIS.

  • UCS2 is not detected automatically.

  • UTF-16 is detected automatically only when it includes a BOM.

  • "EMOJI" of cellular phones are detected when they are written directly as binary, not in the ASCII &#xxxx; style.

    "EMOJI" of cellular phones are not detected automatically when they are written in the ASCII &#xxxx; style. (If they are written only as &#xxxxx;, the charset reported by getcode does not take them into account, so if you convert according to the result of getcode, the "EMOJI" will not be converted correctly.)
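The ordering of the steps matters, and a small sketch may make that concrete. The code below is not the module's implementation: guess_charset_sketch is a hypothetical helper that uses only core Perl and covers only the BOM, ascii, and utf8 steps, so JIS, EUC, SJIS, and BOM-less UTF-32 input is not classified the way getcode would classify it. Treating ESC as a non-ASCII byte is an assumption based on the note in step 5.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of Unicode::Japanese.
sub guess_charset_sketch {
    my ($bytes) = @_;

    # UTF-32 has to be tested before UTF-16, because the UTF-32LE BOM
    # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    return 'utf32' if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;
    return 'utf16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;

    # Assumption: ESC counts as non-ASCII because it starts JIS sequences.
    return 'ascii' if $bytes !~ /[\x1B\x80-\xFF]/;

    # Strict UTF-8 check: decode() dies on malformed input under FB_CROAK.
    # A copy is passed because decode() may modify its argument.
    my $copy = $bytes;
    return 'utf8' if eval { decode('UTF-8', $copy, Encode::FB_CROAK()); 1 };

    return 'unknown';
}

# A UTF-32LE BOM is reported as utf32 even though it starts with FF FE.
print guess_charset_sketch("\xFF\xFE\x00\x00"), "\n";
```

Swapping the first two tests would make every UTF-32LE string with a BOM come out as utf16, which is presumably why the list checks UTF-32 first.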

$str = $s->conv($ocode, $encode)

$ocode: the output CHARSET; choose one of 'jis', 'sjis', 'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'binary'

$encode: a character encoding; optional

$str: a string (in the specified CHARSET)

Returns the string converted to the specified CHARSET.

The only character encoding that can be specified is 'base64'. If 'base64' is specified, the base64-encoded string is returned.
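To show what a base64 transfer encoding over a converted string amounts to, here is a minimal sketch with core modules only. It does not call Unicode::Japanese; it assumes that the CHARSET conversion happens first and base64 is applied to the resulting bytes, which the text above does not spell out.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use MIME::Base64 qw(encode_base64 decode_base64);

# Three FULL WIDTH CHARACTERS, written as code points so the example
# does not depend on the encoding of the source file.
my $text = "\x{65E5}\x{672C}\x{8A9E}";

# Convert to CP932 bytes, then base64-encode them (no trailing newline).
my $wire = encode_base64(encode('cp932', $text), '');

# The reverse direction: base64-decode, then interpret the bytes as CP932.
my $back = decode('cp932', decode_base64($wire));

print $back eq $text ? "round trip ok\n" : "mismatch\n";
```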

$s->tag2bin

Replaces each &#xxxxx; in the string with the character it denotes.

$s->z2h

Convert FULL WIDTH CHARACTERS to HALF WIDTH CHARACTERS.

$s->h2z

Convert HALF WIDTH CHARACTERS to FULL WIDTH CHARACTERS.

$s->hira2kata

Convert HIRAGANA to KATAKANA.

$s->kata2hira

Convert KATAKANA to HIRAGANA.
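The idea behind the two kana conversions can be sketched in core Perl. The helpers below are hypothetical and work on decoded character strings, whereas the module holds UTF-8 internally; its own tables may also cover characters outside the ranges used here.

```perl
use strict;
use warnings;

# HIRAGANA U+3041..U+3096 sit exactly 0x60 below KATAKANA U+30A1..U+30F6,
# so a range-to-range transliteration is enough for the basic letters.
sub hira2kata_sketch {
    my ($s) = @_;
    $s =~ tr/\x{3041}-\x{3096}/\x{30A1}-\x{30F6}/;
    return $s;
}

sub kata2hira_sketch {
    my ($s) = @_;
    $s =~ tr/\x{30A1}-\x{30F6}/\x{3041}-\x{3096}/;
    return $s;
}

# "hiragana" written in HIRAGANA, given as code points.
my $hira = "\x{3072}\x{3089}\x{304C}\x{306A}";
my $kata = hira2kata_sketch($hira);

printf "U+%04X\n", ord $kata;    # first character after conversion
```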

$str = $s->jis

$str: a string(JIS)

Returns the string in the JIS (ISO-2022-JP) CHARSET.

$str = $s->euc

$str: a string(EUC)

Returns the string in the EUC CHARSET.

$str = $s->utf8

$str: a string(UTF-8)

Returns the string in the UTF-8 CHARSET.

$str = $s->ucs2

$str: a string (UCS2)

Returns the string in the UCS2 CHARSET.

$str = $s->ucs4

$str: a string(UCS4)

Returns the string in the UCS4 CHARSET.

$str = $s->utf16

$str: a string(UTF-16)

Returns the string in the UTF-16 CHARSET. The result is big-endian and has no BOM.

$str = $s->sjis

$str: a string(SJIS)

Returns the string in the SJIS (MS-CP932) CHARSET.

$str = $s->sjis_imode

$str: a string(SJIS/imode_EMOJI)

Returns the string in the SJIS CHARSET with "EMOJI" for i-mode.

$str = $s->sjis_doti

$str: a string(SJIS/dot-EMOJI)

Returns the string in the SJIS CHARSET with "EMOJI" for dot-i.

$str = $s->sjis_sky

$str: a string(SJIS/J-SKY_EMOJI)

Returns the string in the SJIS CHARSET with "EMOJI" for J-SKY.

\@str = $s->strcut($len)

$len: the length at which to cut

\@str: a reference to an array of strings

Splits the string into an array of shorter strings, each no longer than $len. As a rule, each element of the array has the length $len.

$len = $s->strlen

$len: the width of the string

With 'length()' on UTF-8, a FULL WIDTH CHARACTER counts as 3 bytes. This method counts a FULL WIDTH CHARACTER as 2, as is usual for SJIS.
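The difference between the two counts can be checked with core Perl alone. This sketch does not call strlen; it only compares the byte lengths of the same text in UTF-8 and in CP932, which is the convention strlen is described as following.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Three FULL WIDTH CHARACTERS (U+65E5 U+672C U+8A9E), given as code
# points so the example does not depend on the source file's encoding.
my $text = "\x{65E5}\x{672C}\x{8A9E}";

my $utf8_bytes = length(encode('UTF-8', $text));   # 3 bytes per character
my $sjis_bytes = length(encode('cp932', $text));   # 2 bytes per character

print "UTF-8: $utf8_bytes bytes, CP932: $sjis_bytes bytes\n";
# prints "UTF-8: 9 bytes, CP932: 6 bytes"
```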

$s->join_csv(\@values);

@values: data array

Converts the array to a CSV string and stores it in the instance. A LF (Line Feed) is added to the end of the string.

\@values = $s->split_csv;

@values: data array

Parses the string stored in the instance as CSV and returns it as an array. A LF (Line Feed) at the end of the string is always ignored.

UNICODE MAPPING

Characters are mapped to Unicode as follows.

SJIS

Mapped as MS-CP932, using the mapping table at the following URL.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

A character that cannot be mapped from Unicode to SJIS is converted to the &#xxxxx; format. However, "EMOJI" of cellular phones are converted to "?".
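A comparable fallback exists in Perl's core Encode module, which can serve as a rough illustration. The sketch below is not how Unicode::Japanese is implemented, and Encode's FB_HTMLCREF always writes decimal references, which may or may not match the exact &#xxxxx; form the module produces.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+65E5 has a CP932 mapping; the Hangul syllable U+D55C does not.
my $text = "\x{65E5}\x{D55C}";

# FB_HTMLCREF replaces each unmappable character with a decimal
# character reference; LEAVE_SRC keeps $text itself unmodified.
my $sjis = encode('cp932', $text,
                  Encode::FB_HTMLCREF() | Encode::LEAVE_SRC());

# The mapped character becomes the two bytes 93 FA, and the unmapped
# one becomes the ASCII text "&#54620;".
print unpack('H*', $sjis), "\n";
```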

EUC/JIS

The string is first converted to SJIS and then mapped to Unicode. Consequently, a character that cannot be converted to SJIS is not mapped correctly.

DoCoMo i-mode

The portion of F800 - F9FF that contains "EMOJI" is mapped to U+0FF800 - U+0FF9FF.

ASTEL dot-i

The portion of F000 - F4FF that contains "EMOJI" is mapped to U+0FF000 - U+0FF4FF.

J-PHONE J-SKY

"J-SKY_EMOJI" are mapped as follows. A "J-SKY_EMOJI" consists of the escape sequence "\e\$" (\x1b\x24), the first byte, the second byte, and "\x0f". When consecutive "EMOJI" have the same first byte, the sequence can be compressed by writing only the second bytes one after another.

The first byte and the second byte are treated as a pair forming one character, and 4500 - 47FF is mapped to U+0FFB00 - U+0FFDFF.

Unicode::Japanese compresses "J-SKY_EMOJI" automatically when consecutive "EMOJI" have the same first byte.
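The three ranges above each amount to adding a fixed offset to the carrier code. The sketch below derives those offsets from the quoted ranges; the hash keys and the sub name are hypothetical, and for simplicity it maps every code in a range, whereas the text says only the portion containing "EMOJI" is mapped.

```perl
use strict;
use warnings;

# Offsets computed from the range endpoints, e.g. 0x0FF800 - 0xF800.
my %EMOJI_RANGE = (
    imode => { first => 0xF800, last => 0xF9FF, offset => 0x0F0000 },
    doti  => { first => 0xF000, last => 0xF4FF, offset => 0x0F0000 },
    jsky  => { first => 0x4500, last => 0x47FF, offset => 0x0FB600 },
);

# Returns the Unicode code point for a carrier code, or undef when the
# code lies outside the range for that carrier.
sub emoji_to_codepoint_sketch {
    my ($kind, $code) = @_;
    my $r = $EMOJI_RANGE{$kind};
    return undef unless $r;
    return undef if $code < $r->{first} || $code > $r->{last};
    return $code + $r->{offset};
}

printf "U+%06X\n", emoji_to_codepoint_sketch('imode', 0xF89F);
# prints "U+0FF89F"
```

All three target ranges fall inside U+0F0000 - U+0FFFFD, which is a Unicode private use plane, so the mapped values do not collide with assigned characters.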

BUGS

  • EUC and JIS strings cannot be converted correctly when they contain characters that cannot be converted to SJIS, because they are converted to SJIS first and then to UTF-8.

  • The Japanese.pm file will be broken if it is transferred in the ASCII mode of FTP, because it has binary data at its end.

AUTHOR INFORMATION

Copyright 2001, SANO Taku (SAWATARI Mikage). All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Address bug reports and comments to: mikage@cpan.org. Thank you.

CREDITS

Thanks very much to:

Nao NAKAYAMA
