NAME

Unicode::Japanese - Japanese Character Encoding Handler

SYNOPSIS

use Unicode::Japanese;

# convert utf8 -> sjis

print Unicode::Japanese->new($str)->sjis;

# convert sjis -> utf8

print Unicode::Japanese->new($str,'sjis')->get;

# convert sjis (imode_EMOJI) -> utf8

print Unicode::Japanese->new($str,'sjis-imode')->get;

# convert ZENKAKU (utf8) -> HANKAKU (utf8)

print Unicode::Japanese->new($str)->z2h->get;

DESCRIPTION

Module for conversion among Japanese character encodings.

FEATURES

  • The instance stores internal strings in UTF-8.

  • Supports both XS and non-XS operation. Use XS for high performance, or non-XS for ease of use (installation only requires copying Japanese.pm).

  • Supports conversion between ZENKAKU and HANKAKU.

  • Safely handles the "EMOJI" of mobile phones (DoCoMo i-mode, ASTEL dot-i and J-PHONE J-Sky) by mapping them to the Unicode Private Use Area.

  • Supports mutual conversion of equivalent EMOJI images between the different mobile phone standards.

  • Treats Shift_JIS (SJIS) as MS-CP932. (Shift_JIS on MS-Windows (MS-SJIS/MS-CP932) differs from generic Shift_JIS encodings.)

  • When converting Unicode to SJIS (EUC/JIS), characters that cannot be converted to SJIS (except "EMOJI") are escaped in "&#xxxxx;" format.

METHODS

$s = Unicode::Japanese->new($str [, $icode [, $encode]])

Creates a new instance of Unicode::Japanese.

If arguments are specified, they are passed through to the set method.

$s->set($str [, $icode [, $encode]])
$str: string
$icode: character encodings, may be omitted (default = 'utf8')
$encode: ASCII encoding, may be omitted.

Sets a string in the instance. If '$icode' is omitted, the string is considered to be UTF-8.

To specify an encoding, choose from the following: 'jis', 'sjis', 'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'utf16-be', 'utf16-le', 'utf32', 'utf32-be', 'utf32-le', 'ascii', 'binary', 'sjis-imode', 'sjis-doti', 'sjis-jsky'.

'&#xxxxx;' will be converted to "EMOJI" when 'sjis-imode' or 'sjis-doti' is specified.

For automatic encoding detection, you MUST specify 'auto'; the getcode() method is then called automatically.

For ASCII encoding, only 'base64' may be specified. With it, the string will be decoded before storing.

To decode binary, specify 'binary' as the encoding.

$str = $s->get
$str: string (UTF-8)

Gets the string in UTF-8.

$code = $s->getcode($str)
$str: string
$code: character encoding name

Detects the character encoding of $str.

Notice: The encoding of the string stored in the instance is NOT detected.

Character encodings are distinguished by the following algorithm:

  1. If BOM of UTF-32 is found, the encoding is utf32.

  2. If BOM of UTF-16 is found, the encoding is utf16.

  3. If it is in proper UTF-32BE, the encoding is utf32-be.

  4. If it is in proper UTF-32LE, the encoding is utf32-le.

  5. If it contains no non-ASCII characters, the encoding is ascii. (Control codes other than escape sequences are treated as ASCII.)

  6. If it includes ISO-2022-JP(JIS) escape sequences, the encoding is jis.

  7. If it includes "J-PHONE EMOJI", the encoding is sjis-jsky.

  8. If it is in proper EUC, the encoding is euc.

  9. If it is in proper SJIS, the encoding is sjis.

  10. If it is in proper SJIS and "EMOJI" of i-mode, the encoding is sjis-imode.

  11. If it is in proper SJIS and "EMOJI" of dot-i, the encoding is sjis-doti.

  12. If it is in proper UTF-8, the encoding is utf8.

  13. If none above is true, the encoding is unknown.

Regarding the algorithm, pay attention to the following:

  • UTF-8 is occasionally detected as SJIS.

  • UCS2 can NOT be detected automatically.

  • UTF-16 can be detected only when the string has a BOM.

  • "EMOJI" can be detected when it is stored in binary, not in "&#xxxxx;" format. (If it is stored only in "&#xxxxx;" format, getcode() will return an incorrect result. In that case, "EMOJI" will be corrupted.)
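The ordering of the first steps matters: the UTF-32LE BOM (FF FE 00 00) begins with the same two bytes as the UTF-16LE BOM (FF FE), so UTF-32 has to be tested first. The following is a minimal sketch of steps 1, 2 and 5 only, using plain Perl. The helper name guess_by_bom is hypothetical and is not part of Unicode::Japanese; it skips the BOM-less UTF-32 checks of steps 3 and 4 and is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not a module method).
# Expects a byte string; covers only the BOM and ASCII steps listed above.
sub guess_by_bom {
    my ($bytes) = @_;

    # Test UTF-32 before UTF-16: FF FE 00 00 also starts with FF FE.
    return 'utf32' if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;
    return 'utf16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;

    # No byte above 0x7F and no ESC (0x1B): treat as plain ASCII.
    return 'ascii' if $bytes !~ /[\x1B\x80-\xFF]/;

    # Everything else would need the later steps (jis, euc, sjis, utf8, ...).
    return 'unknown';
}
```

Anything reported as 'unknown' here would still have to go through the remaining steps of the list.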

$str = $s->conv($ocode, $encode)
$ocode: output character encoding (Choose from 'jis', 'sjis', 'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'binary')
$encode: ASCII encoding, may be omitted.
$str: string

Gets a string converted to $ocode.

For ASCII encoding, only 'base64' may be specified. With it, the string encoded in base64 will be returned.

$s->tag2bin

Replaces each "&#xxxxx;" substring in the string with the character it represents.

$s->z2h

Converts ZENKAKU to HANKAKU.

$s->h2z

Converts HANKAKU to ZENKAKU.
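For the ASCII range, the ZENKAKU and HANKAKU forms sit at a fixed distance in Unicode: U+FF01 to U+FF5E correspond to U+0021 to U+007E, and the ideographic space U+3000 corresponds to U+0020. The sketch below shows only that part of the idea on decoded character strings. The function names are hypothetical, and the module's own z2h/h2z may cover more characters (such as katakana) and work on its internal UTF-8 representation.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; ASCII range and space only.
sub ascii_z2h {
    my ($chars) = @_;
    $chars =~ tr/\x{FF01}-\x{FF5E}\x{3000}/\x{21}-\x{7E}\x{20}/;
    return $chars;
}

sub ascii_h2z {
    my ($chars) = @_;
    $chars =~ tr/\x{21}-\x{7E}\x{20}/\x{FF01}-\x{FF5E}\x{3000}/;
    return $chars;
}
```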

$s->hira2kata

Converts HIRAGANA to KATAKANA.

$s->kata2hira

Converts KATAKANA to HIRAGANA.
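In Unicode, the hiragana block U+3041 to U+3096 and the katakana block U+30A1 to U+30F6 are laid out in parallel, 0x60 apart, so the basic conversion can be sketched with tr on a decoded character string. The function names below are hypothetical; this is an illustration of the idea, not the module's code, and it ignores HANKAKU katakana.

```perl
use strict;
use warnings;

# Hypothetical helpers for illustration; operate on character strings.
sub simple_hira2kata {
    my ($chars) = @_;
    $chars =~ tr/\x{3041}-\x{3096}/\x{30A1}-\x{30F6}/;
    return $chars;
}

sub simple_kata2hira {
    my ($chars) = @_;
    $chars =~ tr/\x{30A1}-\x{30F6}/\x{3041}-\x{3096}/;
    return $chars;
}
```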

$str = $s->jis

$str: string (JIS)

Gets the string converted to ISO-2022-JP(JIS).

$str = $s->euc

$str: string (EUC)

Gets the string converted to EUC.

$str = $s->utf8

$str: string (UTF-8)

Gets the string converted to UTF-8.

$str = $s->ucs2

$str: string (UCS2)

Gets the string converted to UCS2.

$str = $s->ucs4

$str: string (UCS4)

Gets the string converted to UCS4.

$str = $s->utf16

$str: string (UTF-16)

Gets the string converted to UTF-16(big-endian). BOM is not added.
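To make the byte layout concrete, the sketch below uses Perl's core Encode module, not Unicode::Japanese, to show what big-endian UTF-16 without a BOM looks like next to the BOM-prefixed form. The wrapper names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical wrappers around the core Encode module.
# 'UTF-16BE' writes big-endian code units and no BOM.
sub utf16be_no_bom {
    my ($chars) = @_;
    return encode('UTF-16BE', $chars);
}

# Encode's plain 'UTF-16' encoder prepends the BOM FE FF.
sub utf16_with_bom {
    my ($chars) = @_;
    return encode('UTF-16', $chars);
}
```

For the two characters "A" and U+3042, the BOM-less form is the four bytes 00 41 30 42.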

$str = $s->sjis

$str: string (SJIS)

Gets the string converted to Shift_JIS(MS-SJIS/MS-CP932).

$str = $s->sjis_imode

$str: string (SJIS/imode_EMOJI)

Gets the string converted to SJIS for i-mode.

$str = $s->sjis_doti

$str: string (SJIS/dot-i_EMOJI)

Gets the string converted to SJIS for dot-i.

$str = $s->sjis_jsky

$str: string (SJIS/J-SKY_EMOJI)

Gets the string converted to SJIS for J-SKY.

@str = $s->strcut($len)
$len: number of characters
@str: strings

Splits the string into pieces of length $len.

$len = $s->strlen

$len: visual width of the string

Gets the length of the string. This method is offered as a substitute for Perl's built-in length(). ZENKAKU characters are counted as having a length of 2, regardless of whether the encoding is SJIS or UTF-8.
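A rough way to picture the visual width is to count one column per character and one extra column for each wide character. The sketch below approximates ZENKAKU with the Unicode East_Asian_Width classes Wide and Fullwidth. That classification is an assumption made for illustration; the module's own notion of ZENKAKU may differ, and the function name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: expects a decoded character string.
# Assumption: "ZENKAKU" is approximated by East_Asian_Width W or F.
sub visual_width {
    my ($chars) = @_;
    my $wide = () = $chars =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/g;
    return length($chars) + $wide;
}
```

Under this approximation, HANKAKU katakana such as U+FF76 count as one column and hiragana such as U+3042 count as two.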

$s->join_csv(@values);

@values: data array

Converts the array to a string in CSV format, then stores it in the instance. A newline ("\n") is appended to the end of the string.

@values = $s->split_csv;

@values: data array

Splits the string, treating it as CSV format. Each newline ("\n") is removed before the split.

DESCRIPTION OF UNICODE MAPPING

SJIS

Mapped as MS-CP932. Mapping table in the following URL is used.

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

If a character cannot be mapped to SJIS from Unicode, it will be converted to &#xxxxx; format.

Also, any unmapped character will be converted into "?" when converting to SJIS for mobile phones.
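A comparable escaping behaviour is available in Perl's core Encode module, which can serve as a point of reference. The sketch below is not Unicode::Japanese code: it uses Encode's cp932 encoder with the FB_HTMLCREF fallback, which writes unmappable characters as decimal references. Whether the "&#xxxxx;" form produced by Unicode::Japanese uses the same digits is not stated here, so treat the exact output as specific to Encode. The function name is hypothetical.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical wrapper around the core Encode module.
# Characters without a CP932 mapping become "&#NNNN;" (decimal).
sub to_cp932_with_refs {
    my ($chars) = @_;    # local copy; encode() with a CHECK may modify its input
    return Encode::encode('cp932', $chars, Encode::FB_HTMLCREF);
}
```

With this wrapper, "A" followed by U+3042 and U+1F600 becomes the byte 41, the SJIS pair 82 A0, and the text "&#128512;".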

EUC/JIS

Converted to SJIS and then mapped to Unicode. Any non-SJIS character in the string will not be mapped correctly.

DoCoMo i-mode

The portion of F800 - F9FF that contains "EMOJI" is mapped to U+0FF800 - U+0FF9FF.

ASTEL dot-i

The portion of F000 - F4FF that contains "EMOJI" is mapped to U+0FF000 - U+0FF4FF.
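Both ranges use the same offset: adding 0x0FF000 to the two-byte SJIS value gives the code point, which lands in the plane 15 Supplementary Private Use Area. A minimal sketch of that arithmetic follows. The function name is hypothetical, and the code only restates the ranges given above rather than reproducing the module's tables.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a two-byte SJIS value as an integer
# (e.g. 0xF89F) and returns the private-use code point, or undef
# if the value is outside both EMOJI ranges.
sub emoji_sjis_to_pua {
    my ($code) = @_;
    return $code + 0x0FF000 if $code >= 0xF800 && $code <= 0xF9FF;    # i-mode
    return $code + 0x0FF000 if $code >= 0xF000 && $code <= 0xF4FF;    # dot-i
    return;
}
```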

J-PHONE J-SKY

"J-SKY EMOJI" are encoded as follows: the escape sequence "\e\$" (\x1b\x24), the first byte, the second byte, and "\x0f". A run of consecutive "EMOJI" that share the same first byte may be compressed by writing the first byte once, followed by only the second bytes.

4500 - 47FF is mapped to U+0FFB00 - U+0FFDFF, treating the first and second bytes together as one EMOJI character.

Unicode::Japanese will compress "J-SKY_EMOJI" automatically when the first bytes of a sequence of "EMOJI" are identical.
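The byte layout and the compression rule can be sketched as follows. Each EMOJI is represented as an integer made of its first and second bytes (for example 0x4721). Both function names are hypothetical; this illustrates the description above under that assumption and is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: builds the J-SKY byte sequence for a list of
# two-byte EMOJI values, sharing one escape sequence per run of
# identical first bytes.
sub jsky_encode {
    my @codes = @_;
    my $out   = '';
    my $prev_first;
    for my $code (@codes) {
        my ($first, $second) = ($code >> 8, $code & 0xFF);
        if (!defined $prev_first || $first != $prev_first) {
            $out .= "\x0f" if defined $prev_first;    # close the previous run
            $out .= "\e\$" . chr($first);             # ESC, '$', first byte
            $prev_first = $first;
        }
        $out .= chr($second);                         # second byte only
    }
    $out .= "\x0f" if defined $prev_first;            # close the last run
    return $out;
}

# Hypothetical helper: 4500 - 47FF maps linearly onto U+0FFB00 - U+0FFDFF.
sub jsky_to_pua {
    my ($code) = @_;
    return unless $code >= 0x4500 && $code <= 0x47FF;
    return $code - 0x4500 + 0x0FFB00;
}
```

Two EMOJI with the same first byte, such as 0x4721 and 0x4722, therefore share a single "\e\$G" prefix and a single trailing "\x0f".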

BUGS

  • EUC and JIS strings cannot be converted correctly when they include non-SJIS characters, because they are converted to SJIS before being converted to UTF-8.

  • Some CP932 characters that are not in standard Shift_JIS (e.g. those not in Joyo Kanji) will not be detected or converted.

    When a string includes such non-standard Shift_JIS characters, it will not be detected as SJIS. Also, getcode() and all conversion methods will not work correctly.

  • When using XS, character encoding detection fails for EUC-JP and SJIS (including all EMOJI) strings that include "\e". Also, getcode() and all conversion methods will not work.

  • The Japanese.pm file will be corrupted if transferred in FTP ASCII mode, because it has trailing binary data.

AUTHOR INFORMATION

Copyright 2001-2002 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

Bug reports and comments to: mikage@cpan.org. Thank you.

CREDITS

Thanks very much to:

NAKAYAMA Nao

SUGIURA Tatsuki & Debian JP Project