NAME
ShiftJIS::String - functions to manipulate Shift-JIS strings
SYNOPSIS
use ShiftJIS::String;
ShiftJIS::String::substr($str, ShiftJIS::String::index($str, $substr));
DESCRIPTION
This module provides functions that emulate the corresponding CORE
functions and help you manipulate multibyte character sequences in Shift-JIS.
* 'Hankaku' and 'Zenkaku' mean 'halfwidth' and 'fullwidth' characters in Japanese, respectively.
FUNCTIONS
Check Whether the String is Legal
issjis(LIST)
-
Returns a boolean indicating whether all the strings in the parameter list are legally encoded in Shift-JIS.
Returns false if LIST includes one or more invalid strings.
Length
Reverse
strrev(STRING)
-
Returns a reversed string, i.e., a string that has all the characters of STRING but in the opposite order.
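The difference between reversing by characters and reversing by bytes can be sketched in core Perl. This is only an illustration of the semantics described above, not the module's implementation; the helper name sjis_reverse is hypothetical, and the character pattern is the one given in CAVEAT.

```perl
use strict;
use warnings;

# One legal Shift-JIS character, following the regular expression in CAVEAT.
my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper (not part of the module): reverse by characters.
sub sjis_reverse {
    my ($str) = @_;
    my @chars = $str =~ /\G$SJIS_CHAR/g;   # split into whole characters
    return join '', reverse @chars;
}

my $str = "a\x88\xC0";                  # 'a' followed by one double-byte character

my $by_char = sjis_reverse($str);       # "\x88\xC0a": the double-byte character is intact
my $by_byte = scalar reverse $str;      # "\xC0\x88a": its two bytes have been swapped
```

The byte-wise result is a different character sequence, because the swapped bytes no longer form the original double-byte character.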
Search
index(STRING, SUBSTR)
index(STRING, SUBSTR, POSITION)
-
Returns the position of the first occurrence of SUBSTR in STRING at or after POSITION. If POSITION is omitted, starts searching from the beginning of the string.
If the substring is not found, returns -1.
rindex(STRING, SUBSTR)
rindex(STRING, SUBSTR, POSITION)
-
Returns the position of the last occurrence of SUBSTR in STRING. If POSITION is specified, returns the last occurrence at or before POSITION.
If the substring is not found, returns -1.
strspn(STRING, SEARCHLIST)
-
Returns the position of the first occurrence of any character that is not contained in SEARCHLIST.
If STRING consists only of characters in SEARCHLIST, the returned value equals the length of STRING.
SEARCHLIST does not interpret character ranges, but you can build one with mkrange().
strspn("+0.12345*12", "+-.0123456789"); # returns 8 (at '*')
strcspn(STRING, SEARCHLIST)
-
Returns the position of the first occurrence of any character contained in SEARCHLIST.
If STRING does not contain any character in SEARCHLIST, the returned value equals the length of STRING.
SEARCHLIST does not interpret character ranges, but you can build one with mkrange().
rspan(STRING, SEARCHLIST)
-
Searches for the last occurrence of any character that is not contained in SEARCHLIST.
If such a character is found, returns the position following it; otherwise (every character in STRING is contained in SEARCHLIST), it returns 0 (as the first position of the string).
SEARCHLIST does not interpret character ranges, but you can build one with mkrange().
rcspan(STRING, SEARCHLIST)
-
Searches for the last occurrence of any character that is contained in SEARCHLIST.
If such a character is found, returns the position following it; otherwise (no character in STRING is contained in SEARCHLIST), it returns 0 (as the first position of the string).
SEARCHLIST does not interpret character ranges, but you can build one with mkrange().
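Why positions must be counted in characters rather than bytes can be shown with a small core-Perl sketch. It is an assumption-laden illustration of the semantics of index(), not the module's code; sjis_index is a hypothetical helper, and the character pattern comes from CAVEAT.

```perl
use strict;
use warnings;

# One legal Shift-JIS character, following the regular expression in CAVEAT.
my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper (not part of the module): search on character boundaries.
sub sjis_index {
    my ($str, $substr, $pos) = @_;
    my @chars = $str    =~ /\G$SJIS_CHAR/g;
    my @sub   = $substr =~ /\G$SJIS_CHAR/g;
    $pos = 0 if !defined $pos || $pos < 0;
    for my $i ($pos .. @chars - @sub) {
        # Compare the run of characters starting at $i with the substring.
        return $i if join('', @chars[$i .. $i + $#sub]) eq $substr;
    }
    return -1;
}

my $str = "\x83\x41B";   # KATAKANA A (bytes 0x83 0x41), then 'B'

my $false_hit = index($str, "A");        # 1: the trailing byte 0x41 is mistaken for 'A'
my $no_hit    = sjis_index($str, "A");   # -1: there is no character 'A' in the string
my $byte_pos  = index($str, "B");        # 2: counted in bytes
my $char_pos  = sjis_index($str, "B");   # 1: counted in characters
```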
Trimming
trim(STRING)
trim(STRING, SEARCHLIST)
trim(STRING, SEARCHLIST, USE_COMPLEMENT)
-
Erases characters in SEARCHLIST from the beginning and the end of STRING and returns the result.
If USE_COMPLEMENT is true, erases characters that are not contained in SEARCHLIST.
If SEARCHLIST is omitted (or undef), the list of whitespace characters is used, i.e., "\t", "\n", "\r", "\f", "\x20" (SP), and "\x81\x40" (IDSP).
SEARCHLIST does not interpret character ranges, but you can build one with mkrange(), like trim($string, mkrange("\x00-\x20")).
ltrim(STRING)
ltrim(STRING, SEARCHLIST)
ltrim(STRING, SEARCHLIST, USE_COMPLEMENT)
-
Erases characters in SEARCHLIST from the beginning of STRING and returns the result.
If USE_COMPLEMENT is true, erases characters that are not contained in SEARCHLIST.
SEARCHLIST does not interpret character ranges, but you can build one with mkrange().
rtrim(STRING)
rtrim(STRING, SEARCHLIST)
rtrim(STRING, SEARCHLIST, USE_COMPLEMENT)
-
Erases characters in SEARCHLIST from the end of STRING and returns the result.
If USE_COMPLEMENT is true, erases characters that are not contained in SEARCHLIST.
SEARCHLIST does not interpret character ranges, but you can build one with mkrange().
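A byte-oriented regular expression is not a safe substitute for these functions, because the bytes of IDSP ("\x81\x40") can also appear across a character boundary. The following core-Perl sketch illustrates this with the default whitespace list; sjis_rtrim is a hypothetical helper written for this example and is not the module's implementation.

```perl
use strict;
use warnings;

# One legal Shift-JIS character, following the regular expression in CAVEAT.
my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# The default whitespace list described for trim().
my %IS_SPACE = map { ($_ => 1) } "\t", "\n", "\r", "\f", "\x20", "\x81\x40";

# Hypothetical helper (not part of the module): trim trailing whitespace
# one whole character at a time.
sub sjis_rtrim {
    my ($str) = @_;
    my @chars = $str =~ /\G$SJIS_CHAR/g;
    pop @chars while @chars && $IS_SPACE{ $chars[-1] };
    return join '', @chars;
}

my $trimmed = sjis_rtrim("abc \x81\x40");   # "abc": SP and IDSP are removed

# A double-byte character ending in 0x81, followed by '@' (0x40).
my $tricky = "\x8A\x81\x40";
my $safe   = sjis_rtrim($tricky);           # unchanged: it does not end in whitespace
(my $naive = $tricky) =~ s/(?:[\t\n\r\f\x20]|\x81\x40)+\z//;
# $naive is now "\x8A": the byte pattern matched across a character boundary
```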
Substring
substr(STRING or SCALAR REF, OFFSET)
substr(STRING or SCALAR REF, OFFSET, LENGTH)
substr(SCALAR, OFFSET, LENGTH, REPLACEMENT)
-
It works like CORE::substr, but uses the character semantics of Shift-JIS.
If REPLACEMENT is specified as the fourth parameter, replaces part of the SCALAR and returns what was there before.
You can utilize the lvalue reference that is returned if a reference to a scalar variable is used as the first argument.
${ &substr(\$str,$off,$len) } = $replace;
works like
CORE::substr($str,$off,$len) = $replace;
The returned lvalue is not aware of Shift-JIS, so successive assignments may cause unexpected results.
If you are not sure, get the lvalue anew before each assignment.
Split
strsplit(SEPARATOR, STRING)
strsplit(SEPARATOR, STRING, LIMIT)
-
This function emulates CORE::split, but splits on the SEPARATOR string, not by a pattern. If not in list context, it only returns the number of fields found and does not split into the @_ array.
If an empty string is specified as SEPARATOR, splits the specified string into characters (similarly to CORE::split //, STRING, LIMIT).
strsplit('', 'This is Perl.', 7); # ('T', 'h', 'i', 's', ' ', 'i', 's Perl.')
If an undefined value is specified as SEPARATOR, splits the specified string on whitespace characters (including IDEOGRAPHIC SPACE). Leading whitespace characters do not produce any field (similarly to CORE::split ' ', STRING, LIMIT).
strsplit(undef, ' This is Perl.'); # ('This', 'is', 'Perl.')
Comparison
strcmp(LEFT-STRING, RIGHT-STRING)
-
Returns 1 (when LEFT-STRING is greater than RIGHT-STRING), 0 (when LEFT-STRING is equal to RIGHT-STRING), or -1 (when LEFT-STRING is less than RIGHT-STRING).
The order is roughly as follows:
JIS X 0201 Roman, JIS X 0201 Kana, then JIS X 0208 Kanji (Zenkaku).
For example,
0x41 ('A') is less than 0xB1 (HANKAKU KATAKANA A).
0xB1 is less than 0x8341 (KATAKANA A).
0x8341 is less than 0x8383 (KATAKANA SMALL YA).
0x8383 is less than 0x83B1 (GREEK CAPITAL TAU).
Caveat! Compare the 2nd and the 4th examples. Byte "\xB1" is less than byte "\x83" as a leading byte, but greater as a trailing byte. In short, the binary ordering does not agree with the Shift-JIS codepoint order.
strEQ(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is equal to RIGHT-STRING.
Note: strEQ is an expensive equivalent of CORE's eq operator.
strNE(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is not equal to RIGHT-STRING.
Note: strNE is an expensive equivalent of CORE's ne operator.
strLT(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is less than RIGHT-STRING.
strLE(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is less than or equal to RIGHT-STRING.
strGT(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is greater than RIGHT-STRING.
strGE(LEFT-STRING, RIGHT-STRING)
-
Returns a boolean indicating whether LEFT-STRING is greater than or equal to RIGHT-STRING.
strxfrm(STRING)
-
Returns a string transformed so that CORE's cmp can be used for binary comparisons (NOT the length of the transformed string).
I.e., strxfrm($a) cmp strxfrm($b) is equivalent to strcmp($a, $b), as long as your cmp doesn't use any locale other than that of Perl.
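One way to obtain such a transformation is to pad every single-byte character with a leading "\x00", so that each character occupies two bytes and a plain binary cmp follows the Shift-JIS codepoint order. The sketch below uses only core Perl; sjis_xfrm is a hypothetical helper, and whether the module's strxfrm() works this way internally is an assumption.

```perl
use strict;
use warnings;

# One legal Shift-JIS character, following the regular expression in CAVEAT.
my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper (not part of the module): make every character two
# bytes wide so that byte order equals Shift-JIS codepoint order.
sub sjis_xfrm {
    my ($str) = @_;
    return join '', map { length($_) == 1 ? "\x00$_" : $_ }
                    $str =~ /\G$SJIS_CHAR/g;
}

# 'A', HANKAKU KATAKANA A, KATAKANA A, in shuffled order.
my @list = ("\x83\x41", "\xB1", "A");

my @binary = sort @list;   # ("A", "\x83\x41", "\xB1"): kana order is wrong
my @sjis   = sort { sjis_xfrm($a) cmp sjis_xfrm($b) } @list;
                           # ("A", "\xB1", "\x83\x41"): codepoint order
```

When many strings are sorted, transforming each string once and comparing the cached results avoids repeating the transformation inside the sort block.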
Character Range
mkrange(EXPR, EXPR)
-
Returns the character list (or, when not in list context, a concatenated string) obtained by parsing the specified character range.
A character range is specified with a '-' (HYPHEN-MINUS). The backslashed combinations '\-' and '\\' are used instead of the characters '-' and '\', respectively. A hyphen at the beginning or end of the range is also evaluated as the hyphen itself.
For example, mkrange('+\-0-9a-fA-F') returns ('+', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f', 'A', 'B', 'C', 'D', 'E', 'F').
The order of Shift-JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC.
If a true value is specified as the second parameter, reverse character ranges such as '9-0' and 'Z-A' can be used; otherwise, a reverse character range causes the function to croak.
Transliteration
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
strtr(STRING or SCALAR REF, SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)
-
Transliterates all occurrences of the characters found in the search list with the corresponding character in the replacement list.
If a reference to a scalar variable is specified as the first argument, returns the number of characters replaced or deleted; otherwise, returns the transliterated string and the specified string is unaffected.
SEARCHLIST and REPLACEMENTLIST
Character ranges (internally utilizing mkrange()) are supported.
If the REPLACEMENTLIST is empty, the SEARCHLIST is replicated.
If the replacement list is shorter than the search list, the final character in the replacement list is replicated until it is long enough (but this works differently when the 'd' modifier is used).
MODIFIER
c   Complement the SEARCHLIST.
d   Delete found but unreplaced characters.
s   Squash duplicate replaced characters.
h   Return a histogram as a hash (or a hashref in scalar context).
R   Do not use character ranges.
r   Allow the use of reverse character ranges.
o   Cache the conversion table internally.
strtr(\$str, " \x81\x40\n\r\t\f", '', 'd'); # deletes all whitespace characters including IDEOGRAPHIC SPACE.
If the 'h' modifier is specified, returns a histogram as a hash (or a hashref in scalar context; key: a character as a string, value: its count), whether the first argument is a reference or not. If you want to get the histogram and the modified string at once, pass a reference as the first argument and use its value afterwards.
If the 'R' modifier is specified, '-' is not evaluated as a meta character but as HYPHEN-MINUS itself, as in tr'''. Compare:
strtr("90 - 32 = 58", "0-9", "A-J"); # output: "JA - DC = FI"
strtr("90 - 32 = 58", "0-9", "A-J", "R"); # output: "JA - 32 = 58"
# cf. ($str = "90 - 32 = 58") =~ tr'0-9'A-J'; # '0' to 'A', '-' to '-', and '9' to 'J'.
If the 'r' modifier is specified, you are allowed to use reverse character ranges. For example, strtr($str, "0-9", "9-0", "r") is equivalent to strtr($str, "0123456789", "9876543210").
PATTERN and TOPATTERN
By use of PATTERN and TOPATTERN, you can transliterate the string using lists containing some multi-character substrings.
If called with four arguments, SEARCHLIST, REPLACEMENTLIST, and STRING are split characterwise.
If called with five arguments, a multi-character substring that matches PATTERN in SEARCHLIST, REPLACEMENTLIST, or STRING is regarded as a transliteration unit.
If both PATTERN and TOPATTERN are specified, a multi-character substring that either matches PATTERN in SEARCHLIST or STRING, or matches TOPATTERN in REPLACEMENTLIST, is regarded as a transliteration unit.
print strtr("Caesar Aether Goethe", "aeoeueAeOeUe", "&auml;&ouml;&uuml;&Auml;&Ouml;&Uuml;", "", "[aouAOU]e", "&[aouAOU]uml;");
# output: C&auml;sar &Auml;ther G&ouml;the
LISTs as Anonymous Arrays
Instead of specifying PATTERN and TOPATTERN, you can use anonymous arrays as SEARCHLIST and/or REPLACEMENTLIST as follows.
print strtr("Caesar Aether Goethe", [qw/ae oe ue Ae Oe Ue/], [qw/&auml; &ouml; &uuml; &Auml; &Ouml; &Uuml;/]);
Caching the conversion table
If the 'o' modifier is specified, the conversion table is cached internally. E.g.,
foreach (@strings) { print strtr($_, $from_list, $to_list, 'o'); }
will be almost as efficient as this:
$closure = trclosure($from_list, $to_list);
foreach (@strings) { print &$closure($_); }
You can use whichever you like.
Without 'o',
foreach (@strings) { print strtr($_, $from_list, $to_list); }
will be very slow, since the conversion table is rebuilt every time the function is called.
Generation of the Closure to Transliterate
trclosure(SEARCHLIST, REPLACEMENTLIST)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN)
trclosure(SEARCHLIST, REPLACEMENTLIST, MODIFIER, PATTERN, TOPATTERN)
-
Returns a closure to transliterate the specified string. The return value is just a code reference, not a blessed object. By using this code reference, you can save time, as you need not specify the parameter list every time.
The functionality of the closure made by trclosure() is equivalent to that of strtr(). In fact, strtr() calls trclosure() internally and uses the returned closure.
Case of the Alphabet
toupper(STRING)
toupper(SCALAR REF)
-
Returns an uppercased string of STRING. Converts only half-width Latin characters a-z to A-Z.
If a reference to a scalar variable is specified as the first argument, the string it refers to is uppercased and the number of characters replaced is returned.
tolower(STRING)
tolower(SCALAR REF)
-
Returns a lowercased string of STRING. Converts only half-width Latin characters A-Z to a-z.
If a reference to a scalar variable is specified as the first argument, the string it refers to is lowercased and the number of characters replaced is returned.
Conversion between hiragana and katakana
If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, the converted string is returned and the specified string is unaffected.
- Note
-
The conversion between a voiced (or semi-voiced) hiragana or katakana (a single character) and halfwidth katakana with a voiced or semi-voiced mark (a sequence of two characters) is counted as 1. Similarly, the conversion between hiragana VU, represented by two characters (hiragana U + voiced mark), and katakana VU or halfwidth katakana VU is counted as 1.
Conversion concerning halfwidth katakana includes halfwidth symbols: HALFWIDTH IDEOGRAPHIC FULL STOP, HALFWIDTH LEFT CORNER BRACKET, HALFWIDTH RIGHT CORNER BRACKET, HALFWIDTH IDEOGRAPHIC COMMA, HALFWIDTH KATAKANA MIDDLE DOT, HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK, HALFWIDTH KATAKANA VOICED SOUND MARK, HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK.
Conversion between hiragana and katakana includes those between hiragana iteration marks and katakana iteration marks.
Hiragana WI, WE, small WA and katakana WI, WE, small WA, small KA, small KE will be regarded as hiragana I, E, WA and katakana I, E, WA, KA, KE if the fallback conversion is necessary.
kanaH2Z(STRING)
kanaH2Z(SCALAR REF)
-
Converts Halfwidth Katakana to Katakana. Hiragana are not affected.
kataH2Z(STRING)
kataH2Z(SCALAR REF)
-
Converts Halfwidth Katakana to Katakana. Hiragana are not affected.
Note: kataH2Z is an alias of kanaH2Z.
hiraH2Z(STRING)
hiraH2Z(SCALAR REF)
-
Converts Halfwidth Katakana to Hiragana. Katakana are not affected.
kataZ2H(STRING)
kataZ2H(SCALAR REF)
-
Converts Katakana to Halfwidth Katakana. Hiragana are not affected.
kanaZ2H(STRING)
kanaZ2H(SCALAR REF)
-
Converts Hiragana to Halfwidth Katakana, and Katakana to Halfwidth Katakana.
hiraZ2H(STRING)
hiraZ2H(SCALAR REF)
-
Converts Hiragana to Halfwidth Katakana. Katakana are not affected.
hiXka(STRING)
hiXka(SCALAR REF)
-
Converts Hiragana to Katakana and Katakana to Hiragana at once. Halfwidth Katakana are not affected.
hi2ka(STRING)
hi2ka(SCALAR REF)
-
Converts Hiragana to Katakana. Halfwidth Katakana are not affected.
ka2hi(STRING)
ka2hi(SCALAR REF)
-
Converts Katakana to Hiragana. Halfwidth Katakana are not affected.
Conversion of Whitespace Characters
If a reference to a scalar variable is specified as the first argument, the string it refers to is converted and the number of characters replaced is returned. Otherwise, the converted string is returned and the specified string is unaffected.
spaceH2Z(STRING)
spaceH2Z(SCALAR REF)
-
Converts "\x20" (space) to "\x81\x40" (ideographic space).
spaceZ2H(STRING)
spaceZ2H(SCALAR REF)
-
Converts "\x81\x40" (ideographic space) to "\x20" (space).
CAVEAT
A legal Shift-JIS character in this module must match the following regular expression:
[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]
Any string from an external source should be checked with the issjis() function, unless you know it is certainly encoded in Shift-JIS.
Use of an illegal Shift-JIS string may lead to odd results.
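The regular expression above can be anchored to test a whole string. The following core-Perl sketch shows the idea; looks_like_sjis is a hypothetical helper written for this example, and in real code the module's issjis() is the function to use.

```perl
use strict;
use warnings;

# One legal Shift-JIS character, exactly as defined above.
my $SJIS_CHAR = qr/[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]/;

# Hypothetical helper (not part of the module): true if the whole string
# is a sequence of legal characters.
sub looks_like_sjis {
    my ($str) = @_;
    return $str =~ /\A(?:$SJIS_CHAR)*\z/ ? 1 : 0;
}

my $ok        = looks_like_sjis("abc\x83\x41");   # 1: ASCII plus one double-byte character
my $truncated = looks_like_sjis("abc\x83");       # 0: leading byte without a trailing byte
my $bad_trail = looks_like_sjis("\x83\x20");      # 0: 0x20 is not a valid trailing byte
my $bad_byte  = looks_like_sjis("\xA0");          # 0: 0xA0 is outside every legal range
```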
Some Shift-JIS double-byte characters have a trailing byte in the range of [\x40-\x7E], viz.,
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
The Perl lexer (perhaps) takes no special care of these bytes, so they sometimes cause trouble. For example, a quoted literal ending with a double-byte character whose trailing byte is 0x5C causes a fatal error, since the trailing byte 0x5C escapes the closing quote.
Such a problem doesn't arise when the string is obtained from an external resource. But writing a script that contains Shift-JIS double-byte characters requires the greatest care.
The use of a single-quoted heredoc, << '', or \xhh meta characters is recommended in order to define a Shift-JIS string literal.
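As a small illustration of the \xhh approach, the sketch below defines a double-byte character whose trailing byte is 0x5C without placing any non-ASCII byte in the source file. The variable names are hypothetical, and the pack() variant is an additional option that is not mentioned in the text above.

```perl
use strict;
use warnings;

# A double-byte character whose trailing byte is 0x5C (the backslash),
# written with \xhh escapes so the script itself stays pure ASCII.
my $char = "\x95\x5C";

# CORE::substr works on bytes here: the second byte really is a backslash.
my $trail = substr($char, 1, 1);
print "trailing byte is a backslash\n" if $trail eq "\\";

# Alternative: build the same bytes from 16-bit code values
# ('n' packs an unsigned 16-bit integer in big-endian order).
my $packed = pack('n*', 0x955C, 0x835C);   # "\x95\x5C\x83\x5C"
```

Had the raw bytes been typed inside a single-quoted or double-quoted literal, the final 0x5C would have been read as an escape for the closing quote, which is the failure described above.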
The safe ASCII-graphic characters, [\x21-\x3F], are:
!"#$%&'()*+,-./0123456789:;<=>?
They are preferred as the delimiter of quote-like operators.
BUGS
This module supposes $[ is always equal to 0, never 1.
AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2001-2007, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.