NAME
ShiftJIS::Regexp - regular expressions in Shift-JIS
SYNOPSIS
use ShiftJIS::Regexp qw(:all);
match($string, '\p{Hiragana}{2}\p{Digit}{2}');
match($string, '\pH{2}\pD{2}');
# these two are equivalent:
DESCRIPTION
This module provides some functions to use regular expressions in Shift-JIS on the byte-oriented perl.
The legal Shift-JIS character in this module must match the following regular expression:
[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]
Therefore this module can't handle the addition of single-byte characters ([\x80\xA0\xFD-\xFF]
) for MacOS Japanese.
To avoid false matching in multibyte encoding, this module uses anchoring technique to ensure each matching position places at the character boundaries. cf. perlfaq6, "How can I match strings with multibyte characters?"
See also "Avoiding Mismatching" below.
Functions
re(PATTERN)
re(PATTERN, MODIFIER)
-
Returns a regular expression parsable by the byte-oriented perl.
PATTERN
is specified as a string.MODIFIER
is specified as a string. Modifiers in the following list are allowed.i case-insensitive pattern (only for ascii alphabets) I case-insensitive pattern (greek, cyrillic, fullwidth latin) j hiragana-katakana-insensitive pattern (but halfwidth katakana are not considered.) s treat string as single line m treat string as multiple lines x ignore whitespace (i.e. [\x20\n\r\t\f]) unless backslashed or inside a character class; but comments are not recognized! o once parsed (not compiled!) and the result is cached internally.
o
modifierwhile (<DATA>) { print replace($_, '(perl)', '<strong>$1</strong>', 'igo'); } is more efficient than while (<DATA>) { print replace($_, '(perl)', '<strong>$1</strong>', 'ig'); } because in the latter case the pattern is parsed every time whenever the function is called.
match(STRING, PATTERN)
match(STRING, PATTERN, MODIFIER)
-
An emulation of
m//
operator aware of Shift-JIS. But, to emulate@list = $string =~ m/PATTERN/g
, the pattern should be parenthesized (capturing parentheses are not added automatically).@list = match($string, '\pH', 'g'); # wrong; returns garbage! @list = match($string,'(\pH)','g'); # good
PATTERN
is specified as a string.MODIFIER
is specified as a string.i,I,j,s,m,x,o please see re(). g match globally z tell the function the pattern matches an empty string (sorry, due to the poor auto-detection)
replace(STRING or SCALAR REF, PATTERN, REPLACEMENT)
replace(STRING or SCALAR REF, PATTERN, REPLACEMENT, MODIFIER)
-
An emulation of
s///
operator but aware of Shift-JIS.If a reference to a scalar is specified as the first argument, substitutes the referent scalar and returns the number of substitutions made. If a string (not a reference) is specified as the first argument, returns the substituted string and the specified string is unaffected.
MODIFIER
is specified as a string.i,I,j,s,m,x,o please see re(). g,z please see match().
jsplit(PATTERN or ARRAY REF of [PATTERN, MODIFIER], STRING)
jsplit(PATTERN or ARRAY REF of [PATTERN, MODIFIER], STRING, LIMIT)
-
An emulation of
CORE::split
but aware of Shift-JIS.In scalar/void context, it does not split into the
@_
array; in scalar context, only returns the number of fields found.PATTERN
is specified as a string. But' '
asPATTERN
has no special meaning; it splits the string on a single space similarly toCORE::split / /
.When you want to split the string on whitespace, pass an undefined value as
PATTERN
or use thesplitspace()
function.jsplit(undef, " \x81\x40 This is \x81\x40 perl."); splitspace(" \x81\x40 This is \x81\x40 perl."); # ('This', 'is', 'perl.')
If you want to pass pattern with modifiers, specify an arrayref of
[PATTERN, MODIFIER]
as the first argument. You can also use "Embedded Modifiers").MODIFIER
is specified as a string.i,I,j,s,m,x,o please see re().
splitspace(STRING)
splitspace(STRING, LIMIT)
-
This function emulates
CORE::split(' ', STRING, LIMIT)
. It returns a list given by splitSTRING
on whitespace including"\x81\x40"
(IDEOGRAPHIC SPACE). Leading whitespace characters do not produce any field.Note:
splitspace(STRING, LIMIT)
is equivalent tojsplit(undef, STRING, LIMIT)
. splitchar(STRING)
splitchar(STRING, LIMIT)
-
This function emulates
CORE::split(//, STRING, LIMIT)
. It returns a list given by split ofSTRING
into characters.Note:
splitchar(STRING, LIMIT)
is equivalent tojsplit('', STRING, LIMIT)
.
Basic Regular Expressions
regexp meaning
^ match the start of the string
match the start of any line with 'm' modifier
$ match the end of the string, or before newline at the end
match the end of any line with 'm' modifier
. match any character except \n
match any character with 's' modifier
\A only at beginning of string
\Z at the end of the string, or before newline at the end
\z only at the end of the string (eq. '(?!\n)\Z')
\C match a single C char (octet), i.e. [\0-\xFF] in perl.
\j match any character, i.e. [\0-\x{FCFC}] in this module.
\J match any character except \n, i.e. [^\n] in this module.
* \j and \J are extensions by this module. e.g.
match($_, '(\j{5})\z') returns last five chars including \n at the end
match($_, '(\J{5})\Z') returns last five chars excluding \n at the end
Metacharacters
\a alarm (BEL)
\b backspace (BS) * within character classes *
\e escape (ESC)
\f form feed (FF)
\n newline (LF)
\r return (CR)
\t tab (HT)
\0 null (NUL)
\ooo octal single-byte character
\xhh hexadecimal single-byte character
\x{hhhh} hexadecimal double-byte character
\c[ control character
e.g. \012 \123 \x5c \x5C \x{824F} \x{9Fae} \cA \cZ \c^ \c?
Character Classes
A character class can include literal characters, metacharacters, and predefined character classes. Ranges in character class are supported. The endpoints of a range are specified by literal characters or metacharacters.
The order of Shift-JIS characters is: 0x00 .. 0x7F, 0xA1 .. 0xDF, 0x8140 .. 0x9FFC, 0xE040 .. 0xFCFC
.
It is no need for users to be conscious of legal ranges of leading and trailing bytes in Shift-JIS, as this module properly skips illegal byte sequences when a character range is to be expanded. For example [\x{8340}-\x{8396}]
is equivalent to [\x{8340}-\x{837E}\x{8380}-\x{8396}]
, since 0x7F
is illegal as the trailing byte in Shift-JIS. So [\0-\x{fcfc}]
matches any one Shift-JIS character. In character classes, any character or byte sequence that does not match any one Shift-JIS character (say, re('[\xA0-\xFF]')
) is croaked.
Character classes that match non-Shift-JIS substring are not supported (use \C
or alternation).
Character Equivalences
Since the version 0.13, the POSIX character equivalence classes [=x=]
are supported, where x can be any character literal or meta chatacter (\xhh
, \x{hhhh}
) that belongs to the character equivalents can be used. have identical meanings. Character equivalence classes are used in a character class.
A kana collation symbol which may be voiced/semi-voiced includes a sequence(s) of two characters of voiced/semi-voiced in halfwidth katakana.
[[===]]
matches EQUALS SIGN
or FULLWIDTH EQUALS SIGN
; [[=[=]]
matches LEFT SQUARE BRACKET
or FULLWIDTH LEFT SQUARE BRACKET
; [[=]=]]
matches RIGHT SQUARE BRACKET
or FULLWIDTH RIGHT SQUARE BRACKET
; [[=\=]]
matches YEN SIGN
or FULLWIDTH YEN SIGN
.
Predefined Character Classes
Normal Abbrev. POSIX definition by characters and ranges
\d [0-9]
\D [^0-9]
\w [0-9A-Z_a-z]
\W [^0-9A-Z_a-z]
\s [\t\n\r\f ]
\S [^\t\n\r\f ]
\p{Xdigit} \pX [[:xdigit:]] [0-9A-Fa-f]
\p{Digit} \pD [[:digit:]] [0-9\x{824F}-\x{8258}]
\p{Upper} \pU [[:upper:]] [A-Z\x{8260}-\x{8279}]
\p{Lower} \pL [[:lower:]] [a-z\x{8281}-\x{829A}]
\p{Alpha} \pA [[:alpha:]] [\p{Upper}\p{Lower}]
\p{Alnum} \pQ [[:alnum:]] [\p{Alpha}\p{Digit}]
\p{Word} \pW [[:word:]] [_\p{Digit}\p{European}\p{Kana}\p{Kanji}]
\p{Punct} \pP [[:punct:]] [!-/:-@[-`{-~\xA1-\xA5\x{8141}-\x{8149}\x{814C}-\x{8151}
\x{815C}-\x{81AC}\x{81B8}-\x{81BF}\x{81C8}-\x{81CE}
\x{81DA}-\x{81E8}\x{81F0}-\x{81F7}\x{81FC}\x{849F}-\x{84BE}]
\p{Graph} \pG [[:graph:]] [\p{Word}\p{Punct}]
\p{Print} \pT [[:print:]] [\x20\x{8140}\p{Graph}]
\p{Space} \pS [[:space:]] [\x20\x{8140}\x09-\x0D]
\p{Blank} \pB [[:blank:]] [\x20\x{8140}\t]
\p{Cntrl} \pC [[:cntrl:]] [\x00-\x1F\x7F]
\p{ASCII} [[:ascii:]] [\x00-\x7F]
\p{Roman} \pR [[:roman:]] [\x21-\x7E]
\p{Hankaku} \pY [[:hankaku:]] [\xA1-\xDF]
\p{Zenkaku} \pZ [[:zenkaku:]] [\x{8140}-\x{FCFC}]
( \p{^Zenkaku} \p^Z [[:^zenkaku:]] [\x00-\x7F\xA1-\xDF] )
\p{Halfwidth} [[:halfwidth:]] [!#$%&()*+,./0-9:;<=>?@A-Z\[\x5c\]^_`a-z{|}~]
\p{Fullwidth} \pF [[:fullwidth:]] [\x{8143}\x{8144}\x{8146}-\x{8149}\x{814D}\x{814F}-\x{8151}
\x{815E}\x{8162}\x{8169}\x{816A}\x{816D}-\x{8170}\x{817B}
\x{8181}\x{8183}\x{8184}\x{818F}\x{8190}\x{8193}-\x{8197}
\x{824F}-\x{8258}\p{FullLatin}]
\p{X0201} [[:x0201:]] [\x20-\x7F\xA1-\xDF]
\p{X0208} [[:x0208:]] [\x{8140}-\x{81AC}\x{81B8}-\x{81BF}\x{81C8}-\x{81CE}
\x{81DA}-\x{81E8}\x{81F0}-\x{81F7}\x{81FC}\x{824F}-\x{8258}
\p{FullLatin}\x{829F}-\x{82F1}\x{8340}-\x{8396}
\p{Greek}\p{Cyrillic}\p{BoxDrawing}\p{Kanji1}\p{Kanji2}]
\p{X0211} [[:x0211:]] [\x00-\x1F]
\p{JIS} \pJ [[:jis:]] [\p{X0201}\p{X0208}\p{X0211}]
\p{NEC} \pN [[:nec:]] [\x{8740}-\x{875D}\x{875F}-\x{8775}\x{877E}-\x{879C}
\x{ED40}-\x{EEEC}\x{EEEF}-\x{EEFC}]
\p{IBM} \pI [[:ibm:]] [\x{FA40}-\x{FC4B}]
\p{Vendor} \pV [[:vendor:]] [\p{NEC}\p{IBM}]
\p{MSWin} \pM [[:mswin:]] [\p{JIS}\p{Vendor}]
\p{Latin} [[:latin:]] [A-Za-z]
\p{FullLatin} [[:fulllatin:]] [\x{8260}-\x{8279}\x{8281}-\x{829A}]
\p{Greek} [[:greek:]] [\x{839F}-\x{83B6}\x{83BF}-\x{83D6}]
\p{Cyrillic} [[:cyrillic:]] [\x{8440}-\x{8460}\x{8470}-\x{8491}]
\p{European} \pE [[:european:]] [\p{Latin}\p{FullLatin}\p{Greek}\p{Cyrillic}]
\p{HalfKana} [[:halfkana:]] [\xA6-\xDF]
\p{Hiragana} \pH [[:hiragana:]] [\x{829F}-\x{82F1}\x{814A}\x{814B}\x{8154}\x{8155}]
\p{Katakana} \pK [[:katakana:]] [\x{8340}-\x{8396}\x{815B}\x{8152}\x{8153}]
\p{FullKana} [[:fullkana:]] [\p{Hiragana}\p{Katakana}]
\p{Kana} [[:kana:]] [\p{HalfKana}\p{FullKana}]
\p{Kanji0} \p0 [[:kanji0:]] [\x{8156}-\x{815A}]
\p{Kanji1} \p1 [[:kanji1:]] [\x{889F}-\x{9872}]
\p{Kanji2} \p2 [[:kanji2:]] [\x{989F}-\x{EAA4}]
\p{Kanji} [[:kanji:]] [\p{Kanji0}\p{Kanji1}\p{Kanji2}]
\p{BoxDrawing} [[:boxdrawing:]] [\x{849F}-\x{84BE}]
\p{Halfwidth}
matches an ASCII graphic character excludingQUOTATION MARK
,APOSTROPHE
, andHYPHEN-MINUS
.\p{Fullwidth}
matches a double-byte character corresponding to\p{Halfwidth}
. Note: the\p{Fullwidth}
character for0x5C
(\
) isFULLWIDTH YEN SIGN
and that for0x7E
(~
) isFULLWIDTH MACRON
.\p{MSWin}
matches a character of Microsoft CP932.\p{NEC}
matches an NEC special character or an NEC-selected IBM extended character.\p{IBM}
matches an IBM extended character.\p{Vendor}
matches a character of vendor-defined characters in Microsoft CP932, i.e. equivalent to[\p{NEC}\p{IBM}]
.\p{Kanji0}
matches a kanji of the minimum kanji class of JIS X 4061;\p{Kanji1}
matches a kanji of the level 1 kanji of JIS X 0208;\p{Kanji2}
matches a kanji of the level 2 kanji of JIS X 0208;\p{Kanji}
matches a kanji of the basic kanji class of JIS X 4061.\p{Prop}
,\P{^Prop}
,[\p{Prop}]
, etc. are equivalent to each other; and their complements are\P{Prop}
,\p{^Prop}
,[\P{Prop}]
,[^\p{Prop}]
, etc.\pP
,\P^P
,[\pP]
, etc. are equivalent to each other; and their complements are\PP
,\p^P
,[\PP]
,[^\pP]
, etc.[[:class:]]
is equivalent to[^[:^class:]]
; and their complements are[[:^class:]]
or[^[:class:]]
.In
\p{Prop}
,\P{Prop}
,[:class:]
expressions,Prop
andclass
are case-insensitive. E.g.\p{digit}
,[:BoxDrawing:]
, etc. are also accepted. PrefixesIs
andIn
for\p{Prop}
and\P{Prop}
(e.g.\p{IsProp}
,\P{InProp}
, etc.) are optional. But\p{isProp}
,\p{ISProp}
, etc. are not ok, since the prefixesIs
andIn
are not case-insensitive.
Examples of Character Classes
- Kanji
-
Level 1 and 2 kanji by JIS X 0208:1997; [\x{889F}-\x{9872}\x{989F}-\x{EAA4}] Level 3 kanji by JIS X 0213:2004; [\x{879F}-\x{889E}\x{9873}-\x{989E}\x{EAA5}-\x{EFFC}] Level 4 kanji by JIS X 0213:2004; [\x{F040}-\x{FCF4}] Level 1 to 3 kanji by JIS X 0213:2004; [\x{879F}-\x{EFFC}] Level 1 to 4 kanji by JIS X 0213:2004; [\x{879F}-\x{FCF4}] Kanji in NEC-selected IBM extended chars; [\x{ED40}-\x{EEEC}] Kanji in IBM extended characters; [\x{FA5C}-\x{FC4B}]
- JIS X 0213:2004
-
Assigned; [\x{8140}-\x{82F9}\x{8340}-\x{84DC}\x{84E5}-\x{84FA} \x{8540}-\x{86F1}\x{86FB}-\x{8776}\x{877E}-\x{878F} \x{8793}\x{8798}\x{8799}\x{879D}-\x{FCF4}] Unassigned; [\x{82FA}-\x{82FC}\x{84DD}-\x{84E4}\x{84FB}\x{84FC} \x{86F2}-\x{86FA}\x{8777}-\x{877D}\x{8790}-\x{8792} \x{8794}-\x{8797}\x{879A}-\x{879C}\x{FCF5}-\x{FCFC}] Assigned (plain 1); [\x{8140}-\x{82F9}\x{8340}-\x{84DC}\x{84E5}-\x{84FA} \x{8540}-\x{86F1}\x{86FB}-\x{8776}\x{877E}-\x{878F} \x{8793}\x{8798}\x{8799}\x{879D}-\x{EFFC}] Unassigned (plain 1); [\x{82FA}-\x{82FC}\x{84DD}-\x{84E4}\x{84FB}\x{84FC} \x{86F2}-\x{86FA}\x{8777}-\x{877D}\x{8790}-\x{8792} \x{8794}-\x{8797}\x{879A}-\x{879C}] Addition in 2004; [\x{879F}\x{889E}\x{9873}\x{989E}\x{EAA5}\x{EFF8}-\x{EFFC}]
- User-defined characters
-
Windows CP-932: [\x{F040}-\x{F9FC}] MacOS Japanese: [\x{F040}-\x{FCFC}]
- Circled Digits and Numbers
-
Circled 1-50 by JIS X 0213; [\x{8740}-\x{8753}\x{84BF}-\x{84DC}] Circled 1-20 in NEC special chars; [\x{8740}-\x{8753}] Circled 1-20 in MacOS Japanese; [\x{8540}-\x{8553}] Double Circled 1-10 by JIS X 0213; [\x{83D8}-\x{83E1}] Negative Circled 1-20 by JIS X 0213; [\x{869F}-\x{86B2}] Negative Circled 1-9 in MacOS Japanese; [\x{857C}-\x{8585}]
- Roman Numerals
-
Capital I-XII by JIS X 0213; [\x{8754}-\x{875E}\x{8776}] Capital I-X in NEC special chars; [\x{8754}-\x{875D}] Capital I-X in IBM extended characters; [\x{FA4A}-\x{FA53}] Capital I-XV in MacOS Japanese; [\x{859F}-\x{85AD}] Small i-xii by JIS X 0213; [\x{86B3}-\x{86BE}] Small i-x in NEC-selected IBM extended chars; [\x{EEEF}-\x{EEF8}] Small i-x in IBM extended characters; [\x{FA40}-\x{FA49}] Small i-xv in MacOS Japanese; [\x{85B3}-\x{85C1}]
- Double-Byte Characters for ASCII Graphic Characters
-
JIS X 0213; [\x{8149}\x{81AE}\x{8194}\x{8190}\x{8193}\x{8195}\x{81AD} \x{8169}\x{816A}\x{8196}\x{817B}\x{8143}\x{81AF}\x{8144} \x{815E}\x{824F}-\x{8258}\x{8146}\x{8147}\x{8183}\x{8181} \x{8184}\x{8148}\x{8197}\x{8260}-\x{8279}\x{816D}\x{815F} \x{816E}\x{814F}\x{8151}\x{814D}\x{8281}-\x{829A}\x{816F} \x{8162}\x{8170}\x{81B0}] Windows CP-932; [\x{8149}\x{FA57}\x{8194}\x{8190}\x{8193}\x{8195}\x{FA56} \x{8169}\x{816A}\x{8196}\x{817B}\x{8143}\x{817C}\x{8144} \x{815E}\x{824F}-\x{8258}\x{8146}\x{8147}\x{8183}\x{8181} \x{8184}\x{8148}\x{8197}\x{8260}-\x{8279}\x{816D}\x{815F} \x{816E}\x{814F}\x{8151}\x{814D}\x{8281}-\x{829A}\x{816F} \x{8162}\x{8170}\x{8160}]
Note: here, the character for ASCII
0x5C
isREVERSE SOLIDUS
(orFULLWIDTH REVERSE SOLIDUS
) and the character for ASCII0x7E
isTILDE
(orFULLWIDTH TILDE
).
Code Embedded in a Regular Expression (Perl 5.005 or later)
Parsing (?{ ... })
or (??{ ... })
assertions is carried out without any special care of double-byte characters.
(?{ ... })
or (??{ ... })
assertions are disallowed in match()
or replace()
function by perl due to security concerns. Use them via re()
function inside your scope.
Embedded Modifiers
Since version 0.15, embedded modifiers are extended.
An embedded modifier, (?iIjsmxo)
, that appears at the beginning of the 'regexp' or that follows one of regular expressions ^
, \A
, or \G
at the beginning of the 'regexp' is allowed to contain I
, j
, o
modifiers.
e.g. (?sm)pattern ^(?i)pattern \G(?j)pattern \A(?ijo)pattern
Avoiding Mismatching
Using 'e' modifier in replacement or looping in a while
-clause are not supported by this module. They can be used only via a usual syntax (i.e. in m//
or s///
operators).
Use a regular expression '\A(\j*?)'
or '\G(\j*?)'
, to avoid mismatching a single-byte character on a trailing byte of a double-byte character, or a double-byte character on two bytes before and after a character boundary.
Don't forget $1
corresponds to '(\j*?)'
and backreferences intended to use begin from $2
.
Note: If matching on a very long string, a special regular expression \R{padG}
may be safer than \G(\j*?)
as the former has a lower probability of that the repeating count of *
would overflow a limit.
CAVEATS
A legal Shift-JIS character in this module must match the following regular expression:
[\x00-\x7F\xA1-\xDF]|[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC]
Any string from external resource should be checked by the function ShiftJIS::String::issjis()
, excepting you know it is surely encoded in Shift-JIS.
Use of an illegal Shift-JIS string may lead to odd results.
Some Shift-JIS double-byte characters have a trailing byte in the range of [\x40-\x7E]
, viz.,
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
The Perl's lexical analyzer doesn't take any care to these characters, so they sometimes make trouble. For example, the quoted literal ending with a double-byte character whose trailing byte is 0x5C
causes a fatal error, since the trailing byte 0x5C
backslashes the closing quote.
Such a problem doesn't arise when the string is gotten from any external resource. But writing the script containing Shift-JIS double-byte characters needs the greatest care.
The use of single-quoted heredoc, << ''
, or \xhh
meta characters is recommended in order to define a Shift-JIS string literal.
The safe ASCII-graphic characters, [\x21-\x3F]
, are:
!"#$%&'()*+,-./0123456789:;<=>?
They are preferred as the delimiter of quote-like operators.
BUGS
The
\U
,\L
,\Q
,\E
, and interpolation are not considered. If necessary, use them in""
(orqq//
) operators in the argument list.The regular expressions of the word boundary,
\b
and\B
, don't work correctly.The
i
,I
andj
modifiers are invalid to\p{}
,\P{}
, and POSIX[: :]
(e.g.\p{Lower}
,[:lower:]
, etc). So usere('\p{Alpha}')
instead ofre('\p{IsLower}', 'iI')
.The look-behind assertion like
(?<=[A-Z])
is not prevented from matching trail byte of the previous double byte character.Use of not greedy regular expressions, which can match empty string, such as
.??
and\d*?
, as the PATTERN injsplit()
, may cause failure to the emulation ofCORE::split
.
AUTHOR
SADAHIRO Tomoyuki <SADAHIRO@cpan.org>
Copyright(C) 2001-2012, SADAHIRO Tomoyuki. Japan. All rights reserved.
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.