NAME
CharClass::Matcher -- Generate C macros that match character classes efficiently
SYNOPSIS
perl Porting/regcharclass.pl
DESCRIPTION
Dynamically generates macros for detecting special charclasses in latin-1, utf8, and codepoint forms. Macros can be set to return the length (in bytes) of the matched codepoint, and/or the codepoint itself.
To regenerate regcharclass.h, run this script from perl-root. No arguments are necessary.
Using WHATEVER as an example, the following macros can be produced, depending on the input parameters (how to get each is described by internal comments at the __DATA__ line):
is_WHATEVER(s,is_utf8)
is_WHATEVER_safe(s,e,is_utf8)
-
Do a lookup as appropriate based on the is_utf8 flag. When possible, comparisons involving octets < 128 are done before checking the is_utf8 flag, hopefully saving time.
The version without the _safe suffix should be used only when the input is known to be well-formed.
is_WHATEVER_utf8(s)
is_WHATEVER_utf8_safe(s,e)
-
Do a lookup assuming the string is encoded in (normalized) UTF8.
The version without the _safe suffix should be used only when the input is known to be well-formed.
is_WHATEVER_latin1(s)
is_WHATEVER_latin1_safe(s,e)
-
Do a lookup assuming the string is encoded in latin-1 (aka plain octets).
The version without the _safe suffix should be used only when it is known that s contains at least one character.
is_WHATEVER_cp(cp)
-
Check to see if the string matches a given codepoint (hypothetically a U32). The condition is constructed so as to "break out" as early as possible if the codepoint is out of range of the condition.
IOW:
(cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
Thus if the character is X+1 only two comparisons will be done, which makes matching lookups slower but non-matching ones faster.
what_len_WHATEVER_FOO(arg1, ..., len)
-
A variant form of each of the macro types described above can be generated, in which the code point is returned by the macro, and an extra parameter (in the final position) is added, which is a pointer for the macro to set the byte length of the returned code point.
These forms all have a what_len prefix instead of the is_, for example what_len_WHATEVER_safe(s,e,is_utf8,len) and what_len_WHATEVER_utf8(s,len).
These forms should not be used except on small sets of mostly widely separated code points; otherwise the code generated is inefficient. For these cases, it is best to use the is_ forms, and then find the code point with utf8_to_uvchr_buf(). This program can fail with a "deep recursion" message on the worst of the inappropriate sets. Examine the generated macro to see if it is acceptable.
what_WHATEVER_FOO(arg1, ...)
-
A variant form of each of the is_ macro types described above can be generated, in which the code point and not the length is returned by the macro. These have the same caveat as "what_len_WHATEVER_FOO(arg1, ..., len)", plus they should not be used where the set contains a NUL (code point 0), as 0 is returned for two different cases: a) the set doesn't include the input code point; b) the set does include it, and it is a NUL.
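The code point forms in the list above can be illustrated with a small hand-written sketch in C. It shows the nested "break out early" shape of an is_..._cp condition and the 0 ambiguity of the what_ forms. The sets and the names is_EXAMPLE_cp, what_EXAMPLE_cp, what_WITHNUL_cp and example_cp_demo are hypothetical and chosen only for illustration; this is not output of regcharclass.pl, and the macros it actually generates may look different.

```c
#include <assert.h>

/* Hypothetical set {U+000A, U+000D, U+0085}, sorted ascending.
 * Hand-written in the nested "break out early" style; this is an
 * assumption about the shape, not output of regcharclass.pl. */
#define is_EXAMPLE_cp(cp)                                        \
    ( (cp) == 0x0A || ( (cp) > 0x0A &&                           \
    ( (cp) == 0x0D || ( (cp) > 0x0D &&                           \
      (cp) == 0x85 ) ) ) )

/* what_ style: evaluates to the code point on a match, else 0. */
#define what_EXAMPLE_cp(cp)  ( is_EXAMPLE_cp(cp) ? (cp) : 0 )

/* Hypothetical set {U+0000, U+000A}: a matching NUL also gives 0,
 * so the caller cannot tell "matched NUL" from "no match". */
#define what_WITHNUL_cp(cp)                                      \
    ( ((cp) == 0x00 || (cp) == 0x0A) ? (cp) : 0 )

/* Small self-check exercising the macros above. */
static void example_cp_demo(void)
{
    unsigned int nl = 0x0A, cr = 0x0D, nel = 0x85;
    unsigned int tab = 0x09, a = 0x41, nul = 0x00;

    /* Members of the set match. */
    assert(is_EXAMPLE_cp(nl) && is_EXAMPLE_cp(cr) && is_EXAMPLE_cp(nel));

    /* Non-members do not; tab is below the smallest member, so the
     * condition is false after the first two comparisons. */
    assert(!is_EXAMPLE_cp(tab) && !is_EXAMPLE_cp(a));

    /* The what_ form hands back the code point itself on a match. */
    assert(what_EXAMPLE_cp(cr) == 0x0D);
    assert(what_EXAMPLE_cp(a) == 0);

    /* The ambiguity: a matched NUL and a non-match both yield 0. */
    assert(what_WITHNUL_cp(nul) == 0);
    assert(what_WITHNUL_cp(a) == 0);
}
```

Because each macro argument is evaluated more than once in this sketch, it should only be given side-effect-free expressions such as plain variables.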
The list above isn't quite complete, as for specialized purposes one can get a macro like is_WHATEVER_utf8_no_length_checks(s), which assumes that it is already known that there is enough space to hold the character starting at s, but otherwise checks that it is well-formed. In other words, this is intermediate in the amount of checking between is_WHATEVER_utf8(s) and is_WHATEVER_utf8_safe(s,e).
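The difference between the plain and _safe UTF-8 forms can be sketched in C as follows. This is a hand-written illustration under stated assumptions, not generated code: the class {U+000A, U+0085, U+2028} and the names is_EXAMPLE_utf8 and is_EXAMPLE_utf8_safe are hypothetical, functions are used instead of macros for readability, and the sketch assumes the variant that returns the matched length in bytes (0 meaning no match).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class {U+000A, U+0085, U+2028}; UTF-8 encodings:
 *   U+000A -> 0A,  U+0085 -> C2 85,  U+2028 -> E2 80 A8
 * Both functions return the matched length in bytes, or 0 for no match. */

/* Plain form: trusts that s points at a complete, well-formed
 * character, so it reads continuation bytes without a bounds check. */
static size_t is_EXAMPLE_utf8(const unsigned char *s)
{
    if (s[0] == 0x0A) return 1;
    if (s[0] == 0xC2) return (s[1] == 0x85) ? 2 : 0;
    if (s[0] == 0xE2) return (s[1] == 0x80 && s[2] == 0xA8) ? 3 : 0;
    return 0;
}

/* _safe form: e points one past the last readable byte, and each
 * branch first checks that enough bytes lie before e. */
static size_t is_EXAMPLE_utf8_safe(const unsigned char *s,
                                   const unsigned char *e)
{
    size_t avail = (s < e) ? (size_t)(e - s) : 0;

    if (avail >= 1 && s[0] == 0x0A) return 1;
    if (avail >= 2 && s[0] == 0xC2 && s[1] == 0x85) return 2;
    if (avail >= 3 && s[0] == 0xE2 && s[1] == 0x80 && s[2] == 0xA8) return 3;
    return 0;
}
```

Given a buffer that holds only the first two bytes of U+2028, the _safe sketch returns 0 rather than reading past e, whereas the plain sketch would read a third byte that may not exist.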
CODE FORMAT
perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
AUTHOR
Author: Yves Orton (demerphq) 2007. Maintained by Perl5 Porters.
BUGS
No tests directly here (although the regex engine will fail tests if this code is broken). Insufficient documentation and no Getopts handler for using the module as a script.
LICENSE
You may distribute under the terms of either the GNU General Public License or the Artistic License, as specified in the README file.