TITLE
CharClass::Matcher -- Generate C macros that match character classes efficiently
SYNOPSIS
~/perl$ perl Porting/regcharclass.pl
DESCRIPTION
Dynamically generates macros for detecting special charclasses in latin-1, utf8, and codepoint forms. Macros can be set to return the length (in bytes) of the matched codepoint, or the codepoint itself.
To regenerate regcharclass.h, run this script from perl-root. No arguments are necessary.
Using WHATEVER as an example, the following macros will be produced:
- is_WHATEVER(s,is_utf8)
- is_WHATEVER_safe(s,e,is_utf8)
Do a lookup as appropriate based on the is_utf8 flag. When possible, comparisons involving octets < 128 are done before checking the is_utf8 flag, hopefully saving time.
- is_WHATEVER_utf8(s)
- is_WHATEVER_utf8_safe(s,e)
Do a lookup assuming the string is encoded in (normalized) UTF8.
- is_WHATEVER_latin1(s)
- is_WHATEVER_latin1_safe(s,e)
Do a lookup assuming the string is encoded in latin-1 (aka plain octets).
- is_WHATEVER_cp(cp)
Check whether a given codepoint (hypothetically a U32) matches the class. The condition is constructed so as to "break out" as early as possible if the codepoint is out of range of the condition.
In other words:
(cp==X || (cp>X && (cp==Y || (cp>Y && ...))))
Thus a codepoint smaller than X is rejected after only two comparisons, and one between X and Y after four. This makes matching lookups slower, but non-matching lookups faster.
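As an illustration of the nested form, here is a minimal hand-written sketch in C for a hypothetical class DEMO containing the codepoints 0x0A, 0x0D and 0x85. The class name, its members, and the exact macro layout are assumptions made for this example; the text actually written to regcharclass.h may look different.

```c
#include <assert.h>

/* Hypothetical class DEMO = { 0x0A, 0x0D, 0x85 }, members in ascending order.
 * Each level tests for equality first and only descends when cp is larger
 * than the current member, so a codepoint below it falls out immediately.
 * Note that cp is evaluated more than once, so it must have no side effects. */
#define is_DEMO_cp(cp)                      \
    ( 0x0A == (cp) || ( 0x0A < (cp) &&      \
    ( 0x0D == (cp) || ( 0x0D < (cp) &&      \
      0x85 == (cp) ) ) ) )

/* Usage sketch: members match, everything else is rejected. */
void is_DEMO_cp_examples(void)
{
    assert( is_DEMO_cp(0x0A));
    assert( is_DEMO_cp(0x0D));
    assert( is_DEMO_cp(0x85));
    assert(!is_DEMO_cp(0x09));  /* below the first member: two comparisons  */
    assert(!is_DEMO_cp(0x0B));  /* between 0x0A and 0x0D: four comparisons  */
    assert(!is_DEMO_cp(0x86));  /* above the last member                    */
}
```

The early exit favors input that mostly does not belong to the class, which is the trade-off described above.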
Additionally, it is possible to generate what_ variants that return the codepoint read instead of the number of octets read. This is done by suffixing '-cp' to the type description.
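The difference between the macro families can be sketched by hand for a hypothetical class DEMO containing U+000A and U+0085. Everything in this block is an assumption made for illustration: the class, the U8 typedef standing in for perl's own type, the DEMO_OCTET helper, the name what_DEMO_utf8_safe, and the convention of returning 0 on no match. The generated header may differ in naming and layout.

```c
#include <assert.h>

typedef unsigned char U8;   /* stand-in for perl's U8 (assumed here) */

/* Hypothetical helper: read octet i of s as an unsigned value. */
#define DEMO_OCTET(s,i) (((const U8*)(s))[i])

/* UTF-8 form: returns the number of octets matched at s, or 0.
 * U+000A is the single octet 0A; U+0085 is the two octets C2 85.
 * The _safe form never reads at or beyond e. */
#define is_DEMO_utf8_safe(s,e)                                          \
    ( ( ((e) - (s)) > 0 && 0x0A == DEMO_OCTET(s,0) ) ? 1                \
    : ( ((e) - (s)) > 1 && 0xC2 == DEMO_OCTET(s,0)                      \
                        && 0x85 == DEMO_OCTET(s,1) ) ? 2 : 0 )

/* latin-1 form: every member is exactly one octet. */
#define is_DEMO_latin1_safe(s,e)                                        \
    ( ( ((e) - (s)) > 0 && ( 0x0A == DEMO_OCTET(s,0)                    \
                          || 0x85 == DEMO_OCTET(s,0) ) ) ? 1 : 0 )

/* Flag-driven form: the octet < 128 is encoded identically in both
 * encodings, so it is tested before is_utf8 is consulted. */
#define is_DEMO_safe(s,e,is_utf8)                                       \
    ( ( ((e) - (s)) > 0 && 0x0A == DEMO_OCTET(s,0) ) ? 1                \
    : (is_utf8)                                                         \
      ? ( ( ((e) - (s)) > 1 && 0xC2 == DEMO_OCTET(s,0)                  \
                            && 0x85 == DEMO_OCTET(s,1) ) ? 2 : 0 )      \
      : ( ( ((e) - (s)) > 0 && 0x85 == DEMO_OCTET(s,0) ) ? 1 : 0 ) )

/* what_ form: same tests as is_DEMO_utf8_safe, but each branch yields
 * the codepoint read instead of the octet count (0 means no match,
 * which is unambiguous only because U+0000 is not in this class). */
#define what_DEMO_utf8_safe(s,e)                                        \
    ( ( ((e) - (s)) > 0 && 0x0A == DEMO_OCTET(s,0) ) ? 0x0A             \
    : ( ((e) - (s)) > 1 && 0xC2 == DEMO_OCTET(s,0)                      \
                        && 0x85 == DEMO_OCTET(s,1) ) ? 0x85 : 0 )

/* Usage sketch. */
void DEMO_examples(void)
{
    static const U8 nel_utf8[]   = { 0xC2, 0x85 };  /* U+0085 as UTF-8   */
    static const U8 nel_latin1[] = { 0x85 };        /* U+0085 as latin-1 */

    assert(is_DEMO_utf8_safe(nel_utf8, nel_utf8 + 2) == 2);
    assert(is_DEMO_utf8_safe(nel_utf8, nel_utf8 + 1) == 0); /* truncated */
    assert(what_DEMO_utf8_safe(nel_utf8, nel_utf8 + 2) == 0x85);
    assert(is_DEMO_latin1_safe(nel_latin1, nel_latin1 + 1) == 1);
    assert(is_DEMO_safe(nel_utf8, nel_utf8 + 2, 1) == 2);
    assert(is_DEMO_safe(nel_latin1, nel_latin1 + 1, 0) == 1);
}
```

The is_ and what_ macros share one decision tree and differ only in the value each branch produces.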
CODE FORMAT
perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"
AUTHOR
Author: Yves Orton (demerphq) 2007
BUGS
No tests directly here (although the regex engine will fail tests if this code is broken). Insufficient documentation and no Getopts handler for using the module as a script.
LICENSE
You may distribute under the terms of either the GNU General Public License or the Artistic License, as specified in the README file.