NAME

Unicode::CharWidth - Character Width properties

VERSION

Version 1.02

SYNOPSIS

use Unicode::CharWidth;

if ( $string =~ /\p{InDoublewidth}/ ) {
    # string contains double width (two-column) characters
}

if ( $string !~ /\p{InNowidth}/ ) {
    # all string characters have a defined column width
}

# use capital P for negation

if ( $string =~ /\P{InSinglewidth}/ ) {
    # string contains characters that aren't single width
}

DESCRIPTION

Export

Unicode::CharWidth exports four functions: InZerowidth, InSinglewidth, InDoublewidth and InNowidth.

These functions enable the use of like-named (unofficial) Unicode properties in regular expressions. Thus /\p{InSinglewidth}/ matches all characters that occupy a single screen column.

The functions are not supposed to be called directly (they return strings that describe character properties, some of them lengthy), but are automatically called by Perl's Unicode matching system. They must be present in your current package for the "unicode properties" to work as described below.
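As a rough illustration of the mechanism Perl uses here (user-defined character properties, documented in perlunicode), the following sketch defines a toy property in the current package. The name InSketchVowel is hypothetical and not part of Unicode::CharWidth; it only shows the kind of string such a function returns and how the regex engine picks it up.

```perl
use strict;
use warnings;

# Hypothetical property for illustration only; not part of Unicode::CharWidth.
# Perl requires the name of a user-defined property to begin with "In" or "Is".
sub InSketchVowel {
    # The returned string lists one code point (or a range) per line, in hex.
    return join '', map { sprintf "%04X\n", ord } qw(a e i o u);
}

# The regex engine calls InSketchVowel itself; we never call it directly.
my $word = 'rhythm';
print "no vowels in $word\n" if $word !~ /\p{InSketchVowel}/;
```

Because the lookup happens in the current package, the function has to be defined or imported there, which is why Unicode::CharWidth exports its four functions.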

Unicode::CharWidth normally ignores arguments in the use-statement. There is one exception:

use Unicode::CharWidth -gen;

You don't ever need to run this on an installed copy of this module. See "The -gen Option" for more.

Unicode Properties

The enabled Unicode properties are InZerowidth, InSinglewidth, InDoublewidth, and InNowidth.

They are not derived from Unicode documents directly, but rely on the implementation of the C library function wcwidth(3).

InZerowidth

/\p{InZerowidth}/ matches the characters that don't occupy column space of their own. Most of these are modifying or overlay characters that add accents or overstrokes to the preceding character. "\0" also has zero width. It is the only zero width character in the ASCII range.

InSinglewidth

/\p{InSinglewidth}/ matches what most westerners would consider "normal" characters that occupy a single screen column. All printing (non-control) ASCII characters are in this class, as well as most characters in other alphabetic scripts.

InDoublewidth

/\p{InDoublewidth}/ matches characters (in East Asian scripts) that occupy two adjacent screen columns. There are no ASCII characters in this class.

InNowidth

/\p{InNowidth}/ matches characters that don't have a (fixed) column width assigned at all. All ASCII control characters except "\0" are in this class; "\t", "\n", and "\r" are examples. Outside ASCII, vast ranges of unassigned and reserved Unicode characters fall in this class.
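The properties described so far can be combined to estimate how many screen columns a string occupies. The sketch below assumes Unicode::CharWidth is installed; the helper name display_columns is hypothetical, and counting InNowidth characters as zero columns is an assumption made here for simplicity, not something the module prescribes.

```perl
use strict;
use warnings;
use Unicode::CharWidth;    # imports the four In...width property functions

# Hypothetical helper: estimate the number of screen columns of a string.
# Zero-width characters add nothing; characters without a defined width
# (InNowidth) are also counted as zero here, which is an assumption.
sub display_columns {
    my ($string) = @_;
    my $single = () = $string =~ /\p{InSinglewidth}/g;    # count matches
    my $double = () = $string =~ /\p{InDoublewidth}/g;
    return $single + 2 * $double;
}

printf "%d columns\n", display_columns("abc");
```

A caller that cares about tabs or other control characters would want to test for /\p{InNowidth}/ first and handle those separately.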

Every Unicode character matches exactly one of these four character properties. Thus the column width (if any) of a character can in principle be recovered by trying it against the four regexes and noting which one matches. In practice, use the function Text::CharWidth::mbwidth for that (under a Unicode locale): it is much faster, and it is what the character properties are based on in the first place.
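Purely for illustration, the "try the four regexes" approach might look like the sketch below. It assumes Unicode::CharWidth is installed; column_width is a hypothetical helper name, and, as noted above, Text::CharWidth::mbwidth is the better tool for real use.

```perl
use strict;
use warnings;
use Unicode::CharWidth;    # imports the four In...width property functions

# Hypothetical helper: return 0, 1 or 2 for a single character,
# or undef if the character is in InNowidth.
sub column_width {
    my ($char) = @_;
    return 0 if $char =~ /\A\p{InZerowidth}\z/;
    return 1 if $char =~ /\A\p{InSinglewidth}\z/;
    return 2 if $char =~ /\A\p{InDoublewidth}\z/;
    return;    # InNowidth: no column width assigned
}

for my $char ("a", "\0", "\t") {
    my $width = column_width($char);
    printf "U+%04X: %s\n", ord $char, defined $width ? $width : 'no width';
}
```

Since every character matches exactly one property, the order of the three tests does not matter; the fall-through case is InNowidth.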

The -gen Option

As mentioned, use Unicode::CharWidth -gen is handled as a special case. Its purpose is to generate the file that holds the definitions of the character properties exported by this module. The file (called UCW_startup) is distributed with the module, so there is normally no need to generate it again. In the rare case that it gets lost or corrupted, you can force a re-install as with any other damaged module.

The -gen mechanism could technically be separated from the distribution, but it is kept in, mostly for simplicity. Besides, this is supposed to be open software: generating files in private and shipping them to an unsuspecting public isn't the done thing.

If you want to run with -gen for any reason, you must be able to do a few things:

Overwrite the shipped UCW_startup file

The shipped file is installed directly next to the file .../Unicode/CharWidth.pm, as .../Unicode/UCW_startup. (Consult $INC{'Unicode/CharWidth.pm'} if in doubt.) You must have permission to overwrite/create that file as necessary.

Have Text::CharWidth installed

While this module is entirely based on Text::CharWidth, Text::CharWidth isn't a prerequisite. All the wisdom we draw from it is packed into the startup file. If you want to generate the startup file, you need the module.

Run Under a Unicode Locale

Our base function(s) from Text::CharWidth are in fact locale dependent. To make sure that the generated file conforms to Unicode semantics, we must secure an appropriate locale. The effective locale for our purpose is $ENV{LC_CTYPE} || $ENV{LANG} || $ENV{LC_ALL} || '' and it must end with the string ".UTF-8" (this could probably be more liberal).

If these conditions are met, use Unicode::CharWidth -gen generates the startup file and exits (!) with return code 0. The exit is deliberate: it ensures that no useful program can have the option set accidentally, since -gen cannot be combined with a normal run.
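The three conditions listed above can be checked by hand before attempting a -gen run. The following is a minimal sketch using core modules only; all helper names are hypothetical, and the module's own checks may differ in detail.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Spec;

# All helper names below are hypothetical and only mirror the text above.

# Condition 1: UCW_startup lives next to CharWidth.pm.
sub startup_file_for {
    my ($pm_path) = @_;
    return File::Spec->catfile(dirname($pm_path), 'UCW_startup');
}

# Condition 2: Text::CharWidth must be loadable.
sub have_text_charwidth {
    return eval { require Text::CharWidth; 1 } ? 1 : 0;
}

# Condition 3: the effective locale, as described above, must end in ".UTF-8".
sub effective_locale {
    my (%env) = @_;
    return $env{LC_CTYPE} || $env{LANG} || $env{LC_ALL} || '';
}

sub is_utf8_locale {
    my ($locale) = @_;
    return $locale =~ /\.UTF-8\z/ ? 1 : 0;
}

# $INC{'Unicode/CharWidth.pm'} is only set once the module has been loaded.
if (my $pm_path = $INC{'Unicode/CharWidth.pm'}) {
    my $file = startup_file_for($pm_path);
    printf "%s is %s\n", $file, (-w $file) ? 'writable' : 'not writable';
}
printf "Text::CharWidth %s\n", have_text_charwidth() ? 'found' : 'missing';
my $locale  = effective_locale(%ENV);
my $verdict = is_utf8_locale($locale) ? 'suitable' : 'not suitable';
print "locale '$locale' is $verdict for -gen\n";
```

Passing the environment in as a hash, rather than reading %ENV inside the helper, keeps the locale rule easy to try out with made-up values.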

See Also

Text::CharWidth, Unicode::EastAsianWidth

AUTHOR

Anno Siegel, <anno5 at mac.com>

BUGS

Please report any bugs or feature requests to bug-unicode-charwidth at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Unicode-CharWidth. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Unicode::CharWidth

ACKNOWLEDGEMENTS

The C community is the author of our grandmother function, wcwidth(3).

KUBOTA is the author of our mother module https://metacpan.org/pod/Text::CharWidth. This module is essentially based on one of its functions, mbwidth(), which in turn is based on wcwidth(3).

AUDREYT is the author of our sister module https://metacpan.org/pod/Unicode::EastAsianWidth, which was a role model for this implementation.

LICENSE AND COPYRIGHT

Copyright 2014 Anno Siegel.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.