NAME

Unicode::Regex::Set - Subtraction and Intersection of Character Sets in Unicode Regular Expressions

SYNOPSIS

    use Unicode::Regex::Set qw(parse);

    my $regex = parse('[\p{Latin} & \p{L&} - A-Z]');

DESCRIPTION

Perl 5.8.0 lacks subtraction and intersection of character sets, which are described in Unicode Regular Expressions (UTS #18). This module provides a character class syntax that mimics those operations, implemented with look-ahead assertions.
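The look-ahead technique mentioned above can be sketched in plain Perl, without this module. The patterns below are hand-written assumptions meant only to show the idea; the string that parse() actually returns may be structured differently.

```perl
use strict;
use warnings;

# Hand-written patterns for illustration; the regex that parse() actually
# returns may be structured differently.

# Intersection, roughly [A-Z & [K-Z k-z]]: the look-ahead requires the next
# character to be in [A-Z], then the class consumes it only if it is also
# in [K-Zk-z].
my $intersection = qr/(?=[A-Z])[K-Zk-z]/;

# Subtraction, roughly [A-Z - Z]: the negative look-ahead rejects 'Z'
# before [A-Z] consumes the character.
my $subtraction = qr/(?![Z])[A-Z]/;

my @candidates = ('A' .. 'Z', 'a' .. 'z');
my $in_both   = join '', grep { /\A$intersection\z/ } @candidates;
my $remaining = join '', grep { /\A$subtraction\z/ } @candidates;

print "$in_both\n";     # KLMNOPQRSTUVWXYZ
print "$remaining\n";   # ABCDEFGHIJKLMNOPQRSTUVWXY
```

Because a look-ahead consumes no input, each pattern still matches exactly one character, so it can stand wherever an ordinary character class would.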

The syntax provided by this module differs considerably from standard Perl regex syntax.

Any whitespace character (anything that matches /\s/) is allowed between tokens. Square brackets ('[' and ']') are used for grouping. Literal whitespace and literal square brackets must be escaped with a backslash ('\'). You cannot put a literal ']' at the start of a group.

A POSIX-style character class like [:alpha:] is allowed since its '[' is not a literal.

Separators ('&' for intersection, '|' for union, and '-' for subtraction) should be surrounded by one or more whitespace characters. E.g. [A&Z] is a list of 'A', '&', 'Z'; [A-Z] is a character range from 'A' to 'Z'; and [A-Z - Z] is the set obtained by removing [Z] from [A-Z].

The union operator '|' may be omitted. E.g. [A-Z | a-z] is equivalent to [A-Z a-z], and also to [A-Za-z].

The intersection operator '&' has high precedence, so [\p{A} \p{B} & \p{C} \p{D}] is equivalent to [\p{A} | [\p{B} & \p{C}] | \p{D}].

The subtraction operator '-' has low precedence, so [\p{A} \p{B} - \p{C} \p{D}] is equivalent to [[\p{A} | \p{B}] - [\p{C} | \p{D}]].

[\p{A} - \p{B} - \p{C}] is the set obtained by removing \p{B} and \p{C} from \p{A}. It is equivalent to [\p{A} - [\p{B} \p{C}]] and [\p{A} - \p{B} \p{C}].
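The precedence rules above can be worked through by hand on small ASCII sets. The sketch below does not call this module: each pattern is a hand-written look-ahead translation of the grouping described, and both the sample sets and the helper members() are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the candidates that a pattern matches
# as a single character, joined into one string.
sub members {
    my ($re, @chars) = @_;
    return join '', grep { /\A$re\z/ } @chars;
}

my @lower = ('a' .. 'z');

# '&' binds tightly: [a-c d-h & f-z x] groups as [a-c | [d-h & f-z] | x]
my $high = qr/[a-c]|(?=[d-h])[f-z]|x/;

# '-' binds loosely: [a-f m-p - c-d n] groups as [[a-f | m-p] - [c-d | n]]
my $low = qr/(?![c-dn])[a-fm-p]/;

# Chained subtraction: [a-h - c - f] is the same set as [a-h - [c f]]
my $chained  = qr/(?![c])(?![f])[a-h]/;
my $combined = qr/(?![cf])[a-h]/;

print members($high, @lower), "\n";      # abcfghx
print members($low, @lower), "\n";       # abefmop
print members($chained, @lower), "\n";   # abdegh
print members($combined, @lower), "\n";  # abdegh
```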

Negation: when '^' comes just after a group-opening '[', i.e. when they are combined as '[^', all the tokens that follow are negated. E.g. [^A-Z a-z] matches anything that is in neither [A-Z] nor [a-z]. You can say this more clearly with grouping, as [^ [A-Z a-z]].

If a '^' that is not next to '[' is prefixed to a sequence of literal characters, character ranges, and/or metacharacters, that '^' negates only that sequence; e.g. [A-Z ^\p{Latin}] matches A-Z or a non-Latin character. But [A-Z [^\p{Latin}]] (or, since this is a simple case, [A-Z \P{Latin}]) is recommended for clarity.

To remove everything other than the letters of PERL from [A-Z], you can write [A-Z & PERL] as well as [A-Z - [^PERL]]. Similarly, to intersect [A-Z] with everything that is not a letter of JUNK, you can write [A-Z - JUNK] as well as [A-Z & [^JUNK]].
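These two equivalences can be checked in plain Perl with hand-written look-ahead patterns. This is a sketch that assumes one possible translation of each set expression and does not show the module's actual output; the helper members() is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the candidates that a pattern matches
# as a single character, joined into one string.
sub members {
    my ($re, @chars) = @_;
    return join '', grep { /\A$re\z/ } @chars;
}

# Lowercase letters and digits are included to show that the negated
# classes alone would match far more than [A-Z].
my @candidates = ('A' .. 'Z', 'a' .. 'z', 0 .. 9);

my $keep_by_intersection = qr/(?=[A-Z])[PERL]/;    # like [A-Z & PERL]
my $keep_by_subtraction  = qr/(?![^PERL])[A-Z]/;   # like [A-Z - [^PERL]]
my $drop_by_subtraction  = qr/(?![JUNK])[A-Z]/;    # like [A-Z - JUNK]
my $drop_by_intersection = qr/(?=[A-Z])[^JUNK]/;   # like [A-Z & [^JUNK]]

print members($keep_by_intersection, @candidates), "\n";  # ELPR
print members($keep_by_subtraction,  @candidates), "\n";  # ELPR
print members($drop_by_subtraction,  @candidates), "\n";  # ABCDEFGHILMOPQRSTVWXYZ
print members($drop_by_intersection, @candidates), "\n";  # ABCDEFGHILMOPQRSTVWXYZ
```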

For further examples, please see the tests.

FUNCTION

$perl_regex = parse($unicode_character_class): parses a Character Class pattern according to Unicode Regular Expressions and converts it into a regular expression in Perl (returned as a string).

AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
