NAME

Mock::Data::Charset - Generator of strings from a set of characters

SYNOPSIS

# Export a handy alias for the constructor
use Mock::Data::Charset 'charset';

# Use perl's regex notation for [] charsets
my $charset = charset('A-Za-z');
        ... = charset('\p{alpha}\s\d');
        ... = charset(classes => ['digit']);
        ... = charset(ranges => ['a','z']);
        ... = charset(chars => ['a','e','i','o','u']);

# Test membership
charset('a-z')->contains('a') # true
charset('a-z')->count         # 26
charset('\w')->count          # 
charset('\w')->count('ascii') # 

# Iterate
my $charset= charset('a-z');
for (0 .. $charset->count-1) {
  my $ch= $charset->get_member($_);
}
# this one can be very expensive if the set is large:
for ($charset->members->@*) { ... }

# Generate random strings
my $str= $charset->generate($mockdata, 10); # 10 random chars from this charset
    ...= $charset->generate($mockdata, { min_codepoint => 1, max_codepoint => 127 }, 10);
    ...= $charset->generate($mockdata, { size => [5,10] }); # between 5 and 10 chars
    ...= $charset->generate($mockdata, { size => sub { 5 + int rand 5 } }); # same

DESCRIPTION

This generator is optimized for holding sets of Unicode characters. It behaves just like the Mock::Data::Set generator, but it also lets you inspect and iterate the member codepoints, and constrain the range of codepoints when generating strings.

CONSTRUCTOR

new

$charset= Mock::Data::Charset->new( %options );
$charset= charset( %options );
$charset= charset( $notation );

If you supply a single non-hashref argument to the constructor, it is assumed to be the "notation" string. Otherwise, it is treated as key/value pairs. You may specify the members of the charset by one of the attributes notation, members, or member_invlist, or construct it from the following charset-building options:

chars

An arrayref of literal character values to include in the set.

codepoints

An arrayref of Unicode codepoint numbers.

ranges

ranges => [ ['a','z'], ['0', '9'] ],
ranges => [ 'a', 'z', '0', '9' ],

An arrayref holding start/end pairs of characters, optionally with inner arrayrefs for each start/end pair.
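The two accepted shapes of "ranges" can be reduced to the same list of pairs. The following is a minimal sketch in core Perl; normalize_ranges is a hypothetical helper written for illustration and is not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mock::Data::Charset): reduce either
# accepted form of "ranges" to a list of [start, end] pairs.
sub normalize_ranges {
    my @spec = @{ shift() };
    my @pairs;
    while (@spec) {
        my $item = shift @spec;
        if (ref $item eq 'ARRAY') {
            push @pairs, [ @$item ];              # already a [start, end] pair
        } else {
            die "range start '$item' has no end" unless @spec;
            push @pairs, [ $item, shift @spec ];  # consume the matching end
        }
    }
    return \@pairs;
}

my $nested = normalize_ranges([ ['a','z'], ['0','9'] ]);
my $flat   = normalize_ranges([ 'a', 'z', '0', '9' ]);
print join(',', map { "$_->[0]-$_->[1]" } @$flat), "\n";   # a-z,0-9
```

Both calls yield the same pairs, which is why the inner arrayrefs are optional.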

codepoint_ranges

Same as ranges but with codepoint numbers instead of characters.

classes

An arrayref of character class names recognized by perl (such as Posix or Unicode classes).

negate

Negate the membership of the charset as described by chars/ranges/classes. This applies to the charset-building options, but has no effect on attributes.

The constructor may also be given any of the keys for "generate_opts", which will be moved into that attribute.

For convenience, you may export the "charset" function described in Mock::Data::Util, which calls this constructor.

If you call new on an object, it carries over the following settings to the new object: max_codepoint, generator_opts, member_invlist (unless chars change).

ATTRIBUTES

notation

A Perl regex charset notation: the text that occurs between '[...]' in a regex. (Note that if you use backslash notations, like notation => '\w', you should either use a single-quoted string or escape the backslashes, as in "\\w".)

This returns the same string that was passed to the constructor, if you gave the constructor a regex-notation string instead of more specific attributes. If you did not, a generic-looking notation will be built on demand. Read-only.

min_codepoint

Minimum codepoint to be returned from the generator. Read/write. This is useful if you want to eliminate control characters (or maybe just NULs) in your output.

max_codepoint

Maximum Unicode codepoint to be considered. Read-only. If you are only interested in a subset of the Unicode character space, such as ASCII, you can set this to a value like 0x7F and speed up the calculations on the character set.

str_len

This determines the length of string that will be returned from generate if no length is specified to that function. This may be a plain integer, an arrayref of [$min,$max], or a coderef that returns an integer: sub { 5 + int rand 10 }.
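The three accepted forms of a length specification could be resolved along these lines. This is only a sketch: resolve_len is a hypothetical helper, and treating the [$min,$max] form as inclusive of both ends is an assumption, not something this document confirms.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a length spec into a concrete integer.
sub resolve_len {
    my $spec = shift;
    return 1 unless defined $spec;                 # documented default of "generate"
    return $spec->() if ref $spec eq 'CODE';       # coderef decides the length
    if (ref $spec eq 'ARRAY') {
        my ($min, $max) = @$spec;
        # Assumption: both ends of the range are inclusive
        return $min + int rand($max - $min + 1);
    }
    return $spec;                                  # plain integer
}

print resolve_len(7), "\n";                        # 7
print resolve_len(sub { 3 }), "\n";                # 3
my $n = resolve_len([ 5, 10 ]);                    # some integer from 5 to 10
```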

count

The number of members in the set. Read-only.

members

Returns an arrayref of each character in the set. Try not to use this attribute, as building it can be very expensive for common sets like [:alpha:] (100K members, tens of MB of RAM). Use "member_invlist" or "get_member" instead, when possible, or set "max_codepoint" to restrict the set to characters you care about.

Read-only.

member_invlist

Return an arrayref holding the "inversion list" describing the members of this set. An inversion list stores the first codepoint belonging to the set, followed by the next higher codepoint which does not belong to the set, followed by the next that does, etc. This data structure allows for efficient negation/inversion of the list.

You may assign a new arrayref to this attribute, but you must not modify the existing array in place.
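To make the inversion list layout concrete, here is a small core-Perl sketch that tests membership and counts members for the set [0-9A-Z]. The helpers invlist_contains and invlist_count are hypothetical names used for illustration; they are not methods of this module.

```perl
use strict;
use warnings;

# Inversion list for [0-9A-Z]: members are 0x30..0x39 and 0x41..0x5A.
# Even indexes start a run of members, odd indexes start a run of non-members.
my @invlist = (0x30, 0x3A, 0x41, 0x5B);

# Hypothetical helper: binary search for how many boundaries are <= $cp.
sub invlist_contains {
    my ($invlist, $cp) = @_;
    my ($lo, $hi) = (0, scalar @$invlist);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($invlist->[$mid] <= $cp) { $lo = $mid + 1 } else { $hi = $mid }
    }
    # An odd number of boundaries at or below $cp means we are inside a member run
    return $lo % 2 == 1;
}

# Hypothetical helper: sum the run lengths. A trailing open-ended run
# (odd number of boundaries) is closed at $max_codepoint.
sub invlist_count {
    my ($invlist, $max_codepoint) = @_;
    my $n = 0;
    for (my $i = 0; $i < @$invlist; $i += 2) {
        my $end = $i + 1 < @$invlist ? $invlist->[$i + 1] : $max_codepoint + 1;
        $n += $end - $invlist->[$i];
    }
    return $n;
}

print invlist_contains(\@invlist, ord 'Q') ? "yes\n" : "no\n";   # yes
print invlist_count(\@invlist, 0x7F), "\n";                      # 36
```

Because only run boundaries are stored, the cost depends on the number of runs rather than the number of member characters.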

METHODS

generate

$charset->generate($mockdata, $len);
$charset->generate($mockdata, \%options, $len);
$charset->generate($mockdata, \%options);

Generate a string of characters from this charset. The %options may override the following attributes: "min_codepoint", "max_codepoint" (but only smaller values), and "str_len". The default length is 1 character.
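As a rough illustration of what constraining the codepoint range means for generation, the following core-Perl sketch draws random characters from an explicit list. The helper random_string is hypothetical, and filtering a materialized list is an assumption made for brevity; it says nothing about how this module selects characters internally.

```perl
use strict;
use warnings;

# Hypothetical helper: pick $len random members, optionally limited to a
# window of codepoints. Not the module's implementation.
sub random_string {
    my ($chars, $len, %opt) = @_;
    my $min = $opt{min_codepoint} // 0;
    my $max = $opt{max_codepoint} // 0x10FFFF;
    my @pool = grep { ord($_) >= $min && ord($_) <= $max } @$chars;
    die "no characters left in the requested codepoint range" unless @pool;
    return join '', map { $pool[ rand @pool ] } 1 .. $len;
}

my @alnum = ('0' .. '9', 'A' .. 'Z', 'a' .. 'z');
# max_codepoint 0x5A ('Z') leaves only digits and uppercase letters
my $str = random_string(\@alnum, 10, max_codepoint => 0x5A);
print "$str\n";
```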

compile

Return a plain coderef that invokes "generate" on this object.

parse

my $parse_info= Mock::Data::Charset->parse('\dA-Z_');
# {
#   codepoints        => [ ord '_' ],
#   codepoint_ranges  => [ ord "A", ord "Z" ],
#   classes           => [ 'digit' ],
# }

This is a class method that accepts a Perl-regex-notation string for a charset and returns a hashref of the arguments that should be passed to the constructor.

This dies if it encounters a syntax error or any Perl feature that is not implemented.

get_member

my $char= $charset->get_member($offset);

Return the Nth character of the set, starting from 0. Returns undef for values greater than or equal to "count". You can use negative offsets to index from the end of the list, like in substr.
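Indexing by offset does not require building the full member list. The sketch below walks an inversion list to find the Nth member; invlist_get is a hypothetical helper, it assumes a list with an even number of boundaries, and returning undef for a negative offset that reaches past the start is an assumption.

```perl
use strict;
use warnings;

my @invlist = (0x30, 0x3A, 0x41, 0x5B);   # [0-9A-Z]

# Hypothetical helper: Nth member of a closed inversion list.
sub invlist_get {
    my ($invlist, $offset) = @_;
    my $count = 0;
    for (my $i = 0; $i < @$invlist; $i += 2) {
        $count += $invlist->[$i + 1] - $invlist->[$i];
    }
    $offset += $count if $offset < 0;              # negative offsets count from the end
    return undef if $offset < 0 || $offset >= $count;
    for (my $i = 0; $i < @$invlist; $i += 2) {
        my $len = $invlist->[$i + 1] - $invlist->[$i];
        return chr($invlist->[$i] + $offset) if $offset < $len;
        $offset -= $len;                           # skip this run entirely
    }
    return undef;
}

print invlist_get(\@invlist, 10), "\n";   # A
print invlist_get(\@invlist, -1), "\n";   # Z
```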

get_member_codepoint

Same as "get_member" but returns a codepoint integer instead of a character.

find_member

my ($offset, $ins_pos)= $charset->find_member($char);

Return the index of a character within the members list. If the character is not a member, this returns undef, but if you call it in list context the second element gives the position where it would be found if it were a member.
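The two return values can be pictured with a sketch over an inversion list. invlist_find is a hypothetical helper; what the real method returns as the second element when the character is found is not stated here, so repeating the index in that case is an assumption.

```perl
use strict;
use warnings;

my @invlist = (0x30, 0x3A, 0x41, 0x5B);   # [0-9A-Z]

# Hypothetical helper: index of $char, plus the insertion position.
sub invlist_find {
    my ($invlist, $char) = @_;
    my $cp     = ord $char;
    my $before = 0;                        # members in runs entirely below $cp
    my $idx;
    for (my $i = 0; $i < @$invlist; $i += 2) {
        my ($start, $end) = @{$invlist}[ $i, $i + 1 ];
        last if $cp < $start;              # falls in the gap before this run
        if ($cp < $end) { $idx = $before + $cp - $start; last }
        $before += $end - $start;
    }
    # Assumption: when found, the insertion position equals the index
    my $pos = defined $idx ? $idx : $before;
    return wantarray ? ($idx, $pos) : $idx;
}

my ($offset, $ins_pos) = invlist_find(\@invlist, '@');
print defined $offset ? $offset : 'undef', " $ins_pos\n";   # undef 10
```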

negate

my $charset2= $charset->negate;

Return a new charset containing exactly the characters that are not in this one, up to the "max_codepoint" if defined.
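Negation is cheap on an inversion list because only the first boundary changes. The following sketch shows the idea with a hypothetical helper, invlist_negate; the way it clamps to a maximum codepoint is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: negate an inversion list, optionally clamped.
sub invlist_negate {
    my ($invlist, $max_codepoint) = @_;
    my @neg = @$invlist;
    # Toggling a boundary at codepoint 0 flips every run
    if (@neg && $neg[0] == 0) { shift @neg } else { unshift @neg, 0 }
    if (defined $max_codepoint) {
        # Drop boundaries beyond the limit, then close a trailing open run
        pop @neg while @neg && $neg[-1] > $max_codepoint;
        push @neg, $max_codepoint + 1 if @neg % 2;
    }
    return \@neg;
}

my $neg = invlist_negate([ 0x30, 0x3A, 0x41, 0x5B ], 0x7F);   # not [0-9A-Z], ASCII only
print join(' ', map { sprintf '%02X', $_ } @$neg), "\n";      # 00 30 3A 41 5B 80
```

Negating the result again with the same limit gives back the original list.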

union

my $charset3= $charset1->union($charset2, ...);

Merge one or more charsets. The result contains every character found in any of the sets, clamped to the max_codepoint of the current set.

The arguments may also be plain inversion list arrayrefs instead of charset objects.
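Since plain inversion lists are accepted, it may help to see what a union of such lists looks like. This is a sketch under stated assumptions: invlist_union is a hypothetical helper, it expects lists with an even number of boundaries, and it omits the clamp to max_codepoint.

```perl
use strict;
use warnings;

# Hypothetical helper: union of closed inversion lists.
sub invlist_union {
    my @lists = @_;
    my @ranges;
    for my $list (@lists) {
        for (my $i = 0; $i < @$list; $i += 2) {
            push @ranges, [ $list->[$i], $list->[$i + 1] ];   # [start, end-exclusive]
        }
    }
    @ranges = sort { $a->[0] <=> $b->[0] } @ranges;
    my @out;
    for my $r (@ranges) {
        if (@out && $r->[0] <= $out[-1]) {
            # Overlaps or touches the previous run: extend it if needed
            $out[-1] = $r->[1] if $r->[1] > $out[-1];
        } else {
            push @out, @$r;
        }
    }
    return \@out;
}

my $digits = [ 0x30, 0x3A ];
my $upper  = [ 0x41, 0x5B ];
my $punct  = [ 0x3A, 0x41 ];               # the characters between '9' and 'A'
my $merged = invlist_union($digits, $upper, $punct);
print join(' ', map { sprintf '%02X', $_ } @$merged), "\n";   # 30 5B
```

The three adjacent runs collapse into one, which keeps the merged list as short as possible.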

AUTHOR

Michael Conrad <mike@nrdvana.net>

VERSION

version 0.04

COPYRIGHT AND LICENSE

This software is copyright (c) 2024 by Michael Conrad.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.