NAME
Random::Skew - Set up a pool of items to return one of, randomly...with some more likely than others
SYNOPSIS
use Random::Skew;
Random::Skew::GRAIN( 37 );
Random::Skew::ROUNDING( 0.5 );
my $rs = Random::Skew(
CA => 50, # CA will show up most often with the highest probability
NY => 45,
IL => 32,
TX => 31,
FL => 20,
RI => 3,
WY => 1, # WY is unlikely, but always possible
);
while ( ... ) {
my $random = $rs->item();
...
}
DESCRIPTION
For generating random data with a bit of skew to the proportions -- you first set up a pool of weighted random items to choose from, and then return one of those items at random. The relative weighting determines how likely any particular item is returned.
my $rs = Random::Skew->new(
# item => weight,
Glut => 999,
Plenty => 234,
Some => 79,
Rare => 1,
);
In the above example, "Glut" is (roughly) 999 times as likely to show up as "Rare" is. "Plenty" is 234 times as likely, and "Some" is 79 times as likely. The percentages actually returned may deviate a bit from the requested weightings, but not by much.
The higher the relative 'weight' the more likely that item will appear in the result. The lower the relative 'weight' the less likely.
For reasonably simple hierarchical data, you could try complex hash keys:
my $rs = Random::Skew->new(
'CA:Los Angeles' =>10000,
'CA:San Diego' => 6700,
'CA:San Francisco' => 4150,
'NY:New York' => 8888,
'NY:Albany' => 555,
'IL:Chicago' => 8100,
'IL:Springfield' => 321,
);
BACKGROUND
The sampling population of buckets backstage tries to be optimal. If you request weights 2,3,5 the sample population would only need 10 buckets ($tot=2+3+5
) for perfect fidelity.
You could request weights of 200, 300, 500 and with a $Random::Skew::GRAIN
of 1000 it will create an array with 200, 300 and 500 instances of each item... which is wasteful. However, those exact same weights with a $Random::Skew::GRAIN
of 10 would use the exact same array as the smaller example above (2,3,5) since the proportions are identical. To fit the 1000 items into the ten element array (when $Random::Skew::GRAIN
is 10) it would calcualte a $scale
of 0.01 and apply that $scale
to all weights.
If you tried a $Random::Skew::GRAIN
of 7 for these values, then your weights would suffer a bit from rounding issues. Instead of 5, 3, 2 you'd scale down to 3.5, 2.1, 1.4 (everything would be scaled by 7/10 to squeeze the 10 requested into a 7 array) and these would round down to 3, 2, 1 yielding an actual array of only 6 items. So instead of the requested proportions 50%, 30%, 20% you'd have 50% (3/6), 33.3% (2/6), 16.7% (1/6). A larger $Random::Skew::GRAIN
would help in this case.
In any case if the $tot
population is more than $Random::Skew::GRAIN
it'll calculate a multiplier $scale
to squeeze the weighted elements to fit into the array, while trying to keep your weighted ratios reasonably consistent.
Note: Set $Random::Skew::GRAIN
and $Random::Skew::ROUNDING
BEFORE you call Random::Skew->new()
. Changing them after you already have a Random::Skew
object won't affect it at all; it will alter the next new object create.
You can have a wide range of samples and it tries hard to represent all values, large and small, in a manageable footprint.
my $rs = Random::Skew->new(
Cornucopia => 2_000_000,
Overflowing => 173_492,
Scarce => 208,
Unlikely => 19,
);
It probably won't represent your weightings with exact 100% fidelity, but it tries to get pretty close.
There's also $Random::Skew::ROUNDING
(a value between 0.0 and 1.0) that affects fidelilty. In the case of squeezing 200, 300, 500 into a $Random::Skew::GRAIN
of 7, instead of scaling down to 3.5, 2.1, 1.4 (which would truncate to 3 buckets, 2 buckets, and 1 bucket) if $Random::Skew::ROUNDING=0.5
then you'd have 3.5+.5 (4.0), 2.1+.5 (2.6), 1.4+.5 (1.9) which truncate to 4 buckets, 2 buckets, and 1 bucket. The proportions are a bit off in both cases, it's up to you to determine which $GRAIN
and which $ROUNDING
provide the probabilities you require.
To ameliorate rounding issues, see Random::Skew::Test for how to explore different values of $Random::Skew::GRAIN
and $Rounding::Skew::ROUNDING
.
METHODOLOGY
Representing weightings that don't have a lot of variation, is a piece of cake.
Some => 5,
Thing=> 3,
Other=> 2
Those can be represented by items in an array of ten buckets which we build like so:
Some, Some, Some, Some, Some, Thing, Thing, Thing, Other, Other
Returning one of those at random is trivial. That exact same array would also work nicely for weightings of 500, 300 and 200, or 5_000_000, 3_000_000, 2_000_000, since the relative proportions are identical. (Note that fractional weightings of 0.5, 0.3, 0.2 won't work well since Perl can't populate a fraction of an array element. Weightings are expected to be positive integers for this reason.)
Weightings like the one below, though, are more challenging, because of the range from large to small:
huge => 1_000_000,
mid => 398_507,
small=> 4_362,
tiny => 1,
Having an array with a 1.4 million buckets to represent that population, is NOT a good use of resources.
So we break it up into sections.
First, we sum all the weights ($tot
). If that can fit within an array of $Random::Skew::GRAIN
elements (or smaller) then that's what we do, and we're done.
If not, we conjure a scale $scale
as a multiplier so that the entire weighted population will all fit within $Random::Skew::GRAIN
elements. And then, using the same $scale
we drop items into buckets starting with the bigger ones and moving to the smaller ones.
When the smaller ones are so small they would only fill a portion of one bucket at the current scale (driven largely by $Random::Skew::GRAIN
) that's when we go recursive.
The smaller items get a Random::Skew
object of their own (we already handled the biggest objects) and the new object becomes the smallest "bucket".
But there's more.
Say we have a $Random::Skew::GRAIN
of 25 and the following weights:
big1 => 500,
big2 => 400,
sm1 => 50,
sm2 => 40,
tiny => 10
Here, $tot = 1000
. To scale that down to 25 buckets, we calculate $scale = 25/1000 = 0.025
and we multiply each item's weight by that same $scale
to populate an array of 25 items and we get:
big1 12x
big2 10x
sm1 1x
sm2 1x
And tiny
is way too small for a bucket of its own at this scale. In fact it would only need 0.25 of a bucket. That 0.25 is the $fraction
we will use in a moment.
So we create a new Random::Skew->new()
object for the smaller items (in this case it's just the one tiny
item). We requested ten items for tiny
and that's what we will get, in the second-layer object.
Later, when we ask for a $rs->item()
it starts with a floating point random number $ix
between 0 and @{ $rs->{_set} }
. If $ix <= $fraction
then it calls the recursive object for a random item which would only be 'tiny' in this case. Otherwise, if that number represents one of the buckets in the current $rs->{_set}
array it returns that.
There can be a little gray area here -- a gap. If the random $ix
is less than $fraction then we recurse thru the smaller subset. If random $ix
is > 1.0 then it represents one of the buckets in the current set and we return that.
When it's between $fraction
and 1.0... what then? This is the portion of the "smallest bucket" that doesn't belong to the smaller items. We simply iterate again and try a new random number.
For some skew-weightings, a SMALL value for $Random::Skel::GRAIN
works best, and for others a LARGE value does better. See Random::Skew::Test for how to explore various values of $Random::Skew::GRAIN
to best fit your data.
CAVEATS
The value of $Random::Skew::GRAIN
can have a significant impact on how faithful the results are to your original weighting proportions. See Random::Skew::Test for how to explore varying values of $Random::Skew::GRAIN
.
If you want twice as many A as B you can have a grain of 3: A, A, B. A grain of 6 or 9 or 12 would work fine as well. On the other hand, it would be difficult to represent that faithfully with a grain of 4 or 5. You'd get rounding issues and some items would return more often than expected, others less often than expected.
Sometimes a larger grain (2500) works well, other times a smaller grain (13 for example) does better, with recursion.
SEE ALSO
Random::Skew::Test for exploring values of $Random::Skew::GRAIN
and $Random::Skew::ROUNDING
.
Random::Set was inspirational in getting this off the ground.
AUTHOR
Will Trillich <will@serensoft.com>
COPYRIGHT AND LICENSE
Copyright (C) 2022 by Will Trillich
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.28.1 or, at your option, any later version of Perl 5 you may have available.