NAME
Tree::BPTree - Perl implementation of B+ trees
SYNOPSIS
use Tree::BPTree;
# These arguments are actually the defaults
my $tree = new Tree::BPTree(
-n => 3,
-unique => 0,
-keycmp => sub { $_[0] cmp $_[1] },
-valuecmp => sub { $_[0] <=> $_[1] },
);
# index the entries in this string:
my $string = "THERE'S MORE THAN ONE WAY TO DO IT"; # TMTOWTDI
my $i = 0;
$tree->insert($_, $i++) foreach (split //, $string);
# find the index of the first 'T'
my $t = $tree->find('T');
# find the indexes of every 'T'
my @t = $tree->find('T');
# We don't like the word 'WAY ', so let's remove it
my $i = index $string, 'W';
$tree->delete($_, $i++) foreach (split //, substr($string, $i, 4));
# Reverse the sort order
$tree->reverse;
# Iterate through each key/value pair just like built-in each operator
while (my ($key, $value) = $tree->each) {
print "$key => $value\n";
}
# Reset the iterator when we quit from an "each-loop" early
$tree->reset;
# You might also be interested in using multiple each loops at once, which is
# possible through the cursor syntax. You can even delete individual pairs
# from the list during iteration.
my $cursor = $tree->new_cursor;
while (my ($key, $value) = $cursor->each) {
my $nested = $tree->new_cursor;
while (my ($nkey, $nvalue) = $nested->each) {
if ($key->shouldnt_be_in_this_tree_with($nkey)) {
$nested->delete;
}
}
}
# Iterate using an iterator subroutine
$tree->iterate(sub { print "$_[0] => $_[1]\n" });
# Iterate using an iterator subroutine that returns the list of return values
# returned by the iterator
print join(', ', $tree->map(sub { "$_[0] => $_[1]" })),"\n";
# Grep-like operations
my @pairs = $tree->grep (sub { $_[0] =~ /\S/ });
my @keys = $tree->grep_keys (sub { $_[0] =~ /\S/ });
my @values = $tree->grep_values (sub { $_[0] =~ /\S/ });
# Get all keys, values
my @all_keys = $tree->keys;
my @all_values = $tree->values;
# Clear it out and start over
$tree->clear;
DESCRIPTION
B+ trees are balanced trees which provide an ordered map from keys to values. They are useful for indexing large bodies of data. They are similar to 2-3-4 Trees and Red-Black Trees. This implementation supports B+ trees using an arbitrary n value.
STRUCTURE
Each node in a B+ tree contains n pointers and n - 1 keys. The pointers in the node are placed between the ordered keys so that there is one pointer on either end and one pointer in between each value. Searching for a key involves checking to see which keys in the node the key falls between and then following the corresponding pointers down the tree.
The pointers in the branches of thre tree always point to nodes deeper in the tree. The leaves use all pointers but the last to point to buckets containing values. The last pointer in each leaf forms a singly-linked list called the linked leaf list. Iterating through this list gives us an ordered traversal of all keys and/or values in the tree.
Finally, all non-root branch nodes must contain at least n/2 pointers. If it becomes necessary to add values to a node which already contains n pointers, then the node will be split in half first (possibly requiring the split of parents). If deletion of a node leaves a branch with fewer than n/2 pointers, the node will either be coalesced (joined to) a neigboring node or it will take on a pointer from a neighbor node. Coalescing can also result in the further rebalancing of the tree in parents using more coalesce or redistribute operations.
Here's a diagram of a valid B+ tree when n = 3 that stores my last name, "HANENKAMP":
---<K>----
/ \
/ \
<H> --<N>--
/ \ / \
/ \ / |
<A,E>> <H>> <K,M>> <N,P>>
/ \ | / \ / \
/ \ | | | | \
[1,6] [3] [0] [5][7] [2,4] [8]
Anyway, you don't need to know any of that to use this implementation. The abstraction layer set on top makes it look something like a typical hash. Insertion and deletion both require a specific key and value since multiple values can be mapped to each key--unless the "-unique" flag has been set.
By default, the tree assumes that it is being used to map strings to indexes. I chose to set this default because this is the most common use I will put it to. That is, I have lists of strings that I want to index, so the keys will be the strings to index and the values will be indexes into the list.
If you need to store something different, all you need to do is store a reference to the objects (keys or values) and set the "-keycmp" and "-valuecmp" options to appropriate values during initialization.
PERFORMANCE
At some point, I want to post the best, average, and worst-case operation speed for this implementation of B+ trees, but for now we'll just have to live without those stats. For raw benchmarks, you should see the BUGS section as the actual performance of this module is pretty slow.
IMPLEMENTATION
As a quick note on implementation, if you want to know how specific operations work, please browse the source. I have included extensive comments within the definitions of the methods themselves explaining most of the important steps. I did this for my own sanity because B+ trees can be quite complicated.
This code has been optimized a bit, but I haven't nearly made as many optimizations as are likely possible. I'm open to any suggestions. If you have some, send me email at the address given below.
METHOD REFERENCE
- $tree = Tree::BPTree->new(%args)
-
The constructor builds a new tree using the given arguments. All arguments are optional and have defaults that should suit many applications. The arguments include:
- -n
-
This sets the maximum number of pointers permitted in each node. Setting this number very high will cause search operations to slow down as it will spend a lot of time searching arrays incrementally--something like a binary search could be used to speed these times a bit, but no such method is used at this time. Setting this number very low will cause insert and delete operations to slow down as they are required to split and coalesce more often. The default is the minimum value of 3.
- -unique
-
This determines whether keys are unique or not. If this is set, then an exception will be raised whenever an insert is attempted for a key that already exists in the tree.
- -keycmp
-
This is a comparator function that takes two arguments and returns -1, 0, or 1 to indicate the result of the comparison. If the first argument is less than the second, then -1 is returned. If the first argument is greater than the second, then 1 is returned. If the arguments are equal, then 0 is returned. This comparator should be appropriate for comparing keys. By default, the built-in string comparator
cmp
is used. See perlop for details oncmp
. - -valuecmp
-
This is a comparator function that takes two arguments and returns -1, 0, or 1 to indicate the result of the comparison--just like the "-keycmp" argument. This comparator should be appropriate for comparing values. By default, the built-in numeric comparator
<=>
is used. See perlop for details on<=>
.
The tree created by this constructor is always initially empty.
- $value = $tree->find($key)
- @values = $tree->find($key)
-
This method attempts to find the value or values in the bucket matching
$key
. If no such$key
has been stored in the tree, thenundef
is returned. If the$key
is found, then either the first value stored in the bucket is returned (in scalar context) or all values stored are returned (in list context). Using scalar context is useful when the tree stores unique keys where there will never be more than one value per key. - $tree->insert($key, $value)
-
This method inserts the key/value pair given into the tree. If the tree requires unique keys, an exception will be thrown if
$key
is already stored. - $tree->delete($key, $value)
-
This method removes the key/value pair given from the tree. If the pair cannot be found, then the tree is not changed. If
$value
is stored multiple times at$key
, then all values matching$value
will be removed. - $tree->reverse
-
Reverse the sort order. This is done by reversing every key in the tree, adjusting the linked leaf list, and replacing the "-keycmp" method with a new one that simply negates the old one. If this method is called again, the same node reversal will happen, but the original "-keycmp" will be reinstated rather than doing a double negation.
- $cursor = $tree->new_cursor
-
This method allows you to have multiple, simultaneous iterators through the same index. If you pass the
$cursor
value returned fromnew_cursor
toeach
, it will be used instead of the default internal cursor. That is,my $c1 = $tree->new_cursor; my $c2 = $tree->new_cursor; while (my ($key, $values) = $tree->each($c1)) { # let's go through $c1 twice as fast my ($nextkey, $nextvalue) = $tree->each($c1); # next is an alias for each my ($otherkey, $othervalue) = $tree->next($c2); } # and we can reset $c2 after we're done too $tree->reset($c2);
Cursors also have their own methods, so this same snippet could have been written like this instead:
my $c1 = $tree->new_cursor; my $c2 = $tree->new_cursor; while (my ($key, $value) = $c1->each) { # let's go through $c1 twice as fast my ($nextkey, $nextvalue) = $c1->each; # next is an alias for each my ($otherkey, $othervalue) = $c2->each; } # and we can reset $c2 after we're done too $c2->reset;
There are additional features provided with cursors that are not provided when using the internal cursor. You may delete the last key/values pair returned by a call to
each
/next
by callingdelete
on the cursor. Or, you may specify a specific value in the bucket to be deleted. For example:my $cursor = $tree->new_cursor; while (my ($key, $value) = $cursor->next) { # In this example, the keys are objects with a is_bad method. If "bad" is # set, we want to remove the corresponding values. if ($key->is_bad) { $cursor->delete; } }
This form of delete is completely safe and will not cause the iterator to slip off track as a similar operation might mess up array iteration if one isn't careful.
Another feature of cursors, is that you may retrieve the previously returned value by calling the
current
method. This will return the same result as the last call tonext
oreach
. That is, unlessreset
has been called ordelete
removed the previously returned key, then this will return an empty list.For example:
# This assumes you use the typical string keys with numeric values $cursor = $tree->new_cursor; while (my ($key, $value) = $cursor->next) { my ($currkey, $currval) = $cursor->current; die unless $key eq $currkey and $value == $currval }
This example shouldn't die.
- ($key, $value) = $tree->each [ ($cursor) ]
-
This method provides a similar facility as that of the
each
operator. Each call will iterate through each key/value pair in sort order. After the last key/value pair has been returned,undef
will be returned once before starting again. This is useful for using withinwhile
loops:while (my ($key, $value) = $tree->each) { # do stuff }
- $tree->reset [ ($cursor) ]
-
Reset the given cursor to a fresh state--that is, ready to return the first value on the next call to
each
. If no$cursor
is given, then the default internal cursor is reset. - $tree->iterate(\&iter)
-
For each key/value pair in the database, the function
&iter
will be called with the key as the first argument and value as the second. Iteration will occur in sort order. - @results = $tree->map(\&mapper)
-
Nearly identical to
iterate
, this method captures the return values of each call and then returns all the results as a list. The&mapper
function takes the same arguments as initerate
. - @pairs = $tree->grep(\&pred)
- @keys = $tree->grep_keys(\&pred)
- @values = $tree->grep_values(\&pred)
-
Iterates through all key/value pairs in sort order. For each key/value pair, the function
&pred
will be called by passing the key as the first argument and the value as the second. If&pred
returns a true value, then the matched value will be added to the returned list.grep
returns a list of pairs such that each element is a two-element array reference where the first element is they key and the second is the value.grep_keys
returns a list of keys.grep_values
returns a list of values. - @pairs = $tree->pairs
- @keys = $tree->keys
- @values = $tree->values
-
Returns all elements of the given type.
pairs
returns all key/value pairs stored in the tree. Each pair is returned as an array reference contain two elements. The first element is the key. The second element is a bucket, which is an array-reference of stored values.keys
returns all keys stored in the tree.values
returns all values stored in the tree. - $tree->clear
-
This method empties the tree of all values. This basically creates a new tree and allows the old tree to be garbage collected at the interpreter's leisure.
CREDITS
The basis for B+ trees implemented here can be found in Database System Concepts, 4th ed. by Silbershatz et al. published by McGraw-Hill. I have somewhat modified the structure specified there to make the code easier to read and to adapt the code to Perl.
In addition, while preparing to write this module I also consulted an old book of mine, C++ Algorithms by Robert Sedgewick (Addison Wesley), for more general information on trees. I also used some ideas on how and when to perform split, coalesce, and redistribute as the Silbershatz pseudo-code is a little obfuscated--or at least, the different operations are presented monolithically so that it's difficult to digest. The sections in Sedgewick on 2-3-4 and Red-Black trees were especially helpful.
BUGS
This module is pretty slow. Better performance is possible, especially for small bodies of data, if you use a hash to do most of these operations. See benchmark.pl for a sample of the performance issues. There you can also find code for performing essentially the same thing using different data structures.
On my machine, a small benchmark showed the following:
Insert into B+ Trees (this implementation) is:
61 times slower than hash insert and
3.9 times slower than ordered list insert.
Ordered iteration of B+ Trees is:
1.6 times slower than ordering a hash and then iterating the pairs and
14 times slower than iterating through an ordered list.
Finding a key in B+ Trees is:
34 times slower than hash fetch but
1.2 times faster than searching an ordered list (with grep, which probably
isn't the fastest solution, a manual binary search should be better).
I'm still putting together more benchmarks and looking into places where improvement is possible. Iteration of this structure should scale better than taking a hash and ordering the keys to iterate through.
I have made some recent headway by removing some simple functions and replacing them with raw computation. If I did this the way I'd really like to, I need to find or build a Filter::Simple module to perform something similar to a C #define
or C++ inline
function. However, instead I just did a search and replace with Vim.
I should probably port this to XS to make it really compete with built-in hashes.
AUTHOR
Andrew Sterling Hanenkamp, <hanenkamp@users.sourceforge.net>
COPYRIGHT AND LICENSE
Copyright 2003 by Andrew Sterling Hanenkamp
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.