NAME

Regexp::Keywords - A regexp builder to test against keywords lists

VERSION

Version 0.03

SYNOPSIS

This module helps you to search inside a list of keywords for some of them, using a simple query syntax with AND, OR and NOT operators and grouping.

use Regexp::Keywords;
my $kw = Regexp::Keywords->new();

my $wanted = 'comedy + ( action , romance ) - thriller';
$kw->prepare($wanted);

my $movie_tags = 'action,comedy,crime,fantasy,adventure';
print "Buy ticket!\n" if $kw->test($movie_tags);

Keywords, also known as tags, are used to classify things in a category. Many tags can be assigned at the same time to an item, even if they belong to different available categories.

In real life, keywords lists are found in:

  • Public databases like IMDB.

  • Metadata of HTML pages from public services, such as Picasa or YouTube.

  • Metadata of Word, Excel and other documents.

CONSTRUCTOR

new ( )

Creates a Keywords object. Some attributes can be initialized from the constructor. See Attributes for a description of them.

Example: To create a Keywords object that will be used to test strings with mixed case keywords:

my $kw = Regexp::Keywords->new(ignore_case => 1);

BUILDING METHODS

$kw->prepare( $query )

Parses a query and builds a regexp pattern to be used for testing keywords strings. Dies on malformed query expressions. See Query Expressions later in this document.
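To give a feel for the kind of pattern a boolean query can turn into, here is a minimal hand-written sketch in plain Perl. It assumes lookahead assertions and \b word boundaries as the building blocks; the pattern that prepare actually generates is not shown in this document and may well be different.

```perl
use strict;
use warnings;

# Hand-written equivalent of: comedy & (action | romance) & !thriller
# (an illustration only, not the pattern generated by the module).
my $re = qr/\A
    (?= .* \b comedy \b )                  # AND: comedy must be present
    (?= .* \b (?: action | romance ) \b )  # OR: at least one of the two
    (?! .* \b thriller \b )                # NOT: thriller must be absent
/xs;

my @tag_lists = (
    'action,comedy,crime,fantasy,adventure',
    'comedy,thriller,action',
    'comedy,drama',
);

my @verdicts = map { $_ =~ $re ? 'match' : 'no match' } @tag_lists;
print "$tag_lists[$_]: $verdicts[$_]\n" for 0 .. $#tag_lists;
```

Only the first list satisfies all three conditions; the second contains thriller and the third has neither action nor romance.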

$kw->set( attribute => value [, ...] )

Changes one or more attributes after the object creation. Dies on unknown attributes. See the Attributes section for a description of each attribute.

Note: Some of these attributes invalidate the associated regexp if it was already built, so an automatic reparse or rebuild is done after all the specified attributes have been changed. For the same reason, it is better to call set once with several parameters than to set them one at a time.

Note: It's not recommended to modify the attributes directly in the object, because you could get unexpected results if the query is not parsed or built again.

$kw->get( 'attribute' )

This method returns the current value for the specified attribute. Dies on unknown attributes. See Attributes section for a list of available attributes.

$kw->reparse( )

If any of the object's attributes change, a reparse of the source query may be required, depending on the affected attribute. Dies on bad queries.

$kw->rebuild( )

If any of the object's attributes change, a rebuild of the regexp may be required, depending on the affected attribute. Dies on badly parsed queries.

KEYWORDS TESTING METHODS

$kw->test( $keyword_list )

Returns true if the list matches the parsed query, otherwise returns false. Dies if no query has been parsed yet.

$kw->grep( @list_of_kwlists )

Returns an array containing only the keywords lists that match the parsed query. Dies if no query has been parsed yet.

$kw->grep_keys( %hash_of_kwlists )

Returns an array of the keys from a hash whose corresponding values satisfy the query. Dies if no query has been parsed yet.

@selected_keys = $kw->grep_keys(map {$_ => $table{$_}[$col]} keys %table);

@selected_indexes = $kw->grep_keys(map {$_ => $array[$_]} 0 .. $#array);

EXPORTED FUNCTION

The following function can be imported and accessed directly from your program.

keywords_regexp( $query [, $ignore [, $multi [, $partial [, $texted ] ] ] ] )

Returns a regular expression (qr/.../) for a query to which keywords lists strings can be tested against.

See the Attributes section for a description of the attribute that corresponds to each parameter and the default value used if it is omitted.

ATTRIBUTES

Object's attributes can control how to parse a query, build a regular expression or test strings.

Although it is possible to access them using $kw->{attribute}, it's better to read them with $kw->get() and change them with $kw->set(), because some validations are done to keep things consistent.

ignore_case

Defines if the regexp should be case (in)sensitive.

Defaults to case sensitive (a value of 0). Set it to 1 to make the regexp case insensitive.

Note: Changing this attribute with $kw->set() after the regexp has been built causes the regexp to be rebuilt from parsed_query.

multi_words

This attribute controls whether the keywords list may include many words as a single keyword.

The default (0) is to treat each word as a keyword. When this attribute is 1, the keywords list may include many words as a single keyword. When it is set to 2, the delimiter between words is not a space. To search for such a keyword, write the words between quotes in the query string.

Note: Changing this attribute with $kw->set() after the regexp has been built causes the regexp to be rebuilt from parsed_query.

Note: When set to 0 or 2, a query with strings in quotes could match a keyword list if each word is present in the list, side by side in the same order.

parsed_query

Contains the query in the internal boolean format, which is required to build the regexp.

partial_words

By default (value of 0), only words that match exactly return true when a keywords list is tested. Set this attribute to 1 if you want to match lists where keywords contain words from the query.

For example, "word" will match if a list contains "words", but "query" won't match "queries".

Note: Changing this attribute with $kw->set() after the regexp has been built causes the regexp to be rebuilt from parsed_query.

Note: Setting both partial_words and multi_words to 1 could return unexpected results on tests, because only the first and last words of a multi-word keyword are treated as partial, and only on their outer sides.
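The difference between exact and partial matching can be pictured with two hand-written patterns. This is a sketch that assumes word boundaries (\b) for the exact case and optional surrounding word characters for the partial case; it is not taken from the module's source.

```perl
use strict;
use warnings;

# Assumed illustration: exact matching anchors the word on both sides,
# partial matching lets extra word characters surround it.
my $exact   = qr/\bword\b/;
my $partial = qr/\b\w*word\w*\b/;

my $list = 'some,words,here';
my $exact_hit   = $list =~ $exact   ? 1 : 0;   # 0: "words" is not exactly "word"
my $partial_hit = $list =~ $partial ? 1 : 0;   # 1: "words" contains "word"

# "queries" does not contain the string "query", so even partial matching fails.
my $query_hit = 'sql,queries' =~ /\b\w*query\w*\b/ ? 1 : 0;

print "exact=$exact_hit partial=$partial_hit query=$query_hit\n";
```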

query

Contains the original query in the free-style syntax.

regexp

Contains the regular expression built for the object's query. It's a qr/.../ value!

texted_ops

AND, OR and NOT operators are represented by some punctuation chars. In default mode (0), any of the words AND, OR and NOT used in a query is treated as a keyword to be matched in the keywords list. Set this attribute to 1 to allow the words AND, OR and NOT to be used as operators in query expressions instead of keywords to match.

Note: Changing this attribute with $kw->set() after a regexp has been built forces the query to be reparsed into parsed_query and the regexp to be rebuilt.
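One way to picture what texted_ops enables is a plain substitution of the three words by their punctuation equivalents. The helper below, words_to_ops, is hypothetical and written only for this illustration; whether the module works this way, and whether it accepts lowercase words, is an assumption not confirmed here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): rewrite the words
# AND, OR and NOT as their punctuation equivalents.
sub words_to_ops {
    my ($query) = @_;
    my %op_for = (AND => '&', OR => '|', NOT => '!');
    $query =~ s/\b(AND|OR|NOT)\b/$op_for{$1}/g;
    return $query;
}

my $converted = words_to_ops('comedy AND (action OR romance) AND NOT thriller');
print "$converted\n";   # comedy & (action | romance) & ! thriller
```

The \b anchors keep longer words such as ANDROID untouched.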

KEYWORD LISTS

A keyword is a combination of letters, underscores and numbers (/\w+/ pattern). Sometimes, more than one word can be used to create a keyword, with a space between them.

Keyword lists are string values with words, usually delimited by comma or any other punctuation sign. Spaces may also appear surrounding them.

There is no validation for field names inside a keywords list. In fact, those names are also treated as keywords by themselves (see Tricks).
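As a small sketch of that last point, extracting every /\w+/ run from a list shows how a field name and its value both end up as words. This is plain Perl and assumes the default single-word view of a list; it does not call the module.

```perl
use strict;
use warnings;

my $list = 'foo, own:yes, read:yes, rating:3';

# Every run of word characters counts as a word, so "own:yes"
# contributes both "own" and "yes".
my @words = $list =~ /\w+/g;

print join(' ', @words), "\n";   # foo own yes read yes rating 3
```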

QUERY EXPRESSIONS

A query expression is a list of keywords with some operators surrounding them to provide simple boolean conditions.

Query expressions are in the form of:

term1 & term2              # AND operator
term1 | term2              # OR operator
!term1                     # NOT operator
"term one"                 # multi-word keyword
term1 & ( term2 | term3 )  # Grouping changes precedence

All spaces are optional in query expressions, except for those in multi-word keywords when quoted.

Expression Terms

A term is one of the following:

  • A single keyword, built with letters, numbers and underscores. "." (dot) can be used as a single char wildcard.

  • A sentence of multiple words as a single keyword, enclosed by quotes.

  • A query expression, optionally enclosed by parentheses if precedence matters.

Operators

  • term1 AND term2

    Use AND operator when both terms must be present in the keyword list.

    AND can be written as "&" (ampersand) or "+" (plus), but may be omitted.

    term1 & term2
    +term1 +term2
    term1 term2

  • term1 OR term2

    Use OR operator when at least one of the terms is required in the keyword list.

    OR can be written as "|" (vertical bar) or "," (comma), and cannot be omitted.

    term1 | term2
    term1, term2

  • NOT term

    Use NOT operator when the term must not be present in the keyword list.

    NOT can be written as "!" (exclamation mark) or "-" (minus).

    ! term
    -term

To allow the words "AND", "OR" and "NOT" to be treated as operators, set the texted_ops attribute to 1.

Grouping

Precedence is as usual: NOT has the highest, then AND, and OR has the lowest.

Precedence order in a query expression can be changed with the use of parentheses. For example:

word1 | word2 & word3

is the same as:

word1 | ( word2 & word3 )

but not as:

( word1 | word2 ) & word3

where word3 is required together with either word1 or word2.
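Perl's own !, && and || operators are ranked the same way (NOT, then AND, then OR), so they can be used to mimic the three forms. In this sketch the variables stand for "the word is present in the list"; the values are made up for the illustration.

```perl
use strict;
use warnings;

# Presence of each word in an imaginary keywords list.
my ($word1, $word2, $word3) = (1, 0, 0);

my $implicit = ( $word1 ||   $word2 && $word3   ) ? 1 : 0;
my $same     = ( $word1 || ( $word2 && $word3 ) ) ? 1 : 0;
my $grouped  = ( ( $word1 || $word2 ) && $word3 ) ? 1 : 0;

print "implicit=$implicit same=$same grouped=$grouped\n";
```

With only word1 present, the first two forms are true and the grouped form is false, because word3 is missing.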

It is possible to use NOT for a whole group, so the following two queries mean the same:

+word1 -(word2,word3)
+word1 -word2 -word3
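That equivalence is De Morgan's law, and it can be checked exhaustively with plain Perl booleans. The sketch below does not use the module; it simply compares both forms for every combination of present and absent words.

```perl
use strict;
use warnings;

my $mismatches = 0;
for my $bits (0 .. 7) {
    # Decode the three low bits into presence flags for word1..word3.
    my ($w1, $w2, $w3) = map { ($bits >> $_) & 1 } 0 .. 2;
    my $group_form = ( $w1 && !( $w2 || $w3 ) ) ? 1 : 0;   # +word1 -(word2,word3)
    my $flat_form  = ( $w1 && !$w2 && !$w3 )    ? 1 : 0;   # +word1 -word2 -word3
    $mismatches++ if $group_form != $flat_form;
}
print "mismatches: $mismatches\n";   # mismatches: 0
```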

Expression groups can be nested. Also, "[...]", "{...}" and "<...>" can be used just like "(...)", but there is no validation for balanced parentheses by type, i.e. all of them get translated into the same kind before the validation that detects an orphan one.
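The consequence of that translation can be sketched in a few lines of Perl. The helper brackets_balanced is hypothetical and not part of the module; it unifies the bracket types with tr and then counts depth, which is one plausible reading of the behaviour described above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): unify bracket types,
# then check that nothing is left unmatched.
sub brackets_balanced {
    my ($query) = @_;
    (my $unified = $query) =~ tr/[{<]}>/((()))/;
    my $depth = 0;
    for my $char (split //, $unified) {
        $depth++ if $char eq '(';
        $depth-- if $char eq ')';
        return 0 if $depth < 0;    # closing bracket with no opener
    }
    return $depth == 0 ? 1 : 0;    # leftover openers are orphans too
}

for my $query ('a & [b | c]', 'a & (b | c]', 'a & (b | c') {
    printf "%-12s %s\n", $query, brackets_balanced($query) ? 'balanced' : 'orphan';
}
```

Under this scheme 'a & (b | c]' counts as balanced, since the mismatched types become the same character before the check.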

Tricks

  • If field names and their corresponding values are specified inside a keywords list, it is possible to use a single dot "." to write "key.value" as a single term in a query expression for a better match.

    For example, the following query expressions:

    bar & read.yes      # matches 2 records
    bar & read & yes    # matches 3 records
    bar & "read yes"    # matches 2 records when multi_words=2
                        #  otherwise doesn't match

    from these keywords lists:

    foo, own:yes, read:yes, rating:3
    foo, bar, own:yes, read:yes, rating:1
    foo, bar, baz, own:yes, read:no
    bar, baz, own:no, read:yes, rating:0

  • A query with strings in quotes could match a keyword list if each word is present in the list, side by side in the same order, when multi_words is NOT set to 1.

    Using the previous sample list, the query expressions:

    "foo bar"     # matches 2 records
    "bar foo"     # doesn't match anything

  • Use the OR operator when two or more different conditions satisfy the request. For example, use:

    own.yes (rating.0 | -rating)

    to match owned books that are unrated, whether the rating is 0 or missing. In the sample list only the third record qualifies.

  • You can use this module against a whole document, not only against a keywords list:

    $kw->prepare('"form method post" !captcha');
    print "Unprotected form detected\n" if $kw->test($html_page);
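The record counts quoted in the first trick can be reproduced with hand-written patterns. This sketch uses lookaheads and \b boundaries as stand-ins for the two queries; they are assumptions made for the illustration, not the patterns the module builds.

```perl
use strict;
use warnings;

my @records = (
    'foo, own:yes, read:yes, rating:3',
    'foo, bar, own:yes, read:yes, rating:1',
    'foo, bar, baz, own:yes, read:no',
    'bar, baz, own:no, read:yes, rating:0',
);

# Hand-written stand-ins for the queries (not the module's own patterns).
my $bar_read_dot_yes = qr/\A(?=.*\bbar\b)(?=.*\bread.yes\b)/;          # bar & read.yes
my $bar_read_yes     = qr/\A(?=.*\bbar\b)(?=.*\bread\b)(?=.*\byes\b)/; # bar & read & yes

# grep in scalar context returns the number of matching records.
my $dot_count   = grep { $_ =~ $bar_read_dot_yes } @records;
my $plain_count = grep { $_ =~ $bar_read_yes }     @records;

print "bar & read.yes:   $dot_count\n";
print "bar & read & yes: $plain_count\n";
```

The third record is the one that separates the two queries: it contains bar, read and yes as independent words (the yes comes from own:yes), but never read followed by yes.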

INTERNAL BOOLEAN FORMAT

Queries in the free-style format are parsed and translated into a strict internal format. Note that the space character is not allowed.

The elements of this format are:

  • & (ampersand)

    AND operator. It can't be omitted as in the free-style format. Must be surrounded by (negated) keywords or by parentheses from the outside.

  • | (vertical bar)

    OR operator. Must be surrounded by (negated) keywords or by parentheses from the outside.

  • ! (exclamation mark)

    NOT operator. It can only precede a keyword, not a parenthesis or another NOT operator.

  • ( ) (parentheses)

    Group delimiters. Only keywords and other parentheses can touch them from inside. Nested groups are allowed, empty groups are not.

  • keyword

    A word that matches /\w+/ (letters, numbers or underscore). It can optionally contain wildcards or space placeholders following their own rules.

  • . (dot)

    Single char wildcard. A word can contain multiple wildcards, but starting or ending with one may give unpredictable results on test. Use with care.

  • ^ (caret)

    Space placeholder. Used to join multiple words as a single keyword. This is the internal representation of quoted strings with spaces from the free-style query. It's not allowed to start or finish a keyword with this space placeholder, and consecutive placeholders are also invalid.

Examples:

tom&jerry|sylvester&tweety
moe&(shemp|curly|joe)&larry
popeye&olive&(!bluto&!brutus)
hagar^the^horrible|popeye^the^sailor

Examples of bad queries:

tom&jerry,sylvester&tweety
moe(shemp|curly|joe)larry
popeye&olive&!(bluto|brutus)
^the^
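The rules listed above are compact enough to be written as a recursive pattern. The grammar below is a sketch reconstructed from this section only (the names $keyword, $strict and is_strict_query are made up for the illustration), so the module's real validation may accept or reject cases differently.

```perl
use strict;
use warnings;

# A keyword: word characters or "." wildcards, optionally joined by
# single "^" space placeholders (never leading, trailing or doubled).
my $keyword = qr/[\w.]+(?:\^[\w.]+)*/;

# Assumed grammar, reconstructed from the rules listed above:
# terms joined by & or |, where a term is an optionally negated
# keyword or a parenthesized sub-expression.
my $strict = qr{
    \A
    (?<expr>
        (?<term> !? $keyword | \( (?&expr) \) )
        (?: [&|] (?&term) )*
    )
    \z
}x;

sub is_strict_query { return $_[0] =~ $strict ? 1 : 0 }

my @good = ('tom&jerry|sylvester&tweety', 'moe&(shemp|curly|joe)&larry',
            'popeye&olive&(!bluto&!brutus)', 'hagar^the^horrible|popeye^the^sailor');
my @bad  = ('tom&jerry,sylvester&tweety', 'moe(shemp|curly|joe)larry',
            'popeye&olive&!(bluto|brutus)', '^the^');

printf "%-40s %s\n", $_, is_strict_query($_) ? 'ok' : 'bad' for @good, @bad;
```

Each bad query breaks one rule: a comma instead of "|", a missing "&" next to a group, a NOT applied to a group, and a keyword that starts and ends with the space placeholder.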

KNOWN LIMITATIONS

Currently, only ASCII chars are supported. No UTF-8, no Unicode, no accented vowels, no Kanji... Sorry!

AUTHOR

Victor Parada, <vitoco at cpan.org>

BUGS

Please report any bugs or feature requests to bug-regexp-keywords at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Regexp-Keywords. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

perldoc Regexp::Keywords

ACKNOWLEDGEMENTS

Thanks to the Monks from the Monastery at http://www.perlmonks.org/.

COPYRIGHT & LICENSE

Copyright 2009 Victor Parada.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.