NAME

Lingua::LinkParser::MatchPath - Match paths in linkage diagrams

SYNOPSIS

use Lingua::LinkParser::MatchPath;

$matcher = path_matcher($template); # see below for the tutorial

our $parser = new Lingua::LinkParser;
our $sentence = $parser->create_sentence($text);

if($matcher->match($parser, $sentence)){
   print "COOL!\nHave got ", join( q/ /, $matcher->item), "\n";
}

DESCRIPTION

This module can help check if a linkage path exists in a linkage diagram generated by Lingua::LinkParser, and can help parse English texts.

path_matcher($template)

path_matcher() is auto-exported when importing the module and it creates a state machine according to our template. The template tutorial is included below.

$matcher = path_matcher($template);

match($parser, $sentence)

After matcher gets initiated, we can call match() to see if it can match any path in linkages of our sentences. The $sentence is a sentence object created using create_sentence() provided by Lingua::LinkParser. Lingua::LinkParser::MatchPath is subclass of Lingua::LinkParser, so methods of Lingua::LinkParser can be called directly without re-importing it manually.

$parser = new Lingua::LinkParser;
$matcher->match($parser, $parser->create_sentence($sentence));

item()

item() is reserved to retrieve link labels and words along the path. We can pass arguments specifying which items we would like to get. The index counts from 0.

@item = $matcher->item();         # retrieve all of matched items
@item = $matcher->item(0, 2, 3);  # retrieve 0, 2, 3
@item = $matcher->item(1, 3..5);  # retrieve 1, 3, 4, 5

Please see below for detail.

TEMPLATE TUTORIAL

The remaining part of this document will show us how to use the simple but powerful template language.

EXAMPLE I

Begin in words. End in words.

Given a sentence : 'Gunther sees Rachel.', and here is a linkage diagram generated by link parser.

+-------------Xp------------+
+---Wd---+---Ss--+--Os--+   |
|        |       |      |   |
LEFT-WALL Gunther sees.v Rachel .

And now, the goal is to form a template to match the sentence and to extract the words on the linking path.

If we have a template like this:

Gunther <Ss>  sees  <Os> Rachel

 ^       ^     ^      ^    ^  
 |       |     |      |    |  
WORD    LINK  WORD  LINK  WORD

In this example, the path matcher will first locate the position of Gunther, and check if one of Gunther's linkages contains the label Ss. If Ss exists, the matcher will continue to see if sees is further linked by Ss. The process goes on until matcher reaches full matching or it fails.

Here, this template will match the sentence successfully.

For the definitions of link labels, please go to http://www.link.cs.cmu.edu/link/dict/index.html

EXAMPLE II

If we have two sentences:

Ross bites Monica.

and

Joey bites Monica too.

The diagrams are as follows respectively:

+------------Xp-----------+
+---Wd--+--Ss-+---Os--+   |
|       |     |       |   |
LEFT-WALL Ross bites.v Monica .

+--------------Xp-------------+
|             +-----MVa----+  |
+---Wd--+--Ss-+---Os--+    |  |
|       |     |       |    |  |
LEFT-WALL Joey bites.v Monica too .

There is no need to build two templates:

Ross <Ss> bites <Os> Monica

and

Joey <Ss> bites <Os> Monica

Instead, we can combine them two into one using regexp (regular expression), and it becomes

/Ross|Joey/ <Ss> bites <Os> Monica

Also, we can add a case-insensitive modifier to regexps.

/ROSS|JOEY/i <Ss> bites <Os> Monica

Our regexp fully complies with perl's regexp. For regexp tutorial, please see perlretut.

EXAMPLE III

There is a situation in which we are sure that some words must belong to a certain class of words in order to satisfy the template, and then POS (part-of-speech) tag can be used for that.

Given a linkage like this,

+---------------------------Xp---------------------------+
|              +----I*d---+------Osn------+              |
+---Wd---+--Ss-+--N-+     +--K-+  +---Ds--+---Mp--+-Js+  |
|        |     |    |     |    |  |       |       |   |  |
LEFT-WALL Monica did.v not blow.v up the apartment.n of Ross .

We write a template like this,

/^Monica/ <I*d> blow <Osn> apartment <Mp> of <Js> Ross

and then we will need to duplicate a house of templates for matching and miss many linkages with the same structures

Besides using regexps, we can also use POS tags to generalize our templates in this situation.

/^Monica/ <I*d> _v_ <Osn> _n_ <Mp> of <Js> Ross

or even

/.+?/ <I*d> _v_ <Osn> _n_ <Mp> of <Js> /.+?/

Supported tags are v for verb, a for adjective, d for determiner, p for pronoun, n for noun, etc.

The POS tags attached to words in the above diagram are auto-identified by LinkParser. This POS-tag feature of pathmatcher is only valid with identified classes.

EXAMPLE IV

Regexp can not only be used with words, but with link labels too.

Let's take the above template as an example.

/.+?/ <I*d> _v_ <Osn> _n_ <Mp> of <Js> /.+?/

If we change the template into this one,

/.+?/ </^I/> _v_ </^O/> _n_ </^M/> of </^J/> /.+?/

then the link labels with I, O, M, J as their first characters will be matched.

EXAMPLE V

Here we introduce our defined branching operator, with which we are able to write branching templates. This is designed to match multiple link labels emitted from a word. Otherwise, the pointer will march on to the next word and continue the matching process.

One common situation is negation. Here we use two simple sentences with opposite semantics to illustrate this situation:

someone is here

and

no one is here.

And their diagrams:

+-----------Xp----------+
+---Wd---+--Ss--+-Pp-+  |
|        |      |    |  |
LEFT-WALL someone is.v here .

+-----------Xp-----------+
+-----Wd----+            |
|       +-Ds+-Ss-+-Pp-+  |
|       |   |    |    |  |
LEFT-WALL no.d one is.v here .

They are semantically opposite, but both are fit into a common structure:

/one$/ -> <Ss> -> is -> <Pp> -> here

If we merely write template as

/one$/ <Ss> is <Pp> here

, then both linkages will be matched despite their different semantics. This is not usually what we want. Usually, we hope to seperate these two types of semantics, and that is why we introduce the branch label. With it, this problem is simply solved.

The branch label comes in positive and negative types.

Positive type implies if we have a certain branch emitted from the current word, then matching is successful; negative one implies successful matching if we do NOT have a certain branch emitted from the current word.

Appending a # to the front of a label is indicating the label is tagged as a positive branch, and ! as a negative one.

Then now, we can write down our templates to match the two different semantics.

For the first case, we don't want to see <Ds> no in the diagram

/one$/ !<Ds> no <Ss> is <Pp> here.
/one$/ !( <Ds> no ) <Ss> is <Pp> here.

For the second case, we must see a <Ds> no.

/one$/ #<Ds> no <Ss> is <Pp> here
/one$/ #( <Ds> no ) <Ss> is <Pp> here

Of course, the second template can also be written as

no <Ds> one <Ss> is <Pp> here

, but it loses the flavor of branching operator and deviates from the educational intention.

EXAMPLE VI

Another type of branching, called 'grouping', is introduced here, with which we can write optional paths for a template.

/John/ @( <Ss> _v_ | <AN> _a_ )

In this case, the matcher will first try to match <Ss> and _v_ after successfully matching John. If it fails, it will try <AN> and _a_ later. @ is used for grouping, with which we can group various template paths together into one. If parentheses without anything appended to the front, @ will be appended.

EXAMPLE VIII

Another operator is designed to capture desired words. In one of the above examples,

%/^Monica/ <I*d> _v_ <Osn> _n_ <Mp> of <Js> Ross

if we add % to some of the word templates like

%/^Monica/ <I*d> %_v_ <Osn> %_n_ <Mp> of <Js> Ross

, then call item(). The method will return 'blow' and 'apartment' for this example here. This feature is useful for further processing.

EXAMPLE IX

A finer word capturing can be done using (). In the above example,

%/^Monica/ <I*d> %_v_ <Osn> %_n_ <Mp> of <Js> Ross

If we parenthesize Mon in Monica as

%/^(Mon)ica/ <I*d> %_v_ <Osn> %_n_ <Mp> of <Js> Ross

After a successful matching, we can get Mon calling $matcher->item(0).

THE GRAMMAR

The grammar of the template language is listed, and the full grammar with semantic actions are in etc/Grammar.y

START         -> RULE END_OF_RULE;
RULE          -> WORD_PATTERN LINKS;
LINKS         -> LINK | LINK LINKS | PLINKS |
                 LINKS OR LINKS |
                 # PLINKS LINKS | @ PLINKS LINKS | ! PLINKS LINKS |
                 _EPSILON_;
PLINKS        -> ( LINKS );
LINK          -> LABEL_PATTERN WORD_PATTERN;
LABEL_PATTERN -> LABEL | LABEL_REGULAR_EXPRESSION;
WORD_PATTERN  -> WORD_ATOM | % WORD_ATOM;
WORD_ATOM     -> WORD | WORD_REGULAR_EXPRESSION | POS_TAG |
                 ! WORD | ! WORD_REGULAR_EXPRESSION | ! POS_TAG;

TO DO ...

The module cannot handle isolated linkages yet, but patches are always welcome. I also need to clean up some part of code. Besides, the interface is so bad for now.

SEE ALSO

Lingua::LinkParser

COPYRIGHT AND LICENSE

Copyright (C) 2004 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>

This library is free software; Redistribution and/or modification under the same terms as Perl itself is allowed.