NAME
nsp2regex.pl - Convert Text-NSP output into a file of regular expressions
SYNOPSIS
Takes n-word sequences (n-grams) and represents them as regular expressions. These can then be used to identify lexical features in given data, and to convert lexical element files from text into feature vectors.
USAGE
nsp2regex.pl [OPTIONS] SOURCE [[, SOURCE] ...]
INPUT
Required Arguments:
SOURCE
The SOURCE is a file containing the list of features. Each feature is required to be in a specific format, with every token followed by <>:
the_feature_token<>
For example:
Unigram feature: temperature<>
Bigram feature: daily<>temperature<>
Output created by count.pl or statistic.pl (both part of the Ngram Statistics Package) can be used directly as the SOURCE file.
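As a rough illustration of this format, the following Python sketch splits a SOURCE line into its tokens. The helper name parse_feature is hypothetical, and the sketch assumes that anything after the final <> (such as counts or scores appended by count.pl or statistic.pl) can simply be dropped. nsp2regex.pl itself is a Perl program and may treat its input differently.

```python
def parse_feature(line):
    """Hypothetical helper: return the tokens of one SOURCE line.

    Assumption: every token is followed by "<>", and whatever follows the
    final "<>" (for example counts or scores) is not part of the feature.
    """
    parts = line.strip().split("<>")
    # The last piece is either empty or trailing data, so it is discarded.
    # A line with no "<>" at all (a total count, a directive) yields [].
    return parts[:-1]


for source_line in ["temperature<>", "daily<>temperature<>"]:
    print(source_line, "->", parse_feature(source_line))
```

A line that yields an empty list carries no feature and would be skipped by a caller of this sketch.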
Optional Arguments
--token FILE
Uses the token definitions contained in FILE to create the separator between tokens. This is needed when the window size of the n-grams in SOURCE is greater than the 'n' of the n-gram. Window sizes for n-grams in SOURCE can be defined using the --extended option in count.pl.
--version
Prints the version number.
--help
Prints this help message.
OUTPUT
Outputs the generated regular expressions to stdout.
Explanation of the created Regular Expressions
Default Regular Expression (without Skipping Intermediate Tokens):
By default, nsp2regex.pl creates regexes that match space-separated tokens. These regular expressions are based on the assumption that the text on which they are going to be used has tokens separated by a single space. Further, the regular expressions thus created ignore XML tags and non-tokens, as described in the examples below.
For example, the following line in the input to nsp2regex.pl: a<>bigram<>
is converted to the following regex: /\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram
In this output, everything from the first / to the last / constitutes the regular expression. The portion "@name = a<>bigram" is used by xml2arff.pl (from the SenseTools package) to name the attribute corresponding to this regular expression.
What This Regular Expression will Match:
This regular expression defines a feature that will match the tokens "a" and "bigram" under the following conditions:
i> Tokens "a" and "bigram" have exactly one space to their left and right. For example, this regex will match the sentence " this is a bigram ". This regex will not match the sentence " i wanna bigram " nor the sentence " i have a bigrams ". It will not even match " I have a  bigram " (two spaces between "a" and "bigram"). This is because nsp2regex.pl creates regular expressions that assume that there is exactly ONE space character between tokens!
ii> Tokens "a" and "bigram" are bounded by one or more xml tags or non-tokens, that is a sequence of characters that start with '<' and end with '>'. eg: this regex will match the sentence : " this is a <head>bigram</head> ". This regex will also match " this is a <head>bigram<senseid=20/></head> ".
iii> tokens "a" and "bigram" are separated by one or more space separated xml tags. eg: this regex will match the sentence " this is a <,> bigram ". It will also match " this is a <,> bigram <!> " and " this is a <,> <head>bigram</head> ".
iv> combinations of the above cases.
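The cases above can be checked mechanically. The following Python sketch copies the regular expression from the output line shown earlier and tries it against the example sentences. Python's re module is used here only for convenience (the pattern uses no Perl-specific syntax), and the helper name matches is hypothetical.

```python
import re

# Regex copied from the nsp2regex.pl output for the feature a<>bigram<>.
BIGRAM_RE = re.compile(
    r"\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s"
)


def matches(sentence):
    """Return True if the feature regex is found anywhere in the sentence."""
    return BIGRAM_RE.search(sentence) is not None


should_match = [
    " this is a bigram ",
    " this is a <head>bigram</head> ",
    " this is a <head>bigram<senseid=20/></head> ",
    " this is a <,> bigram ",
    " this is a <,> bigram <!> ",
    " this is a <,> <head>bigram</head> ",
]

should_not_match = [
    " i wanna bigram ",
    " i have a bigrams ",
    " I have a  bigram ",  # two spaces between "a" and "bigram"
]

for s in should_match + should_not_match:
    print(repr(s), matches(s))
```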
Explanation of this Regular Expression:
Following is an explanation of the various parts of the regular expression:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram
a> All the portion between the first '/' and the last '/' is the regular expression.
b> The regular expression starts with requiring a single space character, \s. This is consistent with the assumption that every token has exactly one space to its left and one to its right.
c> The next chunk is (<[^>]*>)*a(<[^>]*>)* Note that the portion (<[^>]*>) represents exactly our definition of an XML tag, namely that it should start with a '<', have 0 or more characters, except the '>' character, and then end with the '>' character. The '*' outside the bracket denotes that we are willing to match 0 or more such tags. After that, we wish to match a single occurrence of the first token, 'a', again followed by 0 or more tags. Note that the tags are "stuck" to the token 'a', in that there is no space between the tag and the token 'a'. Of course if in the text there is a space between an XML tag and 'a', then the space would match the space in <b> above.
d> Having matched token 'a' with 0 or more tags "stuck" to its right and left, we now wish to match exactly a single space character through the \s. Again this corresponds to our assumption that tokens in the text are separated by exactly one space character!
e> The next chunk (<[^>]*>\s)* is again our familiar XML tag. This time we wish to "skip" over 0 or more occurrences of any XML tag that lie between the first and the second token, ie between 'a' and 'bigram'. Since these are not "stuck" to the next token 'bigram', they are space separated from each other and from 'bigram'. Hence, for every such tag we match, we also match a space character!
f> The next chunk is (<[^>]*>)*bigram(<[^>]*>)* which is exactly like the chunk for 'a' in point <c> above.
g> Finally we wish to match a single space character \s.
h> The portion after the last '/', @name = a<>bigram, creates a "name" for this feature. This name is used by xml2arff.pl (from the SenseTools package) when creating the vector output of the input XML file. While this name is not necessary, it makes the vector output more human-readable.
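The pieces described in points a> to h> can be assembled programmatically. The following Python sketch is one possible way to do so, not the actual Perl implementation. The names TAG, SPACED_TAGS, build_default_regex and feature_line are hypothetical, tokens are assumed to contain no regular expression metacharacters, and applying the same pattern to n-grams other than bigrams is an assumption.

```python
import re

TAG = r"(<[^>]*>)*"            # zero or more tags "stuck" to a token
SPACED_TAGS = r"(<[^>]*>\s)*"  # zero or more tags, each followed by a space


def build_default_regex(tokens):
    """Hypothetical helper: default regex (no skipped tokens) for the tokens."""
    # Each token may have tags stuck to its left and right (points c and f).
    chunks = [TAG + tok + TAG for tok in tokens]
    # Tokens are joined by one space plus optional spaced tags (points d and e).
    body = (r"\s" + SPACED_TAGS).join(chunks)
    # The whole pattern starts and ends with a single space (points b and g).
    return r"\s" + body + r"\s"


def feature_line(tokens):
    """Hypothetical helper: regex between slashes, then the @name part (point h)."""
    return "/" + build_default_regex(tokens) + "/ @name = " + "<>".join(tokens)


print(feature_line(["a", "bigram"]))
print(bool(re.search(build_default_regex(["a", "bigram"]), " this is a bigram ")))
```

For the tokens 'a' and 'bigram', this sketch reproduces the output line shown at the start of this explanation.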
Regular Expression with Skipping of Intermediate Tokens:
nsp2regex.pl can create regular expressions that ignore one or more tokens that occur between the tokens to be matched. This can be switched "ON" by having the directive "@count.WindowSize=..." in the input file to nsp2regex.pl. We need to provide nsp2regex.pl with the same token file we provide preprocess.pl... say following is the token file:
/<head>\w+<\/head>/
/\w+/
Let the input file to the nsp2regex.pl program be the following:
@count.WindowSize=3
a<>bigram<>
then, the output regular expression from nsp2regex.pl is:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram<>1
What This Regular Expression will Match:
This regular expression will match the tokens "a" and "bigram" separated by 0 or 1 occurrences of the whitespace-separated token ((<head>\w+<\/head>)|(\w+)). These are the token definitions obtained from the token file above!
For example, this regular expression will match the following sentences:
" this is a funny bigram "
" this is a bigram "
" this is a <head>nice</head> bigram "
" this is a <,> bigram "
" this is a <,> <head>nice</head> bigram "
This regular expression will not match:
" this is a really big bigram "
" i wanna write bigram "
" this is a , bigram "
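These examples can be verified with a short Python sketch. The pattern is copied from the nsp2regex.pl output shown above for a window size of 3. Python's re module stands in for Perl here, and the helper name window_matches is hypothetical.

```python
import re

# Regex copied from the nsp2regex.pl output for a<>bigram<> with window size 3.
WINDOW_RE = re.compile(
    r"\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*"
    r"((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}"
    r"(<[^>]*>)*bigram(<[^>]*>)*\s"
)


def window_matches(sentence):
    """Return True if the windowed feature regex is found in the sentence."""
    return WINDOW_RE.search(sentence) is not None


will_match = [
    " this is a funny bigram ",
    " this is a bigram ",
    " this is a <head>nice</head> bigram ",
    " this is a <,> bigram ",
    " this is a <,> <head>nice</head> bigram ",
]

will_not_match = [
    " this is a really big bigram ",  # two tokens in between, only one allowed
    " i wanna write bigram ",         # no separate token "a"
    " this is a , bigram ",           # "," is neither a token nor a <...> tag
]

for s in will_match + will_not_match:
    print(repr(s), window_matches(s))
```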
Explanation of this Regular Expression:
Following is a description of various parts of the regular expression:
/\s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)*((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}(<[^>]*>)*bigram(<[^>]*>)*\s/ @name = a<>bigram<>1
On careful observation one will notice that the above regular expression differs from the default regular expression (the one without skipping of intermediate tokens) in only one portion.
Specifically the portion \s(<[^>]*>)*a(<[^>]*>)*\s(<[^>]*>\s)* is the same as above... this matches a space, followed by 'a' with XML tags or non-token characters (within <> brackets) stuck to its left and right, followed by a single space, followed by 0 or more XML tags and non-token characters, with a space after every such tag.
Further note that the portion (<[^>]*>)*bigram(<[^>]*>)*\s is again the same as before... they match 'bigram' with XML tags and non-token character tags stuck to its left and right, followed by a single space.
Thus the only "new" portion in this regex is ((<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*\s(<[^>]*>\s)*){0,1}
We call this the "separator" portion of the regex; this is the portion that allows for the "ignoring" of up to one token between the tokens 'a' and 'bigram'. This token can be either a <head>\w+</head> or a \w+.
a> Observe that the entire section is within a pair of round brackets, followed by a {0,1}. This says that this portion is allowed to occur 0 or 1 times. This is consistent with the window size of 3... besides 'a' and 'bigram', we allow at most one other token to come into the window. If our window size were to be 10 say, this would be {0,8}.
b> The first part inside this bracketed portion is (<[^>]*>)*((<head>\w+<\/head>)|(\w+))(<[^>]*>)*. This says that we are willing to match either a <head>\w+</head> or a \w+. Further, whatever we match can be preceded or followed by XML tags or non-token characters enclosed within the angle brackets <>.
c> Having matched either of the two options, we wish to match a single space, \s, followed by zero or more XML tags or non-tokens, each followed by a space, in keeping with our desire to skip these tags!
d> And, as mentioned in <a> above, we would like to do this matching at most once, that is there will be at most one such token between 'a' and 'bigram'.
e> The name of the feature has also changed to @name = a<>bigram<>1 implying that we are allowing at most one token to come in between our two main tokens!
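The construction of the separator can be sketched in Python as follows. This is an illustration under stated assumptions, not the Perl code of nsp2regex.pl. The name build_separator is hypothetical, the token definitions are passed without their surrounding slashes, and the repetition bound is computed as the window size minus 2, which covers only the bigram case described above.

```python
TAG = r"(<[^>]*>)*"            # zero or more tags "stuck" to a token
SPACED_TAGS = r"(<[^>]*>\s)*"  # zero or more tags, each followed by a space


def build_separator(token_defs, window_size):
    """Hypothetical helper: separator that skips intermediate tokens (bigrams only)."""
    # Any one of the token definitions may match the skipped token.
    alternation = "(" + "|".join("(" + t + ")" for t in token_defs) + ")"
    # Besides the two tokens of the bigram, window_size - 2 more may appear.
    max_skipped = window_size - 2
    one_skipped_token = TAG + alternation + TAG + r"\s" + SPACED_TAGS
    return "(" + one_skipped_token + "){0," + str(max_skipped) + "}"


# Token definitions from the token file above, without the enclosing slashes.
token_defs = [r"<head>\w+<\/head>", r"\w+"]
print(build_separator(token_defs, 3))
```

With a window size of 3 the sketch yields the separator quoted above, and with a window size of 10 the bound becomes {0,8}.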
A Fine Point about nsp2regex.pl:
Fine Point 1: Certain characters, like '.', '*', '?' etc have special meaning when used within a regular expression. If these characters occur in the tokens that the regular expression is being built from, they are "escaped" (by prepending a backslash '\' to them). Following is a list of characters that are so escaped: '\', '/', '|', '(', ')', '[', ']', '{', '}', '^', '$', '*', '+', '?' and '.'
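A minimal Python sketch of this escaping step follows. The names ESCAPED and escape_token are hypothetical. The character set is taken from the list above, and the actual Perl code may implement the step differently.

```python
# The fifteen characters listed above: \ / | ( ) [ ] { } ^ $ * + ? .
ESCAPED = set("\\/|()[]{}^$*+?.")


def escape_token(token):
    """Hypothetical helper: prefix each special character with a backslash."""
    return "".join("\\" + ch if ch in ESCAPED else ch for ch in token)


for tok in ["bigram", "U.S.", "a/b", "what?"]:
    print(tok, "->", escape_token(tok))
```

Python's built-in re.escape is deliberately not used here, because it escapes a different set of characters than the one listed above.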
AUTHOR
Satanjeev Banerjee, Carnegie Mellon University, Pittsburgh.
Ted Pedersen, University of Minnesota, Duluth.
COPYRIGHT
Copyright (c) 2001-2005,
Satanjeev Banerjee, Carnegie Mellon University, Pittsburgh.
satanjeev@cmu.edu
Ted Pedersen, University of Minnesota, Duluth.
tpederse@umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.