NAME
Lingua::EN::Dict - BETA version of XML English dictionary storage.
SYNOPSIS
use Lingua::EN::Dict;
my $dict = Lingua::EN::Dict->new('words.xml');
my $part_of_speech = $dict->type('abash');
my $verb_tense = $dict->tense('zoomed');
my $flag1 = $dict->is_verb('utilizes');
my $flag2 = $dict->is_verb('utilized');
my $flag3 = $dict->is_verb('utilizing');
my @synonyms = $dict->syns('dictate');
my @antonyms = $dict->opps('valid');
my $definition = $dict->defn('vindicate');
undef $dict;
$dict = Lingua::EN::Dict->new(
server => 'e.tdcj.com',
port => 7778,
);
# defaults to local file 'words.xml' if it
# cannot reach server.
# everything in first paragraph works here too
undef $dict;
$dict = Lingua::EN::Dict->new(
server => 'localhost',
port => 7778,
);
# everything in first paragraph works here too
undef $dict;
$dict = Lingua::EN::Dict->new;
# same as above constructor, defaults to local file
# 'words.xml' if it cannot reach server.
# everything in first paragraph works here too
DESCRIPTION
Note: BETA VERSION.
See main reason for release of this module, three paragraphs down.
- Description
-
This is a small module I came up with to use as a storage format for my humble attempt at a natural language parser (or rather, a parser for a subset of natural language: English). It is a separate module that stores the words in an XML-format file. With the distribution, you should have received an XML file called 'words.xml' that contains almost 3000 words, consisting of several hundred verbs (not counting the separate forms of each verb), as well as several hundred nouns, along with adjectives, articles, and modals.
This module was created for the storage and retrieval of words from the XML file. It parses the XML file with XML::Parser and stores the words in a blessed hash reference, which is returned from the new constructor. This means that after you have loaded the dictionary, assuming you used the default XML file, you can access the properties of the word 'abash' with this:
my $info = $dict->{'abash'};
$info will now contain a hash reference to a structure of information about the word 'abash'. $info will always have at least one key, 'type', which indicates the word's part of speech. What other keys $info contains depends on the type of word 'abash' is. If it is a verb, $info will contain the keys 'third', 'past', 'part', and 'gerund', and possibly 'defn'. If it is a modal, it will have a key named 'modal_type'.
Since this is a beta version, I won't go into too much more detail here. Experiment and enjoy. Look at the default 'words.xml' file for an idea of the structure. Each tag inside a <record></record> pair is stored under that tag's name as a key in $info, with a few exceptions, of course.
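As a rough illustration of the structure described above, the sketch below builds a plain hash by hand instead of loading 'words.xml'. Only the key names come from the description; the sample values, the 'can' entry, its 'modal_type' value, and the word_keys() helper are assumptions made up for this example.

```perl
use strict;
use warnings;

# Hand-built stand-in for a loaded dictionary (values are illustrative only).
my $dict = {
    abash => {
        type   => 'verb',
        third  => 'abashes',
        past   => 'abashed',
        part   => 'abashed',
        gerund => 'abashing',
    },
    can => {
        type       => 'modal',
        modal_type => 'ability',    # hypothetical value
    },
};

# Hypothetical helper: sorted list of the keys stored for one word,
# or an empty list if the word is unknown.
sub word_keys {
    my ($d, $word) = @_;
    my $info = $d->{$word} or return;
    return sort keys %$info;
}

my $info = $dict->{'abash'};
print "abash is a $info->{type}\n";
print "keys: ", join(', ', word_keys($dict, 'abash')), "\n";
```

Checking the 'type' key first and only then reading the type-specific keys avoids surprises when a word turns out not to be a verb.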
- Reason for Release
-
The main reason for the beta release of this module is this: I would like any and all feedback on the TCP server setup that I have added to this module.
I often got fed up with having to wait 20 - 40 seconds for the new() constructor to load and parse the entire 590k of words just to run a simple 2 line test script. And since I like to tweak and run, tweak and run (the life of a Perl programmer, eh? :-), it was really annoying to have to wait 30 seconds for each test to run, when the actual test script took less than 50ms to run. Sooooo... I added a simple TCP transfer setup for the dictionary.
To invoke a server process for the dictionary, simply use this one-liner:
% perl -MLingua::EN::Dict -e daemon
daemon() is a function automatically exported by this module for just this purpose. It binds a TCP server to port 7778, accepts input from any IP address, and loads the file 'words.xml' into a dictionary object for serving.
To create a client for this server, simply use:
my $dict = new Lingua::EN::Dict;
This automatically tries to connect to the server on port 7778 of 'localhost'. If it cannot connect to the server, it emits a warning and proceeds to try to load the default file 'words.xml'.
There are other options for the new constructor, as well as other options to the daemon() function. For both, see below.
The reason I released this beta version was to get input from those of you who might have some idea of how to make sure I don't leave any security holes in the TCP server portion. For example, this is possible:
my $dict = new Lingua::EN::Dict( server => 'remote.server.system.com', port => 7778 );
The new constructor allows you to specify the server name and port to connect to. This would allow a central dictionary server to be set up on some server (I have a server I am setting up for that purpose), which would allow other users of this module to access a much larger database that can be updated by many people, instead of each user having to maintain his own copy of the words database. (Yes, I know I'll need to add forking to the daemon(), but this is just development, so I'll keep it simple for now. I'll add forking in a future release.)
Since it does allow remote users the ability to TCP into the daemon, I know that security checks need to be added to plug any potential holes. What I don't know is exactly what holes exist and how to plug them.
HELP, PLEASE! Anyone who does know anything about TCP or security of such, please take a look at the daemon() function code and let me know anything that I need to do to make it secure.
CONSTRUCTOR
my $dict = Lingua::EN::Dict->new;
This is the default, most basic way to create a dictionary object. It will automatically attempt to connect to the server on localhost:7778. If it doesn't find a server that gives the expected response on that port, it will emit a warning and try to load and parse 'words.xml', or whatever file was passed in the 'file' tag (below). If it cannot load that file, or if the file doesn't exist, no warning is given. The constructor returns a blessed reference to a new dictionary object.
You can turn off the warning about the inability to connect to the server by simply defining the option tag 'warn_off'.
my $dict = Lingua::EN::Dict->new( warn_off => 1 );
This will let it just silently try the server without emitting a warning if it cannot find the server.
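The server-then-file fallback described above can be sketched roughly as follows. This is not the module's own code: choose_source() is a hypothetical helper, the 2 second timeout is an arbitrary choice, and the sketch only checks that a TCP connection can be opened, whereas the module also checks for an expected response.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: decide where the dictionary would be loaded from.
# Returns 'server' if a TCP connection succeeds, otherwise 'file'.
sub choose_source {
    my (%opt) = @_;
    my $host = defined $opt{server} ? $opt{server} : 'localhost';
    my $port = defined $opt{port}   ? $opt{port}   : 7778;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => 2,
    );
    if ($sock) {
        close $sock;
        return 'server';
    }

    # Stay silent when the caller has defined warn_off.
    warn "cannot reach $host:$port, falling back to local file\n"
        unless defined $opt{warn_off};
    return 'file';
}

print choose_source(warn_off => 1), "\n";
```

The result depends on whether something is listening on localhost:7778 when the script runs.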
You can also tell it to explictly load from local disk without trying to connect to the server by explictly passing a filename to the constructor as in this example:
my $dict = Lingua::EN::Dict->new( file => 'words.xml' );
You may also tell it explicitly what server to try to connect to with the 'server' and 'port' tags.
my $dict = Lingua::EN::Dict->new(
server => 'remote.server.com',
port => 7778
);
You can also tell it explicitly what file to load if it cannot reach the server with the 'file' tag.
my $dict = Lingua::EN::Dict->new(
server => 'remote.server.com',
port => 7778,
file => 'mydictionary.xml',
);
Quick example of how to download an entire database and save it locally:
% perl -MLingua::EN::Dict -e 'Lingua::EN::Dict->new(server=>"remote.server",
port=>7778)->sync("down",1)->save("somefile.xml") or die "Error in sync/save."'
This tries to log in to the specified server and download the entire database from that server (with the sync() command), printing out its progress as it goes. Then it writes the entire database in XML format to the specified local file, dying with an error message if the sync or save fails.
METHODS
- $dict->save( [ $file ] );
-
This writes out the dictionary in XML format to the file it was loaded from, or to $file if specified. It only writes out IF (a) the dictionary has been modified by add() or link(), or (b) it has had sync("down",...) called on it. If it does not write out, it will return undef, otherwise it returns $dict.
If $file is NOT specified, AND you are connected to a dictionary server, save() will loop through the internal array reference at $dict->{_}->{'.modified'}, where each element contains the name of a word that was modified with link() or add(). It sends the info for each modified word to the server and then sends a DICT-WRITEOUT command to the server to write the database at the server end to disk.
- $dict->types($type);
-
This returns an array reference to all the words that have the same type as $type. Note that each element of the array is a scalar value containing the name of the word, NOT a hash reference to the info.
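A minimal sketch of that return shape, using a hand-built hash rather than the real dictionary object. The sample words and the words_of_type() helper are assumptions for illustration, not the module's implementation.

```perl
use strict;
use warnings;

# Hand-built sample data (illustrative only).
my %dict = (
    abash => { type => 'verb' },
    cat   => { type => 'noun' },
    dog   => { type => 'noun' },
);

# Hypothetical helper: array reference of word names (not info hashes)
# whose 'type' entry equals $type.
sub words_of_type {
    my ($d, $type) = @_;
    my @names = grep {
        ref($d->{$_}) eq 'HASH'
            && defined $d->{$_}{type}
            && $d->{$_}{type} eq $type
    } sort keys %$d;
    return \@names;
}

my $nouns = words_of_type(\%dict, 'noun');
print "@$nouns\n";
```

Because the elements are names, the info for each word still has to be looked up separately.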
- $dict->type($word);
-
This returns the 'type' entry for word $word. It is almost the same as:
$type = $dict->{$word}->{'type'};
Why don't you want to use that code? Well, type() automatically checks whether you are using a server connection and whether the word has been retrieved from the server yet. If the word hasn't been retrieved, it automatically retrieves the word and caches it locally, THEN it returns the type. If you were to run the above code example on a word that hadn't been retrieved yet, you would simply get undef.
See also retrieve().
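The retrieve-then-cache behaviour can be pictured with the following sketch. Everything in it is a stand-in: %remote plays the server, fetch_from_server() replaces the real TCP request, and lazy_type() is a hypothetical name, not the module's type() method.

```perl
use strict;
use warnings;

# Stand-in for the remote dictionary; a real client would ask the
# server over TCP instead of reading this hash.
my %remote  = ( abash => { type => 'verb' } );
my $fetches = 0;

sub fetch_from_server {
    my ($word) = @_;
    $fetches++;
    return $remote{$word};
}

my %cache;    # plays the role of the local $dict hash

# Return the type of $word, fetching and caching the record on first use.
sub lazy_type {
    my ($word) = @_;
    $cache{$word} = fetch_from_server($word) unless exists $cache{$word};
    my $info = $cache{$word} or return undef;
    return $info->{type};
}

print lazy_type('abash'), "\n";
print lazy_type('abash'), "\n";
print "server requests: $fetches\n";
```

The second lookup is answered from the cache, which is why reading the hash directly only works for words that have already been fetched.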
- $dict->is($word,$type);
-
Compares the type of $word with $type, returning 1 if $word is of type $type, undef otherwise. This also automatically retrieves and caches words from the server when a server connection is in use. See the explanation for type() above.
- $dict->add($word,$type);
-
This will add $word to the dictionary with type $type. If $type is 'verb', add() will attempt to conjugate the verb into infinitive, past, present, part, and gerund parts. add() will return $dict upon completion and add $word to the internal 'modified' list.
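How add() conjugates verbs is not documented here, so the following is only a naive sketch of what conjugating a regular verb could look like. conjugate_regular() is a hypothetical helper; it ignores irregular verbs and consonant doubling ("stop" becomes "stopped") and should not be read as the module's algorithm.

```perl
use strict;
use warnings;

# Naive conjugation of a regular English verb into the keys used for
# verb records. Irregular verbs and consonant doubling are not handled.
sub conjugate_regular {
    my ($verb) = @_;
    my ($third, $past, $gerund);

    if    ($verb =~ /(?:s|x|z|ch|sh)$/) { $third = $verb . 'es' }
    elsif ($verb =~ /[^aeiou]y$/)       { ($third = $verb) =~ s/y$/ies/ }
    else                                { $third = $verb . 's' }

    if    ($verb =~ /e$/)               { $past = $verb . 'd' }
    elsif ($verb =~ /[^aeiou]y$/)       { ($past = $verb) =~ s/y$/ied/ }
    else                                { $past = $verb . 'ed' }

    if ($verb =~ /[^e]e$/)              { ($gerund = $verb) =~ s/e$/ing/ }
    else                                { $gerund = $verb . 'ing' }

    return {
        type   => 'verb',
        third  => $third,
        past   => $past,
        part   => $past,    # regular verbs share past and past participle
        gerund => $gerund,
    };
}

my $info = conjugate_regular('abash');
print join(' ', map { "$_=$info->{$_}" } qw(third past part gerund)), "\n";
```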
- $dict->link($from, $to [,$rel]);
-
link() adds a relation link which appears as a <link> tag in the XML file. link() adds a <link> entry in word $from to the word $to. $rel is an optional variable specifying the 'relation' attribute of the link. If $rel is not specified, $rel defaults to 'synonym'.
- $dict->retrieve($word);
-
This attempts to retrieve the word information for $word from the dictionary server and cache it in $dict, if connected. If it is not connected, it will return undef, otherwise it will return a hash reference to the info for the word.
- $dict->sync( [ $dir, $verbose ] );
-
If sync() is called without any arguments, $dir defaults to "down". The valid values for $dir are "down" or "up". If $verbose is defined (non-zero) it will print percentage done followed by "\r".
If $dir eq "down", sync() will query the server for a list of all the words in the database and then it will call retrieve() on each of the words in the list.
If $dir eq "up", sync() will loop through the internally stored "modified" list of words that were changed by add() or link(), upload those words to the server, and then send a DICT-WRITEOUT command to the server.
sync() is used internally by $dict->save().
If $verbose is defined, it prints a string in the form of: 50% (50 of 100 word) With appropriate values substituted for the numbers, of course.
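The verbose progress output could be produced along these lines. The format string is copied from the description above; progress_line(), the rounding with int(), and the placeholder word list are assumptions of this sketch.

```perl
use strict;
use warnings;

# Build one progress line in the format quoted above.
sub progress_line {
    my ($done, $total) = @_;
    my $pct = $total ? int(100 * $done / $total) : 100;
    return sprintf('%d%% (%d of %d word)', $pct, $done, $total);
}

$| = 1;    # unbuffered output so each update appears immediately

my @words = ('a' .. 'j');    # placeholder word list
my $total = scalar @words;
for my $i (1 .. $total) {
    # "\r" returns the cursor to the start of the line, so each update
    # overwrites the previous one instead of scrolling.
    print progress_line($i, $total), "\r";
}
print "\n";
```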
- $dict->tense($word);
-
This searches the verbs in the dictionary, comparing $word to the known tenses of each verb (Infinitive, Past, Past Participle, Third Person, Gerund). If $word matches any of the tenses, tense() will return the name of the tense it matched ("infinitive", "past", "part", or "gerund"). If $word doesn't match any of the verbs in the dictionary, tense() returns undef.
This automatically detects whether you are using the dictionary connected to a dictionary server. If you are, then it sends the tense() request to be run at the server. The server will scan the dictionary file and send the results back. All this is done transparently, so tense() will run the same whether you are connected to a server or not.
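The local search can be sketched as a scan over the verb records. The sample data and find_tense() are hypothetical, and returning "third" for the third-person form is an assumption based on the record key names rather than on the list of return values given above.

```perl
use strict;
use warnings;

# Hand-built sample in the record shape used for verbs (illustrative only).
my %dict = (
    zoom => {
        type => 'verb', third  => 'zooms',  past => 'zoomed',
        part => 'zoomed', gerund => 'zooming',
    },
    cat => { type => 'noun' },
);

# Hypothetical helper: name of the first verb form equal to $word, or undef.
sub find_tense {
    my ($d, $word) = @_;
    for my $base (sort keys %$d) {
        my $info = $d->{$base};
        next unless defined $info->{type} && $info->{type} eq 'verb';
        return 'infinitive' if $base eq $word;
        for my $form (qw(third past part gerund)) {
            return $form
                if defined $info->{$form} && $info->{$form} eq $word;
        }
    }
    return undef;
}

my $tense = find_tense(\%dict, 'zoomed');
print defined $tense ? $tense : 'no match', "\n";
```

Since "zoomed" is both the past and the past participle here, the order in which forms are checked decides which name comes back.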
- $dict->syns($word);
- $dict->opps($word);
-
Both of these functions return arrays, not array refs. syns() finds all the synonyms for the $word. opps() finds all the antonyms for $word.
You can add more synonyms or antonyms with the link() method. Example:
$dict->link('cat', 'dog', 'opposite');
$dict->link('scream', 'yell', 'synonym');
See the link() method for more information on syntax.
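One way to picture how links feed syns() and opps() is the sketch below. The in-memory layout (a list of { to, relation } entries per word) and the add_link() and related() helpers are assumptions for illustration; only the default relation 'synonym' and the relation names used in the example above come from this document.

```perl
use strict;
use warnings;

# Hypothetical in-memory layout: each word keeps a list of
# { to => ..., relation => ... } entries, one per <link> tag.
my %links;

sub add_link {
    my ($from, $to, $rel) = @_;
    $rel = 'synonym' unless defined $rel;    # documented default
    push @{ $links{$from} }, { to => $to, relation => $rel };
    return;
}

# Return a list (not an array reference) of link targets with one relation.
sub related {
    my ($word, $rel) = @_;
    return map  { $_->{to} }
           grep { $_->{relation} eq $rel }
           @{ $links{$word} || [] };
}

add_link('scream', 'yell');
add_link('scream', 'shout', 'synonym');
add_link('cat', 'dog', 'opposite');

print join(', ', related('scream', 'synonym')), "\n";
print join(', ', related('cat', 'opposite')), "\n";
```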
- $dict->defn($word);
-
Returns a scalar containing the definition entry for word $word. Note: Words may not have a definition entry.
- $dict->is_verb($word);
- $dict->is_past($word);
- $dict->is_part($word);
- $dict->is_third($word);
- $dict->is_gerund($word);
-
These five functions test verbs for the specified part. Ex. is_gerund($word) tests if $word is in Gerund form. Uses tense() internally. Returns undef on failure, otherwise returns a defined value for truth.
- $dict->is_noun($word);
- $dict->is_adv($word);
- $dict->is_pnoun($word);
- $dict->is_art($word);
- $dict->is_adj($word);
- $dict->is_conj($word);
- $dict->is_prep($word);
-
These test word $word for the type indicated by the function name. The abbreviations are as follows:
noun => noun
adv => adverb
pnoun => pronoun
art => article
adj => adjective
conj => conjunction
prep => preposition
Returns undef for false, otherwise returns a defined value for true.
daemon( [ $file,$port,$addr ] );
This is an automatically exported function. When called with no arguments, it binds to port 7778 and loads file 'words.xml' in a dictionary object to be served. It, by default, accepts requests from any IP address. To bind to a specific IP, use: daemon(undef,undef,'206.70.2.13');
Example usage:
% perl -MLingua::EN::Dict -e daemon
This starts a server process on port 7778, accepting requests from any IP. To specify a different file to load, use:
% perl -MLingua::EN::Dict -e 'daemon("myfile.xml")'
Or, if you have multiple network adapters, you can dedicate one of them to the server (at least, that's how my development box is set up: I use one network card for my dictionary server, one for my main inet connection, and one for my LAN). Anyway, to specify port and IP, as well as file, do:
% perl -MLingua::EN::Dict -e 'daemon("myfile.xml",3000,"206.70.2.13")'
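For readers unfamiliar with what binding to a port and an address involves, here is a minimal sketch using the core IO::Socket::INET module. It is not taken from daemon(): open_listener() is a hypothetical helper, the backlog of 5 is arbitrary, and the sketch only opens and closes a listening socket without serving anything.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Hypothetical helper: open a listening TCP socket.
# $addr undef -> listen on every local interface
# $addr given -> listen only on that interface
sub open_listener {
    my ($port, $addr) = @_;
    my %args = (
        LocalPort => $port,
        Proto     => 'tcp',
        Listen    => 5,
        ReuseAddr => 1,
    );
    $args{LocalAddr} = $addr if defined $addr;
    return IO::Socket::INET->new(%args);
}

# Port 0 asks the operating system for any free port, which keeps this
# example from clashing with a real dictionary server on 7778.
my $listener = open_listener(0, '127.0.0.1') or die "cannot listen: $!";
print "listening on ", $listener->sockhost, ":", $listener->sockport, "\n";
close $listener;
```

Binding to a single address, as in the last daemon() example, means connections arriving on other interfaces are never accepted by this socket.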
This function is what I need feedback on. Please take a look at the source to see if there are any security holes that need to be patched, or any other security-related problems. Thank you ahead of time!
EXPORT
daemon();
AUTHOR
Josiah Bryan <jdb@wcoil.com>
Copyright (c) 2000 Josiah Bryan. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The Lingua::EN::Dict and related modules are free software. THEY COME WITHOUT WARRANTY OF ANY KIND.
DOWNLOAD
You can always download the latest copy of Lingua::EN::Dict from http://www.josiah.countystart.com/modules/get.pl?dict:pod
SEE ALSO
Nothing to see also here... move along now.