NAME
TM::Corpus::Document - Topic Maps, Document
SYNOPSIS
use TM::Corpus::Document;
my $d = new TM::Corpus::Document ({ mime => 'text/plain',
val => 'this is some text' });
# accessors
$val = $d->val ('new text');
$mime = $d->mime ('new/mime');
$url = $d->ref ('http://somewhere/some.txt');
my @tokens = $d->tokenize; # leaving defaults
# using some predefined tokenizing steps, in this order
my @tokens = $d->tokenize (tokenizers => 'NUMBER QUOTER COM&BO');
# using negative ones (i.e. throw things away)
my @tokens = $d->tokenize (tokenizers => 'COM&BO COM-BO -INTERPUNCT');
# using filters (detect numbers and throw them away)
my @tokens = $d->tokenize (tokenizers => 'NUMBER !NUMBER');
# get also debugging output
my @tokens = $d->tokenize (tokenizers => 'NUMBER TAP !NUMBER TAP');
# define your own filters
$TM::Corpus::Document::FILTERS{'!4LETTER'} =
sub { $_ = shift; return length($_) == 4 ? '' : $_; };
my @tokens = $d->tokenize (tokenizers => 'WORDER !4LETTER');
# collect features, here single tokens and two subsequent tokens
my %features = $d->features (tokenizers => '...',
featurizers => 'TOKEN1 TOKEN2');
ABSTRACT
This package implements documents, i.e. document-pertinent information such as the content, the corresponding MIME type and, if the document has one, a reference (URL) to it.
Most notable is the functionality to find tokens (i.e. word-like substrings) in the content and to derive from these a feature vector for the document.
DESCRIPTION
INTERFACE
Constructor
The constructor expects a hash reference with one or more of the following fields:
ref
A URI string referring to the network address of the document. In Topic Maps parlance this will be the subject locator for the document topic.
val
The character stream associated with the document.
mime
The MIME type of the content.
Methods
ref
Accessor for the ref component of the document. Nothing happens with the other components.
val
Accessor for the val component of the document. Nothing happens with the other components.
mime
Accessor for the mime component of the document. Nothing happens with the other components.
tokenize
This method returns a list reference to recognized tokens.
To generate this, the method will first find an extractor according to the document's MIME type. That will extract text, but also relevant meta data, such as title, length, etc. Some extractors are predefined; you can get a list with
perl -MTM::Corpus::Document -e 'warn join ",", keys %TM::Corpus::Document::EXTRACTORS;'
The extractor can also be overridden:
$d->tokenize (extractor => sub { ... });
It gets the value (content) as first parameter.
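As an illustration of that contract, here is a minimal, self-contained sketch of such an extractor sub (the tag-stripping logic is hypothetical, not the module's built-in behavior): it receives the raw content and returns the extracted text.

```perl
use strict;
use warnings;

# Hypothetical extractor sketch: strip HTML-ish tags and normalize whitespace.
my $extractor = sub {
    my $content = shift;         # raw document content as first parameter
    $content =~ s/<[^>]*>//g;    # drop anything that looks like a tag
    $content =~ s/\s+/ /g;       # collapse runs of whitespace
    return $content;
};

print $extractor->('<p>this is <b>some</b> text</p>'), "\n";
# prints: this is some text
```

With the module installed it would be passed in as shown above, via `$d->tokenize (extractor => $extractor)`.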
In a second step the content stream of the document is analyzed for patterns, such as numbers, dates or words. To control from the outside what is relevant and what should be done in which order, this is specified with a simple language.
Example:
$d->tokenize (tokenizers => 'COM&BO COM-BO');
Positive tokenizers detect patterns and bless them as valid tokens which will not be further analyzed or questioned:
WORDER
: detects words in the current locale
QUOTER
: detects substrings wrapped in ""
NUMBER
: detects decimal numbers
DATE
: detects date specifications in the current locale (NOT IMPLEMENTED!)
COM&BO
: detects patterns like AT&T
COM-BO
: detects patterns like T-Mobile
Capitalize
: detects capitalized words
Negative tokenizers detect patterns and immediately throw them away:
-WORDER
: everything which is left as a text fragment is suppressed
-QUOTER
: quoted text is suppressed
-NUMBER
: decimal numbers are suppressed
-INTERPUNCT
: interpunctuation characters are suppressed
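To give an intuition for what the positive tokenizers above do (a standalone sketch with plain regexes, not the module's actual implementation), something in the spirit of COM&BO/COM-BO and NUMBER could be approximated as:

```perl
use strict;
use warnings;

# Sketch only: approximate COM&BO / COM-BO (words joined by '&' or '-')
# and NUMBER (decimal numbers) with regexes over a text fragment.
my $text = 'AT&T shipped 42 units to T-Mobile';

my @tokens;
push @tokens, $text =~ /\b\w+[&-]\w+\b/g;   # COM&BO / COM-BO style patterns
push @tokens, $text =~ /\b\d+\b/g;          # NUMBER style patterns

print join (', ', @tokens), "\n";
# prints: AT&T, T-Mobile, 42
```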
Filters take existing tokens and either modify them, suppress them, or pass them through (and suppress everything else).
You can override and extend tokenizers and filters by tampering with the hashes
%TOKENIZERS
and
%FILTERS
. You can hook in, for instance, a stopword list like this:
my %stops = map { $_ => 1 } qw(Terror CIA HLS);
$TM::Corpus::Document::FILTERS{'!STOPS'} =
    sub { $_ = shift; return $stops{$_} ? '' : $_; };
$d->tokenize (tokenizers => ' .... !STOPS ....');
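The filter contract itself can be exercised standalone. The following runnable sketch (with hypothetical filter names, independent of the module) shows how a chain of filters either suppresses a token by returning '' or passes a possibly modified token on:

```perl
use strict;
use warnings;

# Sketch of the filter contract: a filter gets one token and returns
# either '' (suppress) or a (possibly modified) token.
my %FILTERS = (
    '!NUMBER' => sub { $_ = shift; return /^\d+$/ ? '' : $_; },  # suppress numbers
    'UPPER'   => sub { return uc shift; },                       # hypothetical modifier
);

my @tokens = qw(foo 42 bar);
for my $f ('!NUMBER', 'UPPER') {                    # apply filters in order
    @tokens = grep { length } map { $FILTERS{$f}->($_) } @tokens;
}
print join (',', @tokens), "\n";
# prints: FOO,BAR
```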
features
This method computes the feature vector for a document. It accepts all parameters of the method tokenize, as it will invoke that first. Additionally you can specify how the tokens are turned into features:
my %fv = $d->features (tokenizers => 'QUOTER NUMBER WORDER', featurizers => 'TOKEN1 TOKEN2');
The following featurizers are defined:
TOKEN1
: occurrences of single tokens in the document are counted
TOKEN2
: occurrences of two subsequent tokens in the document are counted
TOKEN3
: occurrences of groups of three subsequent tokens are counted
MIME
: the MIME type is converted into some numeric value
You can extend or modify the
%FEATURIZERS
hash to add your own featurizers.
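The assumed semantics of the TOKEN1 and TOKEN2 featurizers (counting single tokens and pairs of subsequent tokens, as described above) can be sketched standalone like this; the counting code is illustrative, not the module's implementation:

```perl
use strict;
use warnings;

# Sketch of TOKEN1/TOKEN2 style feature counting (assumed semantics).
my @tokens = qw(this is some text this is);

my %features;
$features{$_}++ for @tokens;                                       # TOKEN1: single tokens
$features{"$tokens[$_] $tokens[$_ + 1]"}++ for 0 .. $#tokens - 1;  # TOKEN2: token pairs

print "$features{'this is'}\n";
# prints: 2   (the pair 'this is' occurs twice)
```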
NOTES
No. Plucene tokenizing was NOT helpful.
SEE ALSO
COPYRIGHT AND LICENSE
Copyright 200[8] by Robert Barta, <drrho@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.