NAME
TM::Corpus::Document - Topic Maps, Document
SYNOPSIS
use TM::Corpus::Document;
my $d = new TM::Corpus::Document ({ mime => 'text/plain',
val => 'this is some text' });
# accessors
$val = $d->val ('new text');
$mime = $d->mime ('new/mime');
$url = $d->ref ('http://somewhere/some.txt');
my @tokens = $d->tokenize; # leaving defaults
# using some predefined tokenizing steps, in this order
my @tokens = $d->tokenize (tokenizers => 'NUMBER QUOTER COM&BO');
# using negative ones (i.e. throw things away)
my @tokens = $d->tokenize (tokenizers => 'COM&BO COM-BO -INTERPUNCT');
# using filters (detect numbers and throw them away)
my @tokens = $d->tokenize (tokenizers => 'NUMBER !NUMBER');
# get also debugging output
my @tokens = $d->tokenize (tokenizers => 'NUMBER TAP !NUMBER TAP');
# define your own filters
$TM::Corpus::Document::FILTERS{'!4LETTER'} =
sub { $_ = shift; return length($_) == 4 ? '' : $_; };
my @tokens = $d->tokenize (tokenizers => 'WORDER !4LETTER');
# collect features, here single tokens and two subsequent tokens
my %features = $d->features (tokenizers => '...',
featurizers => 'TOKEN1 TOKEN2');
ABSTRACT
This package implements documents, i.e. document-pertinent information such as the content, the corresponding MIME type and, if the document has one, a reference (URL) to it.
Most notable is the functionality to find tokens (i.e. word-like substrings) in the content and to derive from these a feature vector for the document.
DESCRIPTION
INTERFACE
Constructor
The constructor expects a hash reference with one or more of the following fields:
ref
A URI string referring to the network address of the document. In Topic Maps parlance this will be the subject locator for the document topic.
val
The character stream associated with the document.
mime
The MIME type of the content.
Methods
ref
Accessor for the ref component of the document. Nothing happens with the other components.
val
Accessor for the val component of the document. Nothing happens with the other components.
mime
Accessor for the mime component of the document. Nothing happens with the other components.
tokenize
This method returns a list reference to recognized tokens.
To generate this, the method will first find an extractor according to the document's MIME type. That will extract text, but also relevant meta data, such as title, length, etc. Some extractors are predefined; you can get a list with
perl -MTM::Corpus::Document -e 'warn join ",", keys %TM::Corpus::Document::EXTRACTORS;'
The extractor can also be overridden:
$d->tokenize (extractor => sub { ... });
It gets the value (content) as first parameter.
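As an illustration of that contract, here is a minimal, self-contained sketch of such an extractor sub (the tag-stripping logic is hypothetical, not the module's built-in behavior): it receives the raw content and returns the extracted text.

```perl
use strict;
use warnings;

# Hypothetical extractor sketch: strip HTML-ish tags and normalize whitespace.
my $extractor = sub {
    my $content = shift;         # raw document content as first parameter
    $content =~ s/<[^>]*>//g;    # drop anything that looks like a tag
    $content =~ s/\s+/ /g;       # collapse runs of whitespace
    return $content;
};

print $extractor->('<p>this is <b>some</b> text</p>'), "\n";
# prints: this is some text
```

With the module installed it would be passed in as shown above, via `$d->tokenize (extractor => $extractor)`.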
In a second step the content stream of the document is analyzed for patterns, such as numbers, dates or words. To control from the outside what is relevant and what should be done in which order, this is specified with a simple language.
Example:
$d->tokenize (tokenizers => 'COM&BO COM-BO');
Positive tokenizers detect patterns and bless them as valid tokens which will not be further analyzed or questioned:
WORDER
: detects words in the current locale
QUOTER
: detects substrings wrapped in ""
NUMBER
: detects decimal numbers
DATE
: detects date specifications in the current locale (NOT IMPLEMENTED!)
COM&BO
: detects patterns like AT&T
COM-BO
: detects patterns like T-Mobile
Capitalize
: detects capitalized words
Negative tokenizers detect patterns and immediately throw them away:
-WORDER
: everything which is left as a text fragment is suppressed
-QUOTER
: quoted text is suppressed
-NUMBER
: decimal numbers are suppressed
-INTERPUNCT
: interpunctuation characters are suppressed
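To give an intuition for what the positive tokenizers above do (a standalone sketch with plain regexes, not the module's actual implementation), something in the spirit of COM&BO/COM-BO and NUMBER could be approximated as:

```perl
use strict;
use warnings;

# Sketch only: approximate COM&BO / COM-BO (words joined by '&' or '-')
# and NUMBER (decimal numbers) with regexes over a text fragment.
my $text = 'AT&T shipped 42 units to T-Mobile';

my @tokens;
push @tokens, $text =~ /\b\w+[&-]\w+\b/g;   # COM&BO / COM-BO style patterns
push @tokens, $text =~ /\b\d+\b/g;          # NUMBER style patterns

print join (', ', @tokens), "\n";
# prints: AT&T, T-Mobile, 42
```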
Filters take existing tokens and either modify them, suppress them, or pass them through (and suppress everything else).
You can override and extend tokenizers and filters by tampering with the hashes
%TOKENIZERS
and
%FILTERS
. You can hook in, for instance, a stopword list like this:
my %stops = map { $_ => 1 } qw(Terror CIA HLS);
$TM::Corpus::Document::FILTERS{'!STOPS'} =
    sub { $_ = shift; return $stops{$_} ? '' : $_; };
$d->tokenize (tokenizers => ' .... !STOPS ....');
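The filter contract itself can be exercised standalone. The following runnable sketch (with hypothetical filter names, independent of the module) shows how a chain of filters either suppresses a token by returning '' or passes a possibly modified token on:

```perl
use strict;
use warnings;

# Sketch of the filter contract: a filter gets one token and returns
# either '' (suppress) or a (possibly modified) token.
my %FILTERS = (
    '!NUMBER' => sub { $_ = shift; return /^\d+$/ ? '' : $_; },  # suppress numbers
    'UPPER'   => sub { return uc shift; },                       # hypothetical modifier
);

my @tokens = qw(foo 42 bar);
for my $f ('!NUMBER', 'UPPER') {                    # apply filters in order
    @tokens = grep { length } map { $FILTERS{$f}->($_) } @tokens;
}
print join (',', @tokens), "\n";
# prints: FOO,BAR
```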
features
This method computes the feature vector for a document. It accepts all parameters of the method tokenize, as it will invoke that first. Additionally you can specify how the tokens are turned into features:
my %fv = $d->features (tokenizers => 'QUOTER NUMBER WORDER', featurizers => 'TOKEN1 TOKEN2');
The following featurizers are defined:
TOKEN1
: occurrences of single tokens in the document are counted
TOKEN2
: occurrences of two subsequent tokens in the document are counted
TOKEN3
: occurrences of groups of three subsequent tokens are counted
MIME
: the MIME type is converted into some numeric value
You can extend or modify the
%FEATURIZERS
hash to add your own featurizers.
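The assumed semantics of the TOKEN1 and TOKEN2 featurizers (counting single tokens and pairs of subsequent tokens, as described above) can be sketched standalone like this; the counting code is illustrative, not the module's implementation:

```perl
use strict;
use warnings;

# Sketch of TOKEN1/TOKEN2 style feature counting (assumed semantics).
my @tokens = qw(this is some text this is);

my %features;
$features{$_}++ for @tokens;                                       # TOKEN1: single tokens
$features{"$tokens[$_] $tokens[$_ + 1]"}++ for 0 .. $#tokens - 1;  # TOKEN2: token pairs

print "$features{'this is'}\n";
# prints: 2   (the pair 'this is' occurs twice)
```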
NOTES
No. Plucene tokenizing was NOT helpful.
SEE ALSO
COPYRIGHT AND LICENSE
Copyright 200[8] by Robert Barta, <drrho@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.