NAME

Circa::Parser - functions used by Circa to parse HTML pages

SYNOPSIS

use Circa::Indexer;
my $index = Circa::Indexer->new;
$index->connect(...);
#$index->Parser->look_at($url,account);

DESCRIPTION

This module builds on HTML::Parser. Circa::Indexer calls it to index each document. The main method is look_at.

VERSION

$Revision: 1.8 $

Public Class Interface

new($indexer_instance)

Creates a new Circa::Parser object that carries the properties of the given indexer instance.

tag

Method called for each HTML tag found in HTML pages.

text

Method called for the text content of each tag in HTML pages.
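
The tag and text callbacks can be pictured with a small HTML::Parser sketch. It is only an illustration of the event-driven style: the names on_tag and on_text are hypothetical, and how Circa::Parser actually wires its tag and text methods into HTML::Parser is not stated here.

```perl
use strict;
use warnings;
use HTML::Parser;

my (@tags, @texts);

# Hypothetical callbacks standing in for the tag and text methods.
sub on_tag {
    my ($tagname, $attr) = @_;
    push @tags, $tagname;                  # tag name, lower-cased by HTML::Parser
}

sub on_text {
    my ($text) = @_;
    push @texts, $text if $text =~ /\S/;   # ignore whitespace-only text
}

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h     => [\&on_tag,  'tagname, attr'],
    text_h      => [\&on_text, 'dtext'],
);
$parser->unbroken_text(1);   # deliver each text run as a single event

$parser->parse('<html><body><h1>Title</h1><a href="/x.html">link</a></body></html>');
$parser->eof;

print "tags:  @tags\n";
print "texts: @texts\n";
```

An indexer would typically weight the text differently depending on the enclosing tag, and collect href attributes from the attr hash as candidate links.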

look_at ($url,$idc,$idr,$lastModif,$url_local, $categorieAuto,$niveau,$categorie)

Indexes a URL. The work done is:

  • Check that the URL is valid. Return -1 otherwise.

  • Fetch the page and add each word found, with the weight set in the constructor.

  • If the maximum link level has not been reached, add each link found for the next indexing pass.

Parameters:

  • $url : URL to read

  • $idc : Id of the URL in the links table

  • $idr : Id of the account's URL

  • $lastModif (optional) : If this parameter is set, Circa does no work on the page when the page is older than this date.

  • $url_local (optional) : Local URL used to reach the file

  • $categorieAuto (optional) : If $categorieAuto is true, Circa creates and sets the category of the URL from its directory path. Example:

    http://www.alianwebserver.com/societe/stvalentin/index.html creates the category Societe / StValentin and assigns it to this URL.

    If $categorieAuto is false, $categorie is used.

  • $niveau (optional) : Level of the current link.

  • $categorie (optional) : See $categorieAuto.

Returns (-1,0) if the URL is not valid. Otherwise returns the number of words and the number of links found.
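
The directory-to-category idea behind $categorieAuto can be sketched as below. This is an assumption-laden illustration, not the module's code: category_from_url is a hypothetical helper, and it only upper-cases the first letter of each directory, so it yields "Societe / Stvalentin" and does not reproduce the capital V shown in the documented example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Circa::Parser.
sub category_from_url {
    my ($url) = @_;
    # Keep only the path part that follows scheme://host.
    my ($path) = $url =~ m{^[a-z][a-z0-9+.-]*://[^/]+(/.*)?$}i;
    return '' unless defined $path;
    $path =~ s{[?#].*$}{};   # drop query string and fragment
    $path =~ s{[^/]*$}{};    # drop the file name after the last slash
    my @dirs = grep { length } split m{/}, $path;
    # Assumption: each directory becomes one category level.
    return join ' / ', map { ucfirst($_) } @dirs;
}

print category_from_url(
    'http://www.alianwebserver.com/societe/stvalentin/index.html'), "\n";
```

A URL with no directory component gives an empty string, in which case a caller would presumably fall back to $categorie.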

set_agent($locale)

Sets the user agent for the Circa robot. If $locale == 0 or $self->{ConfigMoteur}->{'temporate'} == 0, LWP::UserAgent is used. Otherwise LWP::RobotUA is used.
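
The selection rule can be written as a tiny decision function. This is a sketch only: agent_class is a hypothetical name, it takes the two flags as plain arguments instead of reading $self->{ConfigMoteur}, and it returns the class name without constructing an agent.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the rule described above.
sub agent_class {
    my ($locale, $temporate) = @_;
    # A zero (or undefined) value for either flag selects the plain agent.
    return 'LWP::UserAgent' if !$locale || !$temporate;
    # Otherwise use the robot agent, which honours robots.txt and
    # waits between requests to the same server.
    return 'LWP::RobotUA';
}

print agent_class(0, 1), "\n";
print agent_class(1, 1), "\n";
```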

analyse($data,$facteur,%l)

Extracts each word from the buffer $data and assigns it an occurrence frequency. The results are stored in the hash passed as a parameter, in the form %l=('word'=>factor).

  • $data : buffer to analyse

  • $facteur : factor to assign to each word found

  • %l : hash in which the result is stored

Returns a reference to the hash.
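
A minimal sketch of this kind of weighted word count follows. It rests on assumptions not confirmed by the text: analyse_sketch is a hypothetical stand-in, a word is taken to be a run of word characters compared case-insensitively, and every occurrence adds the factor to the word's score.

```perl
use strict;
use warnings;

# Hypothetical stand-in for analyse($data, $facteur, %l).
sub analyse_sketch {
    my ($data, $factor, %words) = @_;
    # Assumption: words are runs of \w characters, lower-cased.
    for my $word (map { lc($_) } $data =~ /(\w+)/g) {
        $words{$word} += $factor;   # each occurrence adds the factor
    }
    return \%words;                 # reference to the updated hash
}

my $result = analyse_sketch('Perl parser, perl indexer', 2, (circa => 1));
print "$_ => $result->{$_}\n" for sort keys %$result;
```

Because the hash is passed by value, the caller's copy is not modified; the returned reference is what carries the result.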

check_links($tag,$links)

Checks whether the URL $links should be added to Circa. The URL must begin with $self->host_indexed, and its extension must not be one of doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid.

If $links is accepted, returns the URL. Otherwise returns 0.
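
The two tests can be sketched as a standalone function. This is an illustration under assumptions: check_link_sketch is a hypothetical name, the indexed host is passed as an argument instead of being read from $self->host_indexed, the $tag argument is left out, and the extension comparison is case-sensitive because the list names both gif and GIF.

```perl
use strict;
use warnings;

# Extensions listed in the description above.
my @REJECTED = qw(doc zip ps gif jpg gz pdf eps png deb xls ppt
                  class GIF css js wav mid);

# Hypothetical stand-in for check_links.
sub check_link_sketch {
    my ($link, $host_indexed) = @_;
    # The URL must start with the indexed host prefix.
    return 0 unless index($link, $host_indexed) == 0;
    # Take whatever follows the last dot of the last path segment.
    my ($ext) = $link =~ /\.([^.\/]+)$/;
    return 0 if defined $ext && grep { $_ eq $ext } @REJECTED;
    return $link;
}

my $host = 'http://www.alianwebserver.com';
print check_link_sketch("$host/societe/index.html", $host), "\n";
print check_link_sketch("$host/files/manual.pdf", $host), "\n";
```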

AUTHOR

Alain BARBET alian@alianwebserver.com
