NAME

Search::Circa::Parser - functions used by Circa to parse HTML pages

SYNOPSIS

use Search::Circa::Indexer;
my $index = Search::Circa::Indexer->new;
$index->connect(...);
$index->Parser->look_at($url, $account);

DESCRIPTION

This module uses the facilities of HTML::Parser. It is called by Search::Circa::Indexer to index each document. The main method is look_at.

VERSION

$Revision: 1.19 $

Public Class Interface

new($indexer_instance)

Creates a new Search::Circa::Parser object with the properties of the given indexer instance.

look_at ($url,$idc,$idr,$lastModif,$url_local, $categorieAuto,$niveau,$categorie)

Indexes a URL. The work is done in three steps:

  • Check that the URL is valid. If it is not, return -1.

  • Fetch the page and add each word found, with the weight set in the constructor.

  • If the maximum link level has not been reached, add each link found so that it is indexed on the next pass.

Parameters:

  • $url : URL to read.

  • $idc : Id of the URL in the links table.

  • $idr : Id of the account the URL belongs to.

  • $lastModif (optional) : If this parameter is set, Circa does no work on the page when the page is older than this date.

  • $url_local (optional) : Local URL used to reach the file.

  • $categorieAuto (optional) : If $categorieAuto is true, Circa creates and sets the category of the URL from its directory path. For example:

    http://www.alianwebserver.com/societe/stvalentin/index.html creates the category Societe / StValentin and assigns it to this URL.

    If $categorieAuto is false, $categorie is used instead.

  • $niveau (optional) : Depth level of the current link.

  • $categorie (optional) : See $categorieAuto.

Returns (-1, 0) if the URL is not valid; otherwise returns the number of words and the number of links found.
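The $categorieAuto rule described above can be sketched as a small stand-alone function. This is an illustration only, not the module's code: category_from_url is a hypothetical helper, and capitalising the first letter of each directory is an assumption. A simple rule like this cannot recover the mixed case "StValentin" shown in the example, so the module presumably applies its own naming.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Circa::Parser: derive a
# category label from the directory part of a URL.
sub category_from_url {
    my ($url) = @_;

    # Strip the scheme and host, keeping only the path.
    (my $path = $url) =~ s{^[a-z][a-z0-9+.-]*://[^/]*}{}i;

    # Drop any query string or fragment.
    $path =~ s{[?#].*$}{};

    my @parts = grep { length } split m{/}, $path;

    # When the path does not end in '/', the last component is a file name.
    pop @parts if @parts && $path !~ m{/$};

    # Assumption: each directory name becomes one category level,
    # capitalised with ucfirst.
    return join ' / ', map { ucfirst($_) } @parts;
}

print category_from_url(
    'http://www.alianwebserver.com/societe/stvalentin/index.html'), "\n";
# prints "Societe / Stvalentin"
```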

set_agent($locale)

Sets the user agent for the Circa robot. If $locale is 0 or $self->{ConfigMoteur}->{'temporate'} is 0, LWP::UserAgent is used; otherwise LWP::RobotUA is used.

analyse($data,$facteur,%l)

Extracts each word from the buffer $data and assigns it a frequency of occurrence. The results are stored in the associative array passed as a parameter, in the form %l = ('word' => factor).

  • $data : buffer to analyse

  • $facteur : factor to assign to each word found

  • %l : associative array in which the result is stored

Returns a reference to the hash.
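A minimal sketch of this kind of weighted word count follows. It is not the module's implementation: analyse_sketch is a hypothetical name, and both the tokenisation (runs of two or more word characters, lowercased) and the choice to accumulate the factor on repeated words are assumptions.

```perl
use strict;
use warnings;

# Hypothetical re-implementation, for illustration only.
sub analyse_sketch {
    my ($data, $facteur, %l) = @_;

    # Assumption: a word is a run of two or more word characters,
    # compared case-insensitively.
    for my $word (map { lc($_) } $data =~ /(\w{2,})/g) {
        # Assumption: the factor is added once per occurrence.
        $l{$word} += $facteur;
    }
    return \%l;
}

# Start from an existing weight for "circa" and add a title-like buffer.
my $res = analyse_sketch('Circa indexes pages; Circa ranks pages',
                         10, circa => 5);
printf "%s => %d\n", $_, $res->{$_} for sort keys %$res;
```

Because the hash is passed by value and returned by reference, the caller keeps working with the returned reference rather than with the hash it passed in.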

tag

Method called for each HTML tag found in an HTML page.

text

Method called for the text content of each tag in an HTML page.

check_links($tag,$links)

Checks whether the URL $links should be added to Circa. The URL must begin with $self->host_indexed, and its extension must not be one of doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid.

If $links is accepted, returns the URL; otherwise returns 0.
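The two rules above can be sketched as a stand-alone filter. This is an approximation under stated assumptions, not the module's code: check_link_sketch is a hypothetical name, $host stands in for $self->host_indexed, the $tag argument is left out, and the extension is compared case-sensitively because the documented list names both gif and GIF.

```perl
use strict;
use warnings;

# Extensions taken from the list in the documentation above.
my %SKIP = map { $_ => 1 } qw(
    doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid
);

# Hypothetical stand-alone version of the check.
sub check_link_sketch {
    my ($host, $link) = @_;

    # Rule 1: the URL must begin with the indexed host prefix.
    return 0 unless index($link, $host) == 0;

    # Assumption: ignore any query string or fragment when looking
    # for the extension.
    (my $path = $link) =~ s{[?#].*$}{};

    # Rule 2: reject the URL if its extension is in the skip list.
    if ($path =~ m{\.([^./]+)$}) {
        return 0 if $SKIP{$1};
    }
    return $link;
}

my $host = 'http://www.alianwebserver.com/';
for my $link ("${host}societe/index.html", "${host}doc/manual.pdf",
              'http://other.example/index.html') {
    printf "%-45s %s\n", $link, check_link_sketch($host, $link) ? 'kept' : 'skipped';
}
```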

AUTHOR

Alain BARBET alian@alianwebserver.com
