NAME

Circa::Indexer - functions to administer Circa, a web search engine backed by MySQL

SYNOPSIS

use Circa::Indexer;
my $indexor = Circa::Indexer->new;

# Connect to the MySQL database used by Circa
if (!$indexor->connect_mysql($user,$pass,$db))
 {die "MySQL connection error: $DBI::errstr\n";}

# Create the tables needed by Circa
$indexor->create_table_circa;

# Drop all Circa tables (destructive)
$indexor->drop_table_circa;

# Register a site: URL, contact email, title
$indexor->addSite("http://www.alianwebserver.com/",
                  'alian@alianwebserver.com',
                  "Alian Web Server");

# Parse the newly added URLs of account 1
my ($nbIndexe,$nbAjoute,$nbWords,$nbWordsGood) = $indexor->parse_new_url(1);
print   "$nbIndexe pages indexed, ",
  "$nbAjoute pages added, ",
  "$nbWordsGood words indexed, ",
  "$nbWords words read\n";

# Update URLs of account 1 not visited for 30 days
$indexor->update(30,1);

See also admin.pl, admin.cgi and admin_compte.cgi.

DESCRIPTION

This is Circa::Indexer, a module that provides functions to administer Circa, a web search engine backed by MySQL. Circa can serve your own web site or a list of sites. It indexes pages the way Altavista does: it reads, adds and parses every URL found in a page, then stores the URLs and words in MySQL for use at search time.

This module provides routines to:

  • Add URLs

  • Create and update each account

  • Parse URLs, index words, and so on

  • Administer the URLs already present

Notes:

  • Files with these extensions are not added: doc,zip,ps,gif,jpg,gz,pdf,eps,png, deb,xls,ppt,class,GIF,css,js,wav,mid

  • The weight for each word is set in the hash %ConfigMoteur
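The extension rule in the first note can be sketched as a small filter. This is only an illustration, not the module's own code: the helper name is_indexable is hypothetical, and case-sensitive matching is an assumption suggested by the list naming both gif and GIF.

```perl
use strict;
use warnings;

# Extensions copied from the note above; matching is case-sensitive here,
# which is an assumption (the list names both gif and GIF).
my %skipped = map { $_ => 1 } qw(
    doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid
);

# Hypothetical helper: returns 1 when the URL's file extension is not skipped.
sub is_indexable {
    my ($url) = @_;
    (my $path = $url) =~ s/[?#].*\z//s;        # drop query string and fragment
    return 1 unless $path =~ m{\.([^./]+)\z};  # no extension: keep the URL
    return $skipped{$1} ? 0 : 1;
}

for my $url ('http://www.alianwebserver.com/index.html',
             'http://www.alianwebserver.com/logo.gif') {
    printf "%s: %s\n", $url, is_indexable($url) ? 'indexed' : 'skipped';
}
```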

Features

  • Search Features

    • Boolean query language: OR (default), AND ("+"), NOT ("-"). Example: perl + faq -cgi matches documents that contain faq, possibly perl, and not cgi.

    • Perl or PHP client

    • Sites can be browsed by directory or category.

    • Search by different criteria: news, last modified date, language, URL / site.

  • Full text indexing

  • Different weights for the title, keywords, description and the rest of the HTML page can be set in the configuration

  • Inherits features from the LWP suite:

    • Supports the HTTP://, FTP:// and FILE:// protocols (a filesystem can be indexed without going through a web server)

    • Full support for the robots exclusion standard (robots.txt). Identifies itself as CircaIndexer/0.1, with contact address alian@alianwebserver.com. Requests to the same server are delayed by 8 seconds. "It's not a bug, it's a feature!" This is a basic rule for limiting HTTP server load.

    • Supports HTTP proxies.

  • Builds the index in MySQL

  • Reads HTML and plain text

  • Several kinds of indexing: full, incremental, only on a particular server.

  • Documents that have not been updated are not reindexed.

  • Every request for a file starts with an HTTP HEAD request, to obtain information such as validity, last update, size, etc. The size of documents read can be restricted (e.g. skip all documents > 5 MB), which helps on low-bandwidth connections or on computers with little memory.

  • HTML templates can easily be customized for your needs.

  • Admin functions are available through a browser interface or the command line.

  • Can index the links found in a CGI (everything after name_of_file?)
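To make the boolean query syntax listed under Search Features more concrete, here is a minimal sketch of how such a query could be split into required, excluded and optional terms. The helper parse_query is hypothetical and is not taken from the Circa search client. Letting a lone "+" or "-" apply to the next word is an assumption made so that the query from the example parses as described.

```perl
use strict;
use warnings;

# Hypothetical helper: split a query into required ("+"), excluded ("-")
# and optional (default OR) terms. A lone "+" or "-" applies to the next word.
sub parse_query {
    my ($query) = @_;
    my %terms = (required => [], excluded => [], optional => []);
    my $pending = '';
    for my $token (split ' ', $query) {
        my ($op, $word) = $token =~ /\A([+-]?)(.*)\z/s;
        $pending = $op if $op ne '';
        next if $word eq '';                   # operator written on its own
        my $kind = $pending eq '+' ? 'required'
                 : $pending eq '-' ? 'excluded'
                 :                   'optional';
        push @{ $terms{$kind} }, lc $word;
        $pending = '';
    }
    return \%terms;
}

my $terms = parse_query('perl + faq -cgi');
print "$_: @{ $terms->{$_} }\n" for qw(required excluded optional);
```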

How it works

Circa parses each HTML document and converts it to text. It counts every word found and stores the result in a hash keyed by word. In addition, it reads the title, keywords and description, and adds a weight to every word found there.

Example:

my %ConfigMoteur=(
  'author'              => 'circa@alianwebserver.com', # Person responsible for the engine
  'temporate'           => 1,  # Wait 8 s between requests to a server
  'facteur_keyword'     => 15, # <meta name="KeyWords"
  'facteur_description' => 10, # <meta name="description"
  'facteur_titre'       => 10, # <title></title>
  'facteur_full_text'   => 1,  # rest of the page
  'facteur_url'         => 15, # Words found in the URL
  'nb_min_mots'         => 2,  # minimum weight for a word to be kept
  'niveau_max'          => 7,  # Maximum level to index
  'indexCgi'            => 0,  # Index CGI links (ex: ?nom=toto&riri=eieiei)
  );

<html>
<head>
<meta name="KeyWords"
      CONTENT="informatique,computing,javascript,CGI,perl">
<meta name="Description" 
      CONTENT="Rubriques Informatique (Internet,Java,Javascript, CGI, Perl)">
<title>Alian Web Server:Informatique,Société,Loisirs,Voyages</title>
</head>
<body>
different word: cgi, perl, cgi
</body>
</html>

After parsing, the hash contains:

$words{'informatique'} = 15 + 10 + 10 = 35
$words{'cgi'} = 15 + 10 + 1
$words{'different'} = 1

A word is added to the database if its total is greater than $ConfigMoteur{'nb_min_mots'} (2 by default). If you set this to 1, the database will grow very quickly, but you will be able to run very precise searches with many words, such as phrase searches. If you do that, keep an eye on the size of the relation table.
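The weighting above can be reproduced with a short sketch. It is an illustration under stated assumptions, not the module's implementation: the names %factor, %field and weigh_words are hypothetical, the field contents are copied by hand from the sample page, and each field's weight is added once per distinct word, which is the counting rule that matches the totals shown for 'informatique', 'cgi' and 'different'.

```perl
use strict;
use warnings;
use utf8;

# Weights copied from the %ConfigMoteur example above.
my %factor = (
    keyword     => 15,
    description => 10,
    title       => 10,
    full_text   => 1,
);

# Field contents of the sample HTML page above, extracted by hand.
my %field = (
    keyword     => 'informatique,computing,javascript,CGI,perl',
    description => 'Rubriques Informatique (Internet,Java,Javascript, CGI, Perl)',
    title       => 'Alian Web Server:Informatique,Société,Loisirs,Voyages',
    full_text   => 'different word: cgi, perl, cgi',
);

# Hypothetical helper: add a field's weight once per distinct word in it
# (an assumption chosen to match the totals shown above).
sub weigh_words {
    my ($fields, $factors) = @_;
    my %words;
    for my $name (keys %$fields) {
        my %seen;
        for my $word (split /[^\p{L}\p{N}]+/, lc $fields->{$name}) {
            next if $word eq '' or $seen{$word}++;
            $words{$word} += $factors->{$name};
        }
    }
    return \%words;
}

my $words = weigh_words(\%field, \%factor);
printf "%-12s %d\n", $_, $words->{$_} for qw(informatique cgi different);

# Keep only words whose total exceeds the nb_min_mots threshold (2 by default).
my $nb_min_mots = 2;
my @kept = sort { $a cmp $b } grep { $words->{$_} > $nb_min_mots } keys %$words;
print "kept: @kept\n";
```

With the default threshold, 'different' (total 1) is dropped while 'informatique' and 'cgi' are kept.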

Once a page has been read, Circa looks at the HTML links it contains, then at the links of those pages, and so on. Each step increases the level by one, and a URL is added only if its level is less than $ConfigMoteur{'niveau_max'}.
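The level rule can be sketched as a breadth-first walk over a link graph. This is a simplified illustration: the in-memory %links_in graph stands in for pages fetched over the network, the crawl helper is hypothetical, and starting the first page at level 0 is an assumption.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for pages fetched by the indexer.
my %links_in = (
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1'],
    '/a/1' => ['/a/1/x'],
    '/b'   => ['/'],
);

# Hypothetical helper: follow links breadth-first and record each URL's level.
# The start page is assumed to be level 0.
sub crawl {
    my ($start, $niveau_max, $graph) = @_;
    my %level = ($start => 0);
    my @queue = ($start);
    while (defined(my $url = shift @queue)) {
        my $next_level = $level{$url} + 1;
        for my $next (@{ $graph->{$url} || [] }) {
            next if exists $level{$next};            # already known
            next unless $next_level < $niveau_max;   # too deep: not added
            $level{$next} = $next_level;
            push @queue, $next;
        }
    }
    return \%level;
}

my $level = crawl('/', 3, \%links_in);
print "$_ (level $level->{$_})\n" for sort keys %$level;
```

With niveau_max set to 3, '/a/1/x' would be at level 3 and is therefore left out.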

VERSION

$Revision: 1.24 $

Class Interface

Constructors and Instance Methods

new [PARAMHASH]

You can use the following keys in PARAMHASH:

author

Default: 'circa@alianwebserver.com'. Appears in the log files of the indexed web servers (as the agent).

temporate

Default: 1, boolean. If true, the indexer waits 8 s between requests to the same server and uses LWP::RobotUA. Otherwise it uses LWP::UserAgent, which is faster because it does not fetch and parse robots.txt rules, but less polite: a robot should always identify itself and avoid putting a heavy load on servers.

facteur_keyword

Default: 15, weight of a word found in the KeyWords meta tag

facteur_description

Default: 10, weight of a word found in the description meta tag

facteur_titre

Default: 10, weight of a word found in <title></title>

facteur_full_text

Default: 1, weight of a word found in the rest of the page

facteur_url

Default: 15, weight of a word found in the URL

nb_min_mots

Default: 2, minimal number of times a word must be found to be added

niveau_max

Default: 7, maximum number of link levels to follow

indexCgi

Default: 0, whether or not to follow CGI links (ex: ?nom=toto&riri=eieiei)
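A constructor that accepts a PARAMHASH usually merges the caller's values over a table of defaults. The sketch below shows that pattern with the defaults listed above. Sketch::Indexer is a hypothetical class, not Circa::Indexer itself, and the way the real constructor stores its settings is not documented here.

```perl
use strict;
use warnings;

package Sketch::Indexer;   # hypothetical class, not Circa::Indexer itself

# Defaults taken from the list of PARAMHASH keys above.
my %DEFAULTS = (
    author              => 'circa@alianwebserver.com',
    temporate           => 1,
    facteur_keyword     => 15,
    facteur_description => 10,
    facteur_titre       => 10,
    facteur_full_text   => 1,
    facteur_url         => 15,
    nb_min_mots         => 2,
    niveau_max          => 7,
    indexCgi            => 0,
);

sub new {
    my ($class, %param) = @_;
    # Caller-supplied keys override the defaults.
    my $self = { %DEFAULTS, %param };
    return bless $self, $class;
}

package main;

my $indexer = Sketch::Indexer->new(niveau_max => 3, indexCgi => 1);
print "$_ = $indexer->{$_}\n" for qw(niveau_max indexCgi facteur_keyword);
```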

size_max($size)

Get or set the maximum size of files read by the indexer (to avoid memory problems).

host_indexed($host)

Get or set the host indexed.

set_host_indexed($url)

Set the base directory from $url. It is used to restrict indexing to files found in subdirectories on this server.

proxy($adr_proxy)

Get or set the proxy used by LWP::RobotUA or LWP::UserAgent

Example: $circa->proxy('http://proxy.sn.no:8001/');

Methods used for global administration

addSite($url,$email,$titre,$categorieAuto,$cgi,$rep,$file);

Add the site at URL $url, whose contact email address is $email, to the Circa database, creating an account for it. Returns the id of the account created.

addLocalSite($url,$email,$titre,$local_url,$path, $urlRacine,$categorieAuto,$cgi,$rep,$file);

Add a local $url

parse_new_url($idp)

Parse the pages that have just been added. The program analyzes every page whose 'parse' column equals 0.

Returns the number of pages analyzed, the number of pages added, and the number of words indexed.

update($xj,[$idp])

Update the URLs not visited in the last $xj days for account $idp. If $idp is not given, 1 is used. URLs that have never been parsed are indexed as well.

Returns ($nb,$nbAjout,$nbWords,$nbWordsGood)

  • $nb: Number of links found

  • $nbAjout: Number of links added

  • $nbWords: Number of words found

  • $nbWordsGood: Number of words added
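The selection rule for update (URLs older than $xj days, plus URLs never parsed) can be pictured with a small filter over in-memory records. Everything here is an assumption made for illustration: the helper urls_to_update, the url and last_visit field names, and the use of epoch timestamps do not come from the Circa schema.

```perl
use strict;
use warnings;

# Hypothetical helper: return the URLs that are due for an update.
# Each row is a hash with 'url' and 'last_visit' (epoch seconds, or undef
# when the URL has never been parsed). These field names are assumptions.
sub urls_to_update {
    my ($xj, $now, @rows) = @_;
    my $cutoff = $now - $xj * 24 * 60 * 60;
    return map  { $_->{url} }
           grep { !defined $_->{last_visit} || $_->{last_visit} < $cutoff }
           @rows;
}

my $now  = time;
my $day  = 24 * 60 * 60;
my @rows = (
    { url => 'http://www.alianwebserver.com/',      last_visit => $now - 45 * $day },
    { url => 'http://www.alianwebserver.com/new',   last_visit => undef },
    { url => 'http://www.alianwebserver.com/fresh', last_visit => $now - 2 * $day },
);
print "$_\n" for urls_to_update(30, $now, @rows);
```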

create_table_circa

Create the tables needed by Circa:

  • categorie : Site categories

  • links : List of URLs

  • responsable : Link to the person responsible for each link

  • relations : List of words / ids of indexed sites

  • inscription : Temporary registrations

drop_table_circa

Drop all Circa tables. Be careful!

drop_table_circa_id($id)

Drop the Circa tables for user $id

create_table_circa_id($id)

Create the tables needed by Circa for instance $id:

  • categorie : Site categories

  • links : List of URLs

  • relations : List of words / ids of indexed sites

  • stats : List of queries

export([$mysqldump], [$path])

Export data from MySQL to $path/circa.sql

$mysqldump: path to the mysqldump binary. If not given, it is searched for in /usr/bin/mysqldump, /usr/local/bin/mysqldump, /opt/bin/mysqldump.

$path: path of the directory where circa.sql will be created. If not given, the file is created in the current directory.
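The lookup order described for $mysqldump can be sketched as follows. The helper find_binary is hypothetical and only illustrates the "explicit path first, then known locations" rule. It does not run mysqldump and is not the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helper: return the explicit path if one is given, otherwise
# the first candidate that exists and is executable, otherwise undef.
sub find_binary {
    my ($explicit, @candidates) = @_;
    return $explicit if defined $explicit;
    for my $path (@candidates) {
        return $path if -f $path && -x _;
    }
    return;
}

my $mysqldump = find_binary(undef,
    '/usr/bin/mysqldump', '/usr/local/bin/mysqldump', '/opt/bin/mysqldump');
print defined $mysqldump ? "Found $mysqldump\n" : "mysqldump not found\n";
```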

import_data([$mysql], [$path])

Import data into MySQL from circa.sql

$mysql: path to the mysql binary. If not given, it is searched for in /usr/bin/mysql, /usr/local/bin/mysql, /opt/bin/mysql

$path: path of the directory from which circa.sql will be read. If not given, the file is read from the current directory.

Methods for administering each account

admin_compte($compte)

Returns a list of items about account $compte:

  • $responsable : Email address of the person responsible

  • $titre : Title of the site for this account

  • $nb_page : Number of URLs added to Circa for this site

  • $nb_words : Number of words indexed

  • $last_index : Date of the last indexing

  • $nb_requetes : Number of queries run on this site

  • $racine : First page registered

Returns a reference to a hash representing the list of the $max most frequent words in the database for account $id

stat_request($id)

Returns some statistics about the queries made on Circa

inscription($email,$url,$titre)

Registers a site in a temporary table

HTML functions

header_compte

Function used with the CGI admin_compte.cgi. Displays the list of features of admin_compte.cgi for this account

get_liste_liens($id)

Returns a buffer containing a select tag populated with data from the links table for account $id
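As a rough picture of what such a buffer can look like, the sketch below builds a select tag from a list of rows. The helpers escape_html and select_tag, the [id, label] row layout and the field name id_link are all hypothetical. The markup the module actually emits may differ.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attribute values.
sub escape_html {
    my ($text) = @_;
    my %entity = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical helper: build a select tag from rows of the form [ $id, $label ].
sub select_tag {
    my ($name, @rows) = @_;
    my $buffer = sprintf(qq{<select name="%s">\n}, escape_html($name));
    for my $row (@rows) {
        $buffer .= sprintf(qq{  <option value="%s">%s</option>\n},
                           escape_html($row->[0]), escape_html($row->[1]));
    }
    return $buffer . "</select>\n";
}

print select_tag('id_link',
    [ 1, 'http://www.alianwebserver.com/' ],
    [ 2, 'http://www.alianwebserver.com/?a=1&b=2' ]);
```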

get_liste_liens_a_valider($id)

Returns a buffer containing a select tag populated with data from the links table for account $id, limited to links that have not been validated

get_liste_site

Returns a buffer containing a select tag populated with data from the responsable table

get_liste_langues

Returns a buffer containing a select tag populated with data from the responsable table

get_liste_mot

Returns a buffer containing a select tag populated with data from the responsable table

AUTHOR

Alain BARBET alian@alianwebserver.com
