NAME
Circa::Indexer - provides functions to administer Circa, a WWW search engine running on MySQL
SYNOPSIS
use Circa::Indexer;
my $indexor = Circa::Indexer->new;
if (!$indexor->connect_mysql($user,$pass,$db))
{die "MySQL connection error: $DBI::errstr\n";}
$indexor->create_table_circa;
$indexor->drop_table_circa;
$indexor->addSite("http://www.alianwebserver.com/",
'alian@alianwebserver.com',
"Alian Web Server");
my ($nbIndexe,$nbAjoute,$nbWords,$nbWordsGood) = $indexor->parse_new_url(1);
print "$nbIndexe pages indexed, ",
"$nbAjoute pages added, ",
"$nbWordsGood words indexed, ",
"$nbWords words read\n";
$indexor->update(30,1);
See also admin.pl, admin.cgi and admin_compte.cgi.
DESCRIPTION
This is Circa::Indexer, a module that provides functions to administer Circa, a WWW search engine running on MySQL. Circa can index your own Web site or a list of sites. It indexes the way AltaVista does: it reads and parses pages and adds every url found in them. Urls and words are stored in MySQL for use at search time.
This module provides routines to:
Add urls
Create and update each account
Parse urls, index words, and so on
Administer the urls already present
Notes:
Files with the following extensions are not added: doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid
The weight for each word is set in the hash %ConfigMoteur
Features
Search Features
Boolean query language support: or (default), and ("+"), not ("-"). Ex: perl + faq -cgi returns documents that contain faq, possibly perl, and not cgi.
Perl or PHP client
Sites can be browsed by directory / category.
Search by different criteria: news, last modified date, language, URL / site.
Full text indexing
Different weights can be configured for the title, keywords, description and the rest of the HTML page
Inherits features from the LWP suite:
Supports the HTTP://, FTP:// and FILE:// protocols (a filesystem can be indexed without going through a Web server)
Full support for the robots exclusion standard (robots.txt). Identifies itself as CircaIndexer/0.1, mail alian@alianwebserver.com. Requests to the same server are delayed by 8 seconds. "It's not a bug, it's a feature!" This is a basic rule for limiting HTTP server load.
HTTP proxy support.
Builds the index in MySQL
Reads HTML and plain text
Several kinds of indexing : full, incremental, only on a particular server.
Documents not updated are not reindexed.
Every request for a file starts with an HTTP HEAD request to obtain information such as validity, last update, size, etc. The size of documents read can be restricted (ex: don't fetch documents > 5 MB), which helps on low-bandwidth connections or on computers that do not have much memory.
HTML template can be easily customized for your needs.
Admin functions available by browser interface or command-line.
Can index the different links found in a CGI (everything after name_of_file?)
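As an illustration of the Boolean query language listed among the search features, here is a minimal sketch of how a query string might be split into optional, required and excluded terms. It is not Circa's search code: parse_query and matches are hypothetical helpers, and the rule that at least one optional word must be present when no "+" term is given is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Circa): classify the terms of a query.
# A bare word is optional ("or"), "+word" is required ("and"),
# "-word" is excluded ("not"). Spaces after the sign are tolerated.
sub parse_query {
    my ($query) = @_;
    my %terms = (optional => [], required => [], excluded => []);
    while ($query =~ /([+-]?)\s*(\w+)/g) {
        my $kind = $1 eq '+' ? 'required'
                 : $1 eq '-' ? 'excluded'
                 :             'optional';
        push @{ $terms{$kind} }, lc $2;
    }
    return \%terms;
}

# Hypothetical helper: does a document (hash of word => weight) match?
sub matches {
    my ($terms, $doc) = @_;
    return 0 if grep { exists $doc->{$_} } @{ $terms->{excluded} };
    return 0 if grep { !exists $doc->{$_} } @{ $terms->{required} };
    return 1 if @{ $terms->{required} };
    # Assumption: with no required term, one optional word must be present.
    return scalar(grep { exists $doc->{$_} } @{ $terms->{optional} }) ? 1 : 0;
}

my $terms = parse_query('perl + faq -cgi');
print "required: @{ $terms->{required} }\n";   # required: faq
print "optional: @{ $terms->{optional} }\n";   # optional: perl
print "excluded: @{ $terms->{excluded} }\n";   # excluded: cgi
print matches($terms, { faq => 3 }), "\n";             # 1
print matches($terms, { faq => 3, cgi => 1 }), "\n";   # 0
```

A word containing a hyphen (such as e-mail) would be split by this simple pattern, so a real parser would need a stricter tokenizer.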
How does it work?
Circa parses an HTML document and converts it to text. It counts every word found and stores the result in a hash keyed by word. In addition, it reads the title, keywords and description and adds the corresponding weight to every word found there.
Example:
my %ConfigMoteur=(
'author' => 'circa@alianwebserver.com', # Person responsible for the engine
'temporate' => 1, # Delay requests to the same server by 8s
'facteur_keyword' => 15, # <meta name="KeyWords"
'facteur_description' => 10, # <meta name="description"
'facteur_titre' => 10, # <title></title>
'facteur_full_text' => 1, # rest of the page
'facteur_url' => 15, # Words found in the url
'nb_min_mots' => 2, # minimum factor for keeping a word
'niveau_max' => 7, # Maximum level to index
'indexCgi' => 0, # Follow the different CGI links (ex: ?nom=toto&riri=eieiei)
);
<html>
<head>
<meta name="KeyWords"
CONTENT="informatique,computing,javascript,CGI,perl">
<meta name="Description" CONTENT="Rubriques Informatique (Internet,Java,Javascript, CGI, Perl)">
<title>Alian Web Server:Informatique,Société,Loisirs,Voyages,Expression</title>
</head>
<body>
different word: cgi, perl, cgi
</body>
</html>
After parsing, the hash contains:
$words{'informatique'} = 15 + 10 + 10 = 35
$words{'cgi'} = 15 + 10 +1
$words{'different'} = 1
Words are added to the database if their total is > $ConfigMoteur{'nb_min_mots'} (2 by default). If you set it to 1, the database will grow very quickly, but you will be able to run very exact searches with many words, such as phrase searches. If you do that, keep an eye on the size of the relation table.
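The weighting described above can be sketched in a few lines of plain Perl. This is only an illustration, not the module's own code: add_words is a hypothetical helper, the title is abridged, and the sketch assumes that every occurrence of a word adds its section weight (so a word repeated in the body is counted once per occurrence).

```perl
use strict;
use warnings;

# Weights copied from the %ConfigMoteur example.
my %factor = (keyword => 15, description => 10, title => 10, full_text => 1);
my $nb_min_mots = 2;

# Hypothetical helper: add $weight to every word of $text in %$words.
sub add_words {
    my ($words, $text, $weight) = @_;
    $words->{ lc $_ } += $weight for $text =~ /(\w+)/g;
    return $words;
}

my %words;
add_words(\%words, 'informatique,computing,javascript,CGI,perl', $factor{keyword});
add_words(\%words, 'Rubriques Informatique (Internet,Java,Javascript, CGI, Perl)',
          $factor{description});
add_words(\%words, 'Alian Web Server:Informatique', $factor{title});   # abridged title
add_words(\%words, 'different word: cgi, perl, cgi', $factor{full_text});

# Keep only the words whose total is greater than nb_min_mots.
my %kept = map { $_ => $words{$_} } grep { $words{$_} > $nb_min_mots } keys %words;

print "informatique => $words{informatique}\n";                            # 35
print "different kept? ", (exists $kept{different} ? 'yes' : 'no'), "\n";  # no
```

With nb_min_mots left at 2, words that only appear once or twice in the body text (weight 1 each) are dropped, which is what keeps the relation table small.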
After a page is read, Circa looks at the HTML links it contains, and so on for each linked page. Each time, the level increases by one. A url is added only if its level is < $ConfigMoteur{'niveau_max'}.
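The level rule can be pictured with a small breadth-first walk over an in-memory link graph. Everything here is hypothetical (the %links_of data, the variable names, and the choice to compare the level of the new link rather than that of the current page); it only shows how a depth limit stops links from being added.

```perl
use strict;
use warnings;

my $niveau_max = 2;

# Hypothetical link graph: page => links found on that page.
my %links_of = (
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1'],
    '/a/1' => ['/a/1/x'],
    '/b'   => [],
);

my %level = ('/' => 0);   # level of every url added so far
my @queue = ('/');

while (defined(my $url = shift @queue)) {
    my $next = $level{$url} + 1;          # links are one level deeper
    next unless $next < $niveau_max;      # too deep: do not add the links
    for my $link (@{ $links_of{$url} || [] }) {
        next if exists $level{$link};     # already known
        $level{$link} = $next;
        push @queue, $link;
    }
}

print join(' ', sort keys %level), "\n";   # / /a /b
```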
Notes
To build the index on another server, use phpMyAdmin and the dump and import.cgi scripts
VERSION
$Revision: 1.11 $
Class Interface
Constructors and Instance Methods
- new [PARAMHASH]
-
You can use the following keys in PARAMHASH:
- author
-
Default: 'circa@alianwebserver.com', appears (as the agent) in the log files of the indexed web servers
- temporate
-
Default: 1, boolean. If true, Circa waits 8s between requests to the same server and LWP::RobotUA is used. Otherwise LWP::UserAgent is used; it is faster because it does not request and parse robots.txt rules, but less clean, because a robot should always identify itself and avoid heavy server load.
- facteur_keyword
-
Default: 15, weight of words found in the KeyWords meta tag
- facteur_description
-
Default: 10, weight of words found in the description meta tag
- facteur_titre
-
Default: 10, weight of words found in <title></title>
- facteur_full_text
-
Default: 1, weight of words found in the rest of the page
- facteur_url
-
Default: 15, weight of words found in the url
- nb_min_mots
-
Default: 2, minimum number of times a word must be found to be added
- niveau_max
-
Default: 7, maximum number of levels of links to follow
- indexCgi
-
Default: 0, whether or not to follow CGI links (ex: ?nom=toto&riri=eieiei)
-
- size_max($size)
-
Get or set the maximum size of files read by the indexer (to avoid memory problems).
- port_mysql($port)
-
Get or set the MySQL port
- host_indexed($host)
-
Get or set the host indexed.
- set_host_indexed($url)
-
Set the base directory from $url. It is used to restrict indexing to files found in sub-directories on this server.
- proxy($adr_proxy)
-
Get or set the proxy for LWP::RobotUA or LWP::UserAgent
Ex: $circa->proxy('http://proxy.sn.no:8001/');
- prefix_table
-
Get or set the prefix for table names, so that Circa can be used more than once on the same database
- connect_mysql($user,$password,$db,$server)
-
$user : MySQL user
$password : MySQL password
$db : MySQL database
$server : IP address of the MySQL server
Connect Circa to MySQL. Returns 1 on success, 0 otherwise
- close_connect
-
Close connection to MySQL
Methods used for global administration
- addSite($url,$email,$titre,$categorieAuto,$cgi,$rep,$file);
-
Add the site with url $url, whose manager's mail address is $email, to the Circa database.
Creates an account for url $url and returns the id of the account created.
- addLocalSite($url,$email,$titre,$local_url,$path,$urlRacine,$categorieAuto,$cgi,$rep,$file);
-
Add a local $url
- updateUrl($compte,$id,$url,$urllocal,$titre,$description,$langue, $categorie,$browse_categorie,$parse,$valide,$niveau,$last_check,$last_update)
-
Update url $id on table $prefix.$compte.links
- parse_new_url($idp)
-
Parse the pages that have just been added. The program analyses every page whose 'parse' column equals 0.
Returns the number of pages analysed, the number of pages added and the number of words indexed.
- update($xj,[$idp])
-
Update the urls not visited for $xj days for account $idp. If $idp is not given, 1 is used. Urls that were never parsed are indexed.
Returns ($nb,$nbAjout,$nbWords,$nbWordsGood)
$nb: Number of links found
$nbAjout: Number of links added
$nbWords: Number of words found
$nbWordsGood: Number of words added
- create_table_circa
-
Create the tables needed by Circa:
categorie : Site categories
links : List of urls
responsable : Link to the person responsible for each link
relations : List of words / ids of indexed sites
inscription : Temporary registrations
- drop_table_circa
-
Drop all Circa tables. Be careful!
- drop_table_circa_id
-
Drop the Circa tables for user id
- create_table_circa_id($id)
-
Create tables needed by Circa for instance $id:
categorie : Site categories
links : List of urls
relations : List of words / ids of indexed sites
stats : List of queries
- export($mysqldump)
-
Export data from MySQL to circa.sql
$mysqldump : path to the mysqldump binary. If not given, it is searched for in /usr/bin/mysqldump, /usr/local/bin/mysqldump, /opt/bin/mysqldump
- import_data($mysql)
-
Import data into MySQL from circa.sql
$mysql : path to the mysql binary. If not given, it is searched for in /usr/bin/mysql, /usr/local/bin/mysql, /opt/bin/mysql
Methods for administering each account
- admin_compte($compte)
-
Return a list of items about account $compte:
$responsable : Mail address of the person responsible
$titre : Title of the site for this account
$nb_page : Number of urls added to Circa for this site
$nb_words : Number of words indexed
$last_index : Date of the last indexing
$nb_requetes : Number of queries run on this site
$racine : First page added
- most_popular_word($max,$id)
-
Return a reference to a hash holding the $max most frequent words in the database of manager $id
- stat_request($id)
-
Return some statistics about the requests made on Circa
- delete_url($compte,$id_url)
-
Delete the url with id $id_url for account $compte.
Removes link $id_url from tables $compte/relation and $compte/links
- valide_url($compte,$id_url)
-
Validate link $id_url in table $compte/links
- masque_categorie($compte,$id,$file)
-
Use a different template (masque) for browsing this category
- delete_categorie($compte,$id)
-
Delete category $id for the account of manager $compte, along with all the links and relations in this category
- rename_categorie($compte,$id,$nom)
-
Rename category $id for account $compte to $nom
- deplace_categorie($compte,$id1,$id2)
-
Move the urls for account $compte from category $id1 to category $id2
- add_site($url,[$idMan],[$local_url],[$browse_categorie],[$niveau],[$categorie])
-
Add a site to table links.
$url : Url of the page to add
$idMan : Id, in table responsable, of the person responsible for this site. If absent, set to 1.
$local_url : Url reachable through file:// for documents that can be indexed locally
$browse_categorie : 0 or 1 (whether or not it appears when browsing by category). If absent, 0.
$niveau : Indexing depth for this document. If absent, set to 0.
$categorie : Category of this url. If absent, set to 0.
If an error occurs, $DBI::errstr is set and 0 is returned. Otherwise 1 is returned.
- inscription($email,$url,$titre)
-
Register a site in a temporary table
HTML functions
- start_classic_html
-
Print the start of the document (<head></head>)
- header_compte
-
Function used with the admin_compte.cgi CGI. Displays the list of features of admin_compte.cgi
- get_liste_liens($id)
-
Return a buffer containing a select tag initialised with the data of table links for manager $id
- get_liste_liens_a_valider($id)
-
Return a buffer containing a select tag initialised with the data of table links for manager $id, restricted to links not yet validated
- get_liste_site
-
Return a buffer containing a select tag initialised with the data of table responsable
- get_liste_categorie($id,$cgi)
-
Return two references, to a list and to a hash. The hash has category names as keys and the number of sites in each category as values. The list holds the ordered keys of the hash.
- get_liste_mot($compte,$id)
-
Give the words indexed for url $id in table $prefix.$compte.links. Returns a buffer with the words separated by spaces.
- fill_template($masque,$vars)
-
$masque : Path of the template
$vars : reference to the hash of names/values to substitute in the template
Return the template with its variables replaced. Ex: if $$vars{age}=12 and the file $masque contains the string:
I am <? $age ?> years old,
the function will return
I am 12 years old,
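The substitution described for fill_template can be approximated with a single regular expression. The sketch below is an assumption about the behaviour, not the module's implementation: fill_template_sketch is a hypothetical name, it works on a string instead of a file path, and it replaces unknown variables with an empty string.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fill_template: takes the template text itself
# (not a file path) and a reference to the hash of names/values.
sub fill_template_sketch {
    my ($template, $vars) = @_;
    # Replace every <? $name ?> marker by the value of $vars->{name}.
    $template =~ s{<\?\s*\$(\w+)\s*\?>}{ defined $vars->{$1} ? $vars->{$1} : '' }ge;
    return $template;
}

print fill_template_sketch('I am <? $age ?> years old,', { age => 12 }), "\n";
# I am 12 years old,
```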
Private methods
- look_at ($url,$idc,$idr,$lastModif,$url_local,$categorieAuto,$niveau,$categorie)
-
Index a url. The work done is:
Test whether the url is valid. Return -1 otherwise
Fetch the page and add each word found, with the weights set in the constructor.
If the maximum level of links is not reached, add each link found for the next indexing
Parameters:
$url : Url to read
$idc : Id of the url in table links
$idr : Id of the account owning the url
$lastModif (optional) : If this parameter is set, Circa does no work on this page if it is older than this date.
$url_local (optional) : Local url to reach the file
$categorieAuto (optional) : If $categorieAuto is true, Circa creates/sets the category of the url from its directory path. Ex:
http://www.alianwebserver.com/societe/stvalentin/index.html will create and set the category for this url to Societe / StValentin
If $categorieAuto is false, $categorie is used.
$niveau (optional) : Level of the current link.
$categorie (optional) : See $categorieAuto.
Returns (-1,0) if the url isn't valid, otherwise the number of words and the number of links found
- get_meta($entete)
-
Parse and return the meta keywords and the meta description of the HTML page contained in $entete
- analyse_data($data,$facteur,%l)
-
Extract each word from buffer $data and assign it an occurrence frequency. The results are stored in the associative array passed as a parameter, in the form %l=('word'=>factor).
$data : buffer to analyse
$facteur : factor to assign to each word found
%l : associative array where the result is stored
Returns the reference to the hash
- getParent($id,%tab)
-
Return the string for category $id together with its parent categories
- set_agent
-
Set user agent for Circa robot. If local url (file://), LWP::UserAgent will be used. Else LWP::RobotUA is used.
- check_links($tag,$links)
-
Check whether url $links should be added to Circa. The url must begin with $self->host_indexed, and its extension must not be doc,zip,ps,gif,jpg,gz,pdf,eps,png,deb,xls,ppt,class,GIF,css,js,wav,mid.
If $links is accepted, return the url. Otherwise return 0.
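The two tests named for check_links (host prefix and excluded extension) can be sketched as below. This is a hedged illustration only: check_link_sketch is a hypothetical function with its own signature (host and url instead of $tag and $links), and it assumes the extension comparison is case-sensitive, since GIF is listed separately from gif.

```perl
use strict;
use warnings;

# Extensions that are not indexed, as listed above.
my %skip = map { $_ => 1 }
    qw(doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid);

# Hypothetical helper: return the url if it may be added, 0 otherwise.
sub check_link_sketch {
    my ($host, $url) = @_;
    return 0 unless index($url, $host) == 0;      # must begin with the host
    (my $path = $url) =~ s/[?#].*//s;             # ignore query string, fragment
    return 0 if $path =~ /\.([^.\/]+)$/ && $skip{$1};
    return $url;
}

my $host = 'http://www.alianwebserver.com/';
print check_link_sketch($host, "${host}informatique/index.html"), "\n";
print check_link_sketch($host, "${host}images/logo.gif"), "\n";        # 0
print check_link_sketch($host, 'http://other.example/index.html'), "\n"; # 0
```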
- get_first($requete)
-
Return the first row of the result of query $requete as an array
AUTHOR
Alain BARBET alian@alianwebserver.com