NAME
wikipedia2alvis.pl - Wikipedia XML dump to Alvis XML converter
SYNOPSIS
wikipedia2alvis.pl [options] [Wikipedia XML dump file]
Options:
--out-dir                         output directory
--namespaces                      list of namespaces to extract
--N-per-out-dir                   # of records per output directory
--[no-]original                   include original document?
--[no-]expand-templates-fully     do we try to expand templates fully?
--[no-]dump-templates             do we dump the templates?
--template-dump-file              the file to dump the templates to
--[no-]convert-via-html           do we convert via HTML or directly to Alvis?
--date                            the date of the Wikipedia dump
--[no-]dump-category-graph        do we dump the category graph?
--category-graph-dump-file        the file to dump the category graph to
--category-word                   category namespace identifier
--root-category                   root category identifier
--template-word                   template namespace identifier
--language                        the language of the Wikipedia dump
--help                            brief help message
--man                             full documentation
--[no]warnings                    warnings output flag
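As a rough illustration of the synopsis, the sketch below assembles one possible command line from the options listed above. The dump file name, the output directory and the option values are assumptions chosen for illustration only; the command is printed rather than executed so the sketch works even where the script is not installed.

```shell
# Hypothetical dump file name; substitute the path to your own dump.
dump_file="frwiki-pages-articles.xml"

# Assemble a command line from the documented options (values are illustrative).
cmd="wikipedia2alvis.pl --out-dir ./alvis --N-per-out-dir 500 --language fr --no-convert-via-html $dump_file"

# Print the command for inspection; run it directly once the script is on your PATH.
echo "$cmd"
```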
OPTIONS
- --out-dir
  Sets the output directory. Default value: '.'.
- --namespaces
  Sets the namespaces whose records are extracted, given as a ','-separated list. The namespace names have to be the exact identifiers. Articles are always extracted. Default value: '' (the empty list), i.e. articles only.
- --N-per-out-dir
  Sets the number of records per output directory. Default value: 1000.
- --[no-]original
  Whether to include the original document in the output. Default value: no.
- --[no-]expand-templates-fully
  Whether to try to expand templates fully, or simply to insert a list of the template parameter values given in the call. Default value: no.
- --[no-]dump-templates
  Whether to dump the templates to disk in a loadable format. Default value: no.
- --template-dump-file
  The name of the template dump file, used if the templates are dumped. Default value: 'Templates.storable'.
- --[no-]convert-via-html
  Whether to convert from Wikitext to Alvis XML via an intermediate HTML version, which is slower but may give better quality, or to convert directly. Default value: yes.
- --language
  The language of the Wikipedia dump. Affects category and template extraction. Possible values: 'en' (English), 'fr' (French), 'sl' (Slovenian). Default value: 'en'.
- --category-word
  The identifier for the category namespace. Overridden by '--language'. Default value: 'Category'.
- --root-category
  The identifier for the root category of the category graph. Overridden by '--language'. Default value: 'fundamental'.
- --template-word
  The identifier for the template namespace. Overridden by '--language'. Default value: 'Template'.
- --date
  The date of the Wikipedia dump as YYYYMMDD. Default value: undefined (meaning the current date is used).
- --[no-]dump-category-graph
  Whether to dump the category graph to disk in a loadable format. Default value: yes.
- --category-graph-dump-file
  The name of the category graph dump file, used if the category graph is dumped. Default value: 'CategoryGraph.storable'.
- --help
  Prints a brief help message and exits.
- --man
  Prints the manual page and exits.
- --[no]warnings
  Output (or suppress) warnings. Default value: yes.
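The value formats described above can be sketched in shell as follows. This is only an illustration: the variable names are hypothetical, and the namespace names are examples that would have to match the exact identifiers used in the dump at hand.

```shell
# Dump date in the YYYYMMDD form that --date expects (here: today's date).
dump_date=$(date +%Y%m%d)

# ','-separated namespace list for --namespaces (names are illustrative).
namespaces=$(printf '%s,' Category Template)
namespaces=${namespaces%,}   # drop the trailing comma

# Show the resulting option fragment.
echo "--date $dump_date --namespaces $namespaces"
```

Passing an explicit '--date' is mainly useful when the conversion runs on a later day than the one the dump was taken, since the default is the current date.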
DESCRIPTION
Converts the articles in the Wikipedia XML dump to Alvis records.
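Because the records are spread over output directories of '--N-per-out-dir' records each, a quick estimate of how many directories a run will need can be useful. The sketch below assumes that each directory is filled up to the limit before the next one is started, which this page does not state explicitly; the record count is a made-up figure.

```shell
# Hypothetical number of records to be converted.
total_records=2500

# Value passed to --N-per-out-dir (1000 is the documented default).
per_dir=1000

# Ceiling division: directories needed to hold all records.
dirs=$(( (total_records + per_dir - 1) / per_dir ))

echo "$dirs"
```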