NAME

HTML::WikiConverter - An HTML to wiki markup converter

SYNOPSIS

  use HTML::WikiConverter;
  my $wc = HTML::WikiConverter->new( dialect => 'MediaWiki' );
  print $wc->html2wiki($html);

DESCRIPTION

HTML::WikiConverter is an HTML to wiki converter. It can convert HTML source into a variety of wiki markups, called wiki "dialects".

METHODS

$wc = HTML::WikiConverter->new( dialect => '...', [ %opts ] );

Returns a converter for the specified dialect. If 'dialect' is not provided or is not installed on your system, this method dies. Additional options are specified in %opts, and include:

  base_uri
    the URI to use for converting relative URIs to absolute ones

$base_uri = $wc->base_uri( [ $new_base_uri ] );

Gets or sets the 'base_uri' option used for converting relative to absolute URIs.

$wiki = $wc->html2wiki( $html );

Converts the HTML source into wiki markup for the current dialect.

$html = $wc->parsed_html;

Returns the HTML representation of the last-parsed syntax tree. Use this to see how your input HTML was parsed internally, which is useful for debugging.

UTILITY METHODS

$wiki = $wc->elem_contents( $node )

Converts the contents of $node into wiki markup.

DIALECTS

HTML::WikiConverter can convert HTML into markup for a variety of wiki engines. The markup used by a particular engine is called a wiki markup dialect. Support is added for dialects by installing dialect modules which provide the rules for how HTML is converted into that dialect's wiki markup.

Dialect modules are registered in the HTML::WikiConverter:: namespace and are usually given names in CamelCase. For example, the rules for the MediaWiki dialect are provided in HTML::WikiConverter::MediaWiki, and those for PhpWiki in HTML::WikiConverter::PhpWiki.
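
The naming convention above can be sketched in a few lines of core Perl. This is only an illustration: dialect_class() and dialect_is_installed() are hypothetical helpers, not part of HTML::WikiConverter, and the module may locate its dialects differently.

```perl
use strict;
use warnings;

# Hypothetical helper: map a dialect name to its module name.
sub dialect_class {
    my ($dialect) = @_;
    return "HTML::WikiConverter::$dialect";
}

# Hypothetical helper: true if the dialect module can be loaded.
sub dialect_is_installed {
    my ($dialect) = @_;
    return 0 unless $dialect =~ /\A\w+\z/;    # accept plain names only
    my $class = dialect_class($dialect);
    ( my $file = "$class.pm" ) =~ s{::}{/}g;  # Foo::Bar -> Foo/Bar.pm
    return eval { require $file; 1 } ? 1 : 0;
}

print dialect_class('PhpWiki'), "\n";
```

A check like dialect_is_installed() lets a caller test for a dialect up front instead of trapping the exception thrown by the constructor.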

Supported dialects

  MediaWiki
  MoinMoin
  PhpWiki
  Kwiki

Rules

To interface with HTML::WikiConverter, dialect modules must define a single rules() class method. It returns a reference to a hash of rules that specify how individual HTML elements are converted to wiki markup. For example, the following rules() method could be used for a wiki dialect that uses *asterisks* for bold and _underscores_ for italic text:

  sub rules {
    return {
      b => { start => '*', end => '*' },
      i => { start => '_', end => '_' }
    };
  }

It is sometimes useful to define tags as aliases, for example to treat <strong> and <b> the same. For that, use the 'alias' keyword:

  sub rules {
    return {
      b => { start => '*', end => '*' },
      i => { start => '_', end => '_' },

      strong => { alias => 'b' },
      em => { alias => 'i' }
    };
  }

(Note that if you specify the 'alias' option, no other options are allowed.)
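
To make the alias mechanism concrete, here is a minimal standalone sketch of how an alias could be resolved to its target rule. resolve_rule() and wrap() are hypothetical helpers written for this illustration; they do not reflect how HTML::WikiConverter resolves aliases internally.

```perl
use strict;
use warnings;

my $rules = {
    b      => { start => '*', end => '*' },
    i      => { start => '_', end => '_' },
    strong => { alias => 'b' },
    em     => { alias => 'i' },
};

# Hypothetical helper: follow 'alias' entries until a concrete rule is found.
sub resolve_rule {
    my ($rules, $tag) = @_;
    my %seen;
    while ( my $rule = $rules->{$tag} ) {
        return $rule unless exists $rule->{alias};
        die "alias loop at <$tag>\n" if $seen{$tag}++;
        $tag = $rule->{alias};
    }
    return;    # no rule for this tag
}

# Wrap text in the resolved rule's start and end strings.
sub wrap {
    my ($rules, $tag, $text) = @_;
    my $rule = resolve_rule($rules, $tag) or return $text;
    return ( $rule->{start} // '' ) . $text . ( $rule->{end} // '' );
}

print wrap($rules, 'strong', 'bold'), "\n";    # prints *bold*
```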

Many wiki dialects separate paragraphs and other block-level elements with a blank line. To indicate this, use the 'block' keyword:

  p => { block => 1 }

However, many such wiki engines require that the text of a paragraph be contained on a single line, or that a paragraph contain no blank lines. These formatting constraints can be specified using the 'line_format' keyword, which can be assigned the value 'single', 'multi', or 'blocks'.

If the element must be contained on a single line, then the 'line_format' option should be 'single'. If the element can span multiple lines, but there can be no blank lines contained within, then it should be 'multi'. If blank lines (which delimit blocks) are allowed, then it should be 'blocks'. For example, paragraphs are specified like so in the MediaWiki dialect:

  p => { block => 1, line_format => 'multi', trim => 1 }

The 'trim' option indicates that leading and trailing whitespace should be stripped from the paragraph before other rules are processed. You can use 'trim_leading' and 'trim_trailing' if you only want whitespace trimmed from one end of the content.
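
The effect of 'trim' and 'line_format' can be approximated with a few substitutions. The following is a sketch under stated assumptions: format_lines() is a hypothetical helper, and the exact whitespace handling in HTML::WikiConverter may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: normalise $text according to a rule's
# 'trim' and 'line_format' options.
sub format_lines {
    my ($text, $rule) = @_;

    if ( $rule->{trim} ) {
        $text =~ s/\A\s+//;    # leading whitespace
        $text =~ s/\s+\z//;    # trailing whitespace
    }

    my $format = $rule->{line_format} // 'blocks';
    if ( $format eq 'single' ) {
        $text =~ s/\s*\n\s*/ /g;           # join everything onto one line
    }
    elsif ( $format eq 'multi' ) {
        $text =~ s/\n(?:[ \t]*\n)+/\n/g;   # remove blank lines
    }
    # 'blocks': blank lines are allowed, so the text is left alone

    return $text;
}

my $p = { block => 1, line_format => 'multi', trim => 1 };
print format_lines( "  one\n\n\ntwo\nthree \n", $p ), "\n";
```

With the MediaWiki-style paragraph rule shown, the three input lines come out on consecutive lines with no blank line between them.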

Some multi-line elements require that each line of output be prefixed with a particular string. For example, preformatted text in the MediaWiki dialect is prefixed with one or more spaces. This is specified using the 'line_prefix' option:

  pre => { block => 1, line_prefix => ' ' }
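
As an illustration of what 'line_prefix' implies for the output, the sketch below prefixes every line of a block of text. prefix_lines() is a hypothetical helper for this example only.

```perl
use strict;
use warnings;

# Hypothetical helper: put $prefix in front of every line of $text.
sub prefix_lines {
    my ($text, $prefix) = @_;
    # The -1 limit keeps trailing empty lines instead of dropping them.
    return join "\n", map { $prefix . $_ } split /\n/, $text, -1;
}

print prefix_lines( "my \$x = 1;\nprint \$x;", ' ' ), "\n";
```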

In some cases, conversion from HTML to wiki markup is as simple as replacing an element with a particular string. This is done with the 'replace' option. For example, in the PhpWiki dialect, three percent signs '%%%' represent a line break <br>:

  br => { replace => '%%%' }

(Note that if you specify the 'replace' option, no other options are allowed.)

Finally, many (if not all) wiki dialects allow a subset of HTML in their markup, such as for superscripts, subscripts, and text centering. HTML tags may be preserved using the 'preserve' option. For example, to allow the <font> tag in wiki markup, one might say:

  font => { preserve => 1 }

Preserved tags may also specify a whitelist of attributes that are allowed to pass through from HTML to wiki markup. This is done with the 'attributes' option:

  font => { preserve => 1, attributes => [ qw/ font size / ] }
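
The whitelist behaviour can be sketched as follows. preserved_start_tag() is a hypothetical helper, the attribute hash stands in for a parsed element's attributes, and attribute values are not escaped here for brevity.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild an opening tag, keeping only the
# attributes named in the rule's 'attributes' whitelist.
sub preserved_start_tag {
    my ($tag, $attrs, $rule) = @_;
    return '' unless $rule->{preserve};

    my @pairs;
    for my $name ( @{ $rule->{attributes} || [] } ) {
        next unless defined $attrs->{$name};
        push @pairs, sprintf '%s="%s"', $name, $attrs->{$name};
    }
    return '<' . join( ' ', $tag, @pairs ) . '>';
}

my $rule = { preserve => 1, attributes => [ qw/ font size / ] };
print preserved_start_tag( 'font', { size => '+1', onclick => 'x()' }, $rule ), "\n";
```

Only 'size' survives in this example, because 'onclick' is not on the whitelist.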

Dynamic rules

Instead of simple strings, you may use coderefs as option values for the 'start', 'end', 'replace', and 'line_prefix' rules. If you do, the code will be called with three arguments: 1) the current HTML::WikiConverter instance, 2) the current HTML::Element node, and 3) the rules for that node (as a hashref).

Specifying rules dynamically is often useful for handling nested elements.
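
For instance, a list item's start string might depend on how deeply the list is nested. The sketch below shows the calling convention only: option_value() is a hypothetical dispatcher, and the node is a plain hash with an assumed 'depth' field standing in for a real HTML::Element object.

```perl
use strict;
use warnings;

# Hypothetical dispatcher: an option value is either a plain string or a
# coderef called with the converter, the node, and the node's rules.
sub option_value {
    my ($wc, $node, $rules, $option) = @_;
    my $value = $rules->{$option};
    return ref($value) eq 'CODE' ? $value->( $wc, $node, $rules ) : $value;
}

my $li_rule = {
    # One asterisk per level of list nesting, followed by a space.
    start => sub {
        my ($wc, $node, $rules) = @_;
        return ( '*' x $node->{depth} ) . ' ';
    },
    end => "\n",
};

# Stand-in for an HTML::Element node; 'depth' is an assumed field.
my $node = { tag => 'li', depth => 2 };
print option_value( undef, $node, $li_rule, 'start' ), 'item',
      option_value( undef, $node, $li_rule, 'end' );
```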

Preprocessing

The first step in converting HTML source to wiki markup is to parse the HTML into a syntax tree using HTML::TreeBuilder. It is often useful for dialects to preprocess the tree prior to converting it into wiki markup. Dialects that elect to preprocess the tree do so by defining a preprocess_node() class method, which will be called on each node of the tree (traversal is done in pre-order). The method receives three arguments: 1) the dialect's package name, 2) the current HTML::WikiConverter instance, and 3) the current HTML::Element node being traversed. It may modify the node or decide to ignore it. The return value of the preprocess_node() method is not used.
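
The traversal described above can be sketched without any dependencies. In this illustration the tree is built from plain hashes rather than HTML::Element objects, and My::Dialect and preprocess_tree() are hypothetical names; only the preprocess_node() signature is taken from the description.

```perl
use strict;
use warnings;

package My::Dialect;    # hypothetical dialect package

# Called once per node: rewrite <strong> to <b> before conversion.
sub preprocess_node {
    my ($pkg, $wc, $node) = @_;
    $node->{tag} = 'b' if $node->{tag} eq 'strong';
    return;    # the return value is not used
}

package main;

# Pre-order walk: visit the node first, then each of its children.
sub preprocess_tree {
    my ($pkg, $wc, $node) = @_;
    $pkg->preprocess_node( $wc, $node ) if $pkg->can('preprocess_node');
    preprocess_tree( $pkg, $wc, $_ ) for @{ $node->{children} || [] };
    return;
}

my $tree = { tag => 'p', children => [ { tag => 'strong' }, { tag => 'i' } ] };
preprocess_tree( 'My::Dialect', undef, $tree );
print join( ' ', map { $_->{tag} } @{ $tree->{children} } ), "\n";
```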

Because they are so commonly needed, two preprocessing steps are automatically carried out by HTML::WikiConverter, regardless of the current dialect: 1) relative URIs are converted to absolute URIs (based upon the 'base_uri' parameter), and 2) ignorable content (e.g. between </td> and <td>) is discarded.

SEE ALSO

  HTML::TreeBuilder
  HTML::Element

AUTHOR

David J. Iberri <diberri@yahoo.com>

COPYRIGHT

Copyright (c) 2004-2005 David J. Iberri

This library is free software; you may redistribute it and/or modify it under the same terms as Perl itself.