NAME

HTML::WikiConverter - An HTML to wiki markup converter

SYNOPSIS

  use HTML::WikiConverter;
  my $wc = HTML::WikiConverter->new( dialect => 'MediaWiki' );
  print $wc->html2wiki( $html );

DESCRIPTION

HTML::WikiConverter is an HTML to wiki converter. It can convert HTML source into a variety of wiki markups, called wiki "dialects".

METHODS

new
  my $wc = HTML::WikiConverter->new( dialect => $dialect, %attrs );

Returns a converter for the specified dialect. Dies if $dialect is not provided or is not installed on your system. (See "Supported dialects" for a list of supported dialects.) Additional parameters are optional and can be included in %attrs:

  base_uri
    URI to use for converting relative URIs to absolute ones

  wiki_uri
    URI used in determining which links are wiki links. For example,
    the English Wikipedia would use 'http://en.wikipedia.org/wiki/'

  wrap_in_html
    Helps HTML::TreeBuilder parse HTML fragments by wrapping HTML
    in <html> and </html> before passing it through html2wiki()

html2wiki
  my $wiki = $wc->html2wiki( $html );

Converts the HTML source into wiki markup for the current dialect.
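
As a rough sketch of how the constructor attributes and html2wiki fit together, the snippet below builds a converter and converts a small fragment. The URIs and the sample HTML are placeholder values chosen for illustration. Because new dies when the dialect is not installed, the calls are wrapped in eval.

```perl
use strict;
use warnings;

# Sample input; any HTML fragment would do.
my $html = '<p>Hello <b>world</b>, see <a href="/wiki/Perl">Perl</a>.</p>';

# Constructor attributes described above; the URIs are placeholder values.
my %attrs = (
  dialect      => 'MediaWiki',
  base_uri     => 'http://en.wikipedia.org/',
  wiki_uri     => 'http://en.wikipedia.org/wiki/',
  wrap_in_html => 1,
);

# new() dies if the dialect is missing, so guard both the load and the call.
my $wiki = eval {
  require HTML::WikiConverter;
  my $wc = HTML::WikiConverter->new(%attrs);
  $wc->html2wiki($html);
};

if ( defined $wiki ) {
  print $wiki, "\n";
}
else {
  warn "Conversion unavailable: $@";
}
```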

parsed_html
  my $html = $wc->parsed_html;

Returns the HTML representation of the last-parsed syntax tree. Use this to see how your input HTML was parsed internally, which is often useful for debugging.

base_uri
  my $base_uri = $wc->base_uri;
  $wc->base_uri( $new_base_uri );

Gets or sets the base_uri option used for converting relative to absolute URIs.

wiki_uri
  my $wiki_uri = $wc->wiki_uri;
  $wc->wiki_uri( $new_wiki_uri );

Gets or sets the wiki_uri option used for determining which links are links to wiki pages.

wrap_in_html
  my $wrap_in_html = $wc->wrap_in_html;
  $wc->wrap_in_html( $new_wrap_in_html );

Gets or sets the wrap_in_html option used to help HTML::TreeBuilder parse (broken) fragments of HTML that aren't contained within a parent element. For example, the following HTML fragment causes trouble:

  Hello<br> goodbye.

This is parsed by HTML::TreeBuilder as:

  <html>
    <head>
    </head>
    <body>
      <p><~text text="Hello"></~text><br>
    </body>
  </html>

Note that the string " goodbye" is missing. This can be resolved by wrapping the HTML fragment in a parent element. In many cases a <p> tag is appropriate, but in the general case <html> is preferred: it has no meaning to wiki dialects and therefore has very little chance of interfering with HTML-to-wiki conversion.

UTILITY METHODS

These methods are for use only by dialect modules.

get_elem_contents
  my $wiki = $wc->get_elem_contents( $node );

Converts the contents of $node (i.e. its children) into wiki markup and returns the resulting wiki markup.

get_wiki_page
  my $title = $wc->get_wiki_page( $url );

Attempts to extract the title of a wiki page from the given URL, returning the title on success or undef on failure. If wiki_uri is empty, this method always returns undef. Assumes that URLs to wiki pages are constructed using <wiki-uri><page-name>.
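
The <wiki-uri><page-name> convention can be illustrated with a standalone helper. This is only a sketch that assumes plain prefix matching. The name page_title_from_url is hypothetical, and the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the page name if $url starts with $wiki_uri,
# and nothing (undef in scalar context) otherwise.
sub page_title_from_url {
  my ( $wiki_uri, $url ) = @_;
  return unless defined $wiki_uri && length $wiki_uri;   # empty wiki_uri: always fail
  return unless defined $url && index( $url, $wiki_uri ) == 0;
  my $title = substr $url, length $wiki_uri;
  return length $title ? $title : ();
}

my $wiki_uri = 'http://en.wikipedia.org/wiki/';
my $title    = page_title_from_url( $wiki_uri, 'http://en.wikipedia.org/wiki/Perl' );
print defined $title ? "wiki page: $title\n" : "not a wiki link\n";
```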

is_camel_case
  my $ok = $wc->is_camel_case( $str );

Returns true if $str is in CamelCase, false otherwise. CamelCase-ness is determined using the same rules as CGI::Kwiki's formatting module uses.

get_attr_str
  my $attr_str = $wc->get_attr_str( $node, @attrs );

Returns a string containing the specified attributes in the given node. The returned string is suitable for insertion into an HTML tag. For example, if $node refers to the HTML

  <span id="ht" class="head" onclick="editPage()">Header</span>

and @attrs contains "id" and "class", then get_attr_str will return 'id="ht" class="head"'.
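
The behavior described above can be approximated without an HTML::Element node by working on a plain hash of attributes. The helper name attr_str and the hash-based input are assumptions made for this sketch and are not part of the module's interface.

```perl
use strict;
use warnings;

# Hypothetical stand-in: $attr_of is a hashref of attribute name => value.
# Only the requested attributes that are actually present are emitted.
sub attr_str {
  my ( $attr_of, @names ) = @_;
  return join ' ',
    map  { sprintf '%s="%s"', $_, $attr_of->{$_} }
    grep { defined $attr_of->{$_} } @names;
}

my %span = ( id => 'ht', class => 'head', onclick => 'editPage()' );
print attr_str( \%span, qw/ id class / ), "\n";
```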

DIALECTS

HTML::WikiConverter can convert HTML into markup for a variety of wiki engines. The markup used by a particular engine is called a wiki markup dialect. Support is added for dialects by installing dialect modules which provide the rules for how HTML is converted into that dialect's wiki markup.

Dialect modules are registered in the HTML::WikiConverter:: namespace and are usually given names in CamelCase. For example, the rules for the MediaWiki dialect are provided in HTML::WikiConverter::MediaWiki, and those for PhpWiki are provided in HTML::WikiConverter::PhpWiki.

Supported dialects

HTML::WikiConverter supports conversions for the following dialects:

  DocuWiki
  Kwiki
  MediaWiki
  MoinMoin
  PhpWiki
  PmWiki
  UseMod

While under most conditions each will produce satisfactory wiki markup, the complete syntactic sugar of each dialect has not yet been implemented. Suggestions, especially in the form of patches, are very welcome.

Of these, the MediaWiki dialect is probably the most complete. I am a Wikipediholic, after all. :-)

Conversion rules

To interface with HTML::WikiConverter, dialect modules must define a single rules class method. It returns a reference to a hash of rules that specify how individual HTML elements are converted to wiki markup. The following rules are recognized:

  start
  end

  preserve
  attributes
  empty

  replace
  alias

  block
  line_format
  line_prefix
  
  trim
  trim_leading
  trim_trailing

For example, the following rules method could be used for a wiki dialect that uses *asterisks* for bold and _underscores_ for italic text:

  sub rules {
    return {
      b => { start => '*', end => '*' },
      i => { start => '_', end => '_' }
    };
  }

To add <strong> and <em> as aliases of <b> and <i>, use the 'alias' rule:

  sub rules {
    return {
      b => { start => '*', end => '*' },
      strong => { alias => 'b' },

      i => { start => '_', end => '_' },
      em => { alias => 'i' }
    };
  }

(If you specify the 'alias' rule, no other rules are allowed.)

Many wiki dialects separate paragraphs and other block-level elements with a blank line. To indicate this, use the 'block' keyword:

  p => { block => 1 }

(Note that if a block-level element is nested inside another block-level element, blank lines are only added to the outermost block-level element.)

However, many such wiki engines require that the text of a paragraph be contained on a single line of text, or that a paragraph not contain any blank lines. These formatting options can be specified using the 'line_format' keyword, which can be assigned the value 'single', 'multi', or 'blocks'.

If the element must be contained on a single line, then the 'line_format' option should be 'single'. If the element can span multiple lines, but there can be no blank lines contained within, then it should be 'multi'. If blank lines (which delimit blocks) are allowed, then it should be 'blocks'. For example, paragraphs are specified like so in the MediaWiki dialect:

  p => { block => 1, line_format => 'multi', trim => 1 }

The 'trim' option indicates that leading and trailing whitespace should be stripped from the paragraph before other rules are processed. You can use 'trim_leading' and 'trim_trailing' if you only want whitespace trimmed from one end of the content.

Some multi-line elements require that each line of output be prefixed with a particular string. For example, preformatted text in the MediaWiki dialect is prefixed with one or more spaces. This is specified using the 'line_prefix' option:

  pre => { block => 1, line_prefix => ' ' }
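
The effect of 'line_prefix' can be pictured with a small standalone helper. This is a sketch of the idea only. The name prefix_lines is hypothetical, and the module's own handling may differ in details such as trailing newlines.

```perl
use strict;
use warnings;

# Hypothetical helper: put $prefix at the start of every line of $text.
sub prefix_lines {
  my ( $text, $prefix ) = @_;
  ( my $out = $text ) =~ s/^/$prefix/mg;   # /m makes ^ match at each line start
  return $out;
}

print prefix_lines( "first line\nsecond line", ' ' ), "\n";
```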

In some cases, conversion from HTML to wiki markup is as simple as string replacement. When you want to replace a tag and its contents with a particular string, use the 'replace' option. For example, in the PhpWiki dialect, three percent signs '%%%' represent a line break <br>, hence the rule:

  br => { replace => '%%%' }

(If you specify the 'replace' option, no other options are allowed.)

Finally, many wiki dialects allow a subset of HTML in their markup, such as for superscripts, subscripts, and text centering. HTML tags may be preserved using the 'preserve' option. For example, to allow the <font> tag in wiki markup, one might say:

  font => { preserve => 1 }

(The 'preserve' rule cannot be combined with the 'start' or 'end' rules.)

Preserved tags may also specify a whitelist of attributes that are allowed to pass through from HTML to wiki markup. This is done with the 'attributes' option:

  font => { preserve => 1, attributes => [ qw/ face size / ] }

(The 'attributes' rule must be used in conjunction with the 'preserve' rule.)

Some HTML elements have no content (e.g. line breaks), and should be preserved specially. To indicate that a preserved tag should have no content, use the 'empty' rule. This will cause the element to be replaced with "<tag />", which has no end tag but keeps any attributes you specified. For example, the MediaWiki dialect handles line breaks like so:

  br => {
    preserve => 1,
    attributes => [ qw/ id class title style clear / ],
    empty => 1
  }

This will convert, e.g., "<br clear='both'>" into "<br clear='both' />". Without specifying the 'empty' rule, this would be converted into the undesirable "<br clear='both'></br>".

(The 'empty' rule requires that 'preserve' is also specified.)
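
Putting the rules above together, a dialect module might look roughly like the following. The package name ExampleWiki and the 'hr' and 'sup' rules are invented for illustration. Any base-class or registration boilerplate that your installed version requires is omitted from this sketch.

```perl
package HTML::WikiConverter::ExampleWiki;   # hypothetical dialect name
use strict;
use warnings;

# Class method returning a reference to the hash of conversion rules.
sub rules {
  return {
    b      => { start => '*', end => '*' },
    strong => { alias => 'b' },              # 'alias' must stand alone
    i      => { start => '_', end => '_' },
    em     => { alias => 'i' },

    p      => { block => 1, line_format => 'multi', trim => 1 },
    pre    => { block => 1, line_prefix => ' ' },
    hr     => { replace => "\n----\n" },     # 'replace' must stand alone (invented rule)

    sup    => { preserve => 1 },             # passed through as HTML (invented rule)
    br     => {
      preserve   => 1,
      attributes => [ qw/ id class title style clear / ],
      empty      => 1,
    },
  };
}

1;
```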

Dynamic rules

Instead of simple strings, you may use coderefs as option values for the 'start', 'end', 'replace', and 'line_prefix' rules. If you do, the code will be called with three arguments: 1) the current HTML::WikiConverter instance, 2) the current HTML::Element node, and 3) the rules for that node (as a hashref).

Specifying rules dynamically is often useful for handling nested elements. For example, the MoinMoin dialect uses the following rules for lists:

  ul => { line_format => 'multi', block => 1, line_prefix => '  ' }
  li => { start => \&_li_start, trim_leading => 1 }
  ol => { alias => 'ul' }

It then defines _li_start like so:

  sub _li_start {
    my( $wc, $node, $rules ) = @_;
    my $bullet = '';
    $bullet = '*'  if $node->parent->tag eq 'ul';
    $bullet = '1.' if $node->parent->tag eq 'ol';
    return "\n$bullet ";
  }

This ensures that every unordered list item is prefixed with '*' and every ordered list item is prefixed with '1.', per the MoinMoin markup. It also ensures that each list item is on a separate line and that there is a space between the prefix and the content of the list item.

Rule validation

Certain rule combinations are not allowed. For example, the 'replace' and 'alias' rules cannot be combined with any other rules, and 'attributes' can only be specified alongside 'preserve'. Invalid rule combinations will trigger an error when the dialect module is loaded.
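
The constraints stated in this document can be checked with a few lines of Perl. The function below is a hypothetical sketch of such a check, not the module's actual validation code. It returns a list of messages instead of raising an error so that the result is easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical checker: returns one message per invalid rule combination.
sub rule_errors {
  my ($rules) = @_;
  my @errors;
  for my $tag ( sort keys %$rules ) {
    my %r = %{ $rules->{$tag} };

    # 'alias' and 'replace' cannot be combined with any other rule.
    for my $solo (qw/ alias replace /) {
      push @errors, "$tag: '$solo' cannot be combined with other rules"
        if exists $r{$solo} && scalar( keys %r ) > 1;
    }

    # 'attributes' and 'empty' both require 'preserve'.
    for my $dep (qw/ attributes empty /) {
      push @errors, "$tag: '$dep' requires 'preserve'"
        if exists $r{$dep} && !$r{preserve};
    }

    # 'preserve' cannot be combined with 'start' or 'end'.
    push @errors, "$tag: 'preserve' cannot be combined with 'start' or 'end'"
      if $r{preserve} && ( exists $r{start} || exists $r{end} );
  }
  return @errors;
}

my @errors = rule_errors( {
  br   => { replace => '%%%', block => 1 },   # invalid: 'replace' with another rule
  font => { attributes => [ qw/ size / ] },   # invalid: no 'preserve'
  b    => { start => '*', end => '*' },       # valid
} );
print "$_\n" for @errors;
```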

Preprocessing

The first step in converting HTML source to wiki markup is to parse the HTML into a syntax tree using HTML::TreeBuilder. It is often useful for dialects to preprocess the tree prior to converting it into wiki markup. Dialects that elect to preprocess the tree do so by defining a preprocess_node class method, which will be called on each node of the tree (traversal is done in pre-order). The method receives three arguments: 1) the dialect's package name, 2) the current HTML::WikiConverter instance, and 3) the current HTML::Element node being traversed. It may modify the node or decide to ignore it. The return value of the preprocess_node method is not used.

Because they are so commonly needed, two preprocessing steps are automatically carried out by HTML::WikiConverter, regardless of the dialect: 1) relative URIs in images and links are converted to absolute URIs (based upon the 'base_uri' parameter), and 2) ignorable text (e.g. between </td> and <td>) is discarded.
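
As an illustration, a dialect could use preprocess_node to normalize the tree before conversion. The sketch below relies only on the tag and attr accessors of HTML::Element. The package name and the particular clean-ups (dropping inline styles, renaming <strike> to <s>) are assumptions chosen for the example.

```perl
package HTML::WikiConverter::ExampleWiki;   # hypothetical dialect name
use strict;
use warnings;

# Called once per node, in pre-order, before conversion starts.
sub preprocess_node {
  my ( $pkg, $wc, $node ) = @_;

  # HTML::Element removes an attribute when attr() is given undef as the value.
  $node->attr( style => undef );

  # Rename <strike> to <s> so that one rule can cover both tags.
  $node->tag('s') if $node->tag eq 'strike';

  return;   # the return value is not used
}

1;
```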

Postprocessing

Once the work of converting HTML is complete, it is sometimes useful to postprocess the resulting wiki markup. Postprocessing can be used to clean up whitespace, fix subtle bugs in the markup that can't otherwise be fixed during the original conversion, etc.

Dialects that want to postprocess the wiki markup should define a postprocess_output class method that will be called just before HTML::WikiConverter's output method returns to the client. The method will be passed three arguments: 1) the dialect's package name, 2) the current HTML::WikiConverter instance, and 3) a reference to the wiki markup. It may modify the wiki markup that the reference points to. The return value of postprocess_output is ignored.

For example, to replace each pair of consecutive line breaks with a pair of newlines, a dialect might implement this:

  sub postprocess_output {
    my( $pkg, $wc, $outref ) = @_;
    $$outref =~ s/<br>\s*<br>/\n\n/g;
  }

(This example assumes that HTML line breaks were replaced with <br> in the wiki markup.)
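
Because the markup is passed by reference, the method changes the caller's string in place. The following sketch extends the example with two whitespace clean-ups and calls the method directly to show the effect. The package name, the extra substitutions, and the undef passed in place of the converter instance are all assumptions made for the demonstration.

```perl
package HTML::WikiConverter::ExampleWiki;   # hypothetical dialect name
use strict;
use warnings;

sub postprocess_output {
  my ( $pkg, $wc, $outref ) = @_;
  $$outref =~ s/<br>\s*<br>/\n\n/g;   # paired line breaks become a blank line
  $$outref =~ s/[ \t]+$//mg;          # strip trailing spaces and tabs on each line
  $$outref =~ s/\n{3,}/\n\n/g;        # allow at most one blank line in a row
  return;                             # the return value is ignored
}

package main;

# Direct call for demonstration; the converter normally makes this call itself.
my $markup = "First<br> <br>Second   \n\n\n\nThird";
HTML::WikiConverter::ExampleWiki->postprocess_output( undef, \$markup );
print $markup, "\n";
```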

BUGS

Please report bugs using http://rt.cpan.org.

SEE ALSO

HTML::TreeBuilder, HTML::Element

AUTHOR

David J. Iberri <diberri@yahoo.com>

COPYRIGHT

Copyright (c) 2004-2005 David J. Iberri

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html