Template-Plugin-Gettext

Localization for the Template Toolkit 2

Description

This Perl library offers an end-to-end localization and internationalization solution for the Template Toolkit 2. It consists of a plugin that offers translation functions inside templates and a string extractor xgettext-tt2 that extracts translatable strings from templates and writes them to PO files (or rather a .pot file in PO format). The string extractor xgettext-tt2 is fully customizable and also usable for other i18n plugins or frameworks for the Template Toolkit.

Usage

The solution offered by this library is suitable for templates that have a lot of markup (normally HTML) compared to text. If the files contain a lot of content, other solutions are probably more suitable. One of them is xml2po, especially if the input format is HTML.

If the input format is Markdown, for example for a static site generator, a feasible approach may be to simply split the input into paragraphs, and turn each paragraph into an entry of a PO file.

In the following, we will assume that you have decided to localize templates with this library.

Templates

The first step is to mark all translatable strings. The markers serve two purposes. First, they allow the extractor xgettext-tt2 to find the strings and write them into a translation file in PO format.

Second, the markers are also valid functions or filters for the Template Toolkit, and they interpolate the translations for these messages into the output when the template is rendered. As a result, your templates remain quite readable after localization.

In every template in which you want to use translations, you have to USE the plug-in:

[% USE gtx = Gettext('com.mydomain.www', 'fr') %]

Do not forget to USE the plug-in in all templates! The Template Toolkit will not warn you if you forget it, but the translation mechanism will not work!

The first argument is the so-called textdomain. This is the identifier for your message catalogs and also the basename of several files. In the example above, the translated message catalog would be searched as LOCALEDIR/fr/LC_MESSAGES/com.mydomain.www.mo. The second parameter is the language. This will normally come from a variable instead of a hard-coded string.

A possible third argument (omitted in the example) is the character set to use. All following arguments are additional directories to search first for translations.

The default list of directories is:

  • ./locale

  • /usr/share/locale

  • /usr/local/share/locale

The directory ./locale is relative to the current working directory from where you invoke the template processor.

Simple Translations With gettext()

The simplest and most common way of doing things is:

[% USE gtx = Gettext('com.mydomain.www', lang) %]

<title>[% gtx.gettext("World Of Themes") %]</title>
    
<h1>[% "Introduction" | gettext %]</h1>

<p>
[% FILTER gettext %]
The "World Of Themes" is the ultimate source of templates
for the Template Toolkit.
[% END %]
</p>

This shows three different ways of localizing strings. You can use the function gtx.gettext(), the filter gettext with pipe syntax, or the same filter with block syntax. The result is always the same. The string will be recognized as translatable by xgettext-tt2 and it will be translated into the selected language when rendering the template.

Interpolating Strings Into Translations

One important thing to understand is that the argument to the gettext functions or filters is the lookup key into the translation database, when the template gets rendered. That implies that this key has to be invariable and must not use any interpolated variables.

[% USE gtx = Gettext('com.mydomain.www', lang) %]

[% gtx.gettext("Hello, $firstname $lastname!") %]

This template code is syntactically correct and will also render correctly. But xgettext-tt2 will bail out on it with an error message like

   templates.html:3: Illegal variable interpolation at "$"

The function gettext() will receive the interpolated string as its argument, and that is not the same as the string that the extractor program xgettext-tt2 sees. And that means that the translation cannot be found.

The correct way to interpolate strings uses xgettext():

[% USE gtx = Gettext('com.mydomain.www', lang) %]

[% gtx.xgettext("Hello, {first} {last}!",
                first => firstname, last => lastname) %]
[% "Hello, {first} {last}!" | xgettext(first => firstname, 
                                       last => lastname) %]
[% FILTER xgettext(first => firstname, last => lastname) %]
Hello, {first} {last}!
[% END %]

One additional benefit of this is that the extractor program xgettext-tt2 will also mark these strings with the flag "perl-brace-format". When the translation from the .po file gets compiled into a .mo file, the compiler msgfmt checks that the translated strings contain exactly the same placeholders as the original.

One thing that you should also avoid is to assemble strings in the template source code. Do not:

[% gtx.gettext("Please contact") %] [% name %]
[% gtx.gettext("for help about the") %] [% package %]
[% gtx.gettext("software.") %]

This will result in three translatable text snippets "Please contact", "for help about the", and "software." that are hard to translate without context. Besides, it makes invalid assumptions about the word order in translated sentences. Instead, use xgettext() and write in complete sentences with placeholders.

By the way, the x in the function xgettext() stands for eXpand while the x in the program xgettext-tt2 or GNU Gettext's xgettext program stands for eXtract.

Plural Forms

Do not write this:

[% IF num != 1 %]
[% gtx.xgettext("{number} documents deleted!", number => num) %]
[% ELSE %]
[% gtx.gettext("One document deleted!") %]
[% END %]

This assumes that every language has one singular and one plural (and no other forms) and that the condition that selects the correct form is always COUNT != 1. But this is wrong for many languages, for example Russian (two plural forms), Chinese (no plural), French (different condition), and many more.

Write instead:

[% USE gtx = Gettext('com.mydomain.www', lang) %]

[% gtx.nxgettext("One document deleted.",
                 "{count} documents deleted.",
                 num,
                 count => num) %]

The function nxgettext() receives the singular and plural form as the first and second argument, followed by the number of items, followed by an arbitrary number of key/value pairs for interpolating variables in the strings.

There is also a function ngettext() that does not expand its first two arguments. You will find out that you almost never need that function.

You can also use nxgettext() and ngettext() as filters. But the necessary code is awkward, and their use is therefore not recommended.

Ambiguous Strings (message contexts)

Sometimes an English string has different meanings in other languages:

[% USE gtx = Gettext('com.mydomain.www', lang) %]

[% gtx.gettext("State:") %]
[% IF state == '1' %]
[% gtx.pgettext("state", "Open") %]
[% ELSE %]
[% gtx.gettext("Closed") %]
[% END %]
<a href="/action/open">[% gtx.pgettext("action", "Open") %]</a>

The function pgettext() works like gettext() but has one extra argument preceding the string, the so-called message context. The string extractor xgettext-tt2 will now create two distinct messages "Open", one with the context "state", the other one with the context "action". The sole purpose of this context is to disambiguate the string "Open" for languages where the verb ("to open") and the adjective ("the door is open") have two distinct translations.

You will normally use this function when a translator asks you to do so, rather than on your own initiative.

There is also a function pxgettext() that supports placeholder interpolation, and npxgettext() that has the following semantics:

npxgettext(CONTEXT, SINGULAR, PLURAL, COUNT,
           KEY1 => VALUE1, KEY2 => VALUE2, ...)

More Esoteric Functions

The API documentation contains some more functions and filters that are available for completeness. You will never need them in normal projects.

Translator Hints

You can add comments to the source code that are copied into the .po file as hints for the translators. This will look like this:

[% USE gtx = Gettext('com.mydomain.www', lang) %]

<!-- TRANSLATORS: This is the day of the week! -->
[% gtx.gettext("Sun") %]

In order to make that work, you have to invoke the extractor program xgettext-tt2 like this:

   xgettext-tt2 --add-comments=TRANSLATORS: t1.html t2.html ...

Modifying Flags

In rare situations, you may need the following:

[% USE gtx = Gettext('com.mydomain.www', lang) %]

<!-- xgettext:no-perl-brace-format -->
[% gtx.xgettext("Value: {value}", value => whatever) %]

Normally, the argument of xgettext() will be flagged in the .po file with "perl-brace-format", and a translation will fail to compile if the translation does not contain exactly the same placeholders as the original does.

You can override that default behavior for individual messages by placing a comment containing the string "xgettext:" directly in front of the string.

Translation Workflow

The translation workflow is the standard workflow known from GNU Gettext. All files relevant for translations are conventionally kept in a subdirectory po.

You can save time if you use the seed project Template-Plugin-Gettext-Seed as a base. It contains a directory po that is ready for use and comes with either a Makefile or a script po-make.pl, whichever you prefer, that automates the entire translation workflow. It is also prepared for extracting strings from sources other than template files. In the seed project, these are Perl source files, but it works in a similar fashion for other programming languages.

But rolling your own version is also simple. Just read on.

Extracting Strings With xgettext-tt2

Extracting translatable strings from templates for the Template Toolkit 2 is as easy as:

$ xgettext-tt2 TEMPLATE....

This will scan all files given as arguments for translatable strings and create a file messages.po with the strings found.

A typical invocation of xgettext-tt2 is a little more sophisticated:

$ xgettext-tt2 --files-from=POTFILES \
    --output=com.mydomain.www.pot \
    --add-comments=TRANSLATORS: --from-code=utf-8 \
    --force-po

You can, of course, write everything in one line and omit the backslashes.

Specifying all input files as arguments on the command-line can quickly become unwieldy. It is more common to put the list of input files into a text file, each input file on one line, and instruct xgettext-tt2 to read it with the option --files-from. The name of the file is by convention POTFILES.
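
One way to create such a list is to let find collect the file names. The following is only a sketch: the function name make_potfiles, the directory name templates and the extension .html are assumptions made for this example, not requirements of xgettext-tt2. The paths written to POTFILES are relative to the current directory, so the sketch assumes that xgettext-tt2 is later run from that same directory.

```shell
# Hypothetical helper: write the names of all *.html files below the
# directory given as $1 to the file POTFILES, one name per line.
# The list is sorted so that it stays stable between runs.
make_potfiles() {
    find "$1" -type f -name '*.html' | LC_ALL=C sort > POTFILES
}

# Example:
# make_potfiles templates
```

Keeping the list in a file also makes it easy to review which templates take part in the translation.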

The output file is normally a file TEXTDOMAIN.pot, where TEXTDOMAIN is the identifier selected in the templates. The reverse hostname of the server serving the rendered templates is a good choice.

If you want to be able to give hints to translators in the source files, you have to specify the trigger string --- normally "TRANSLATORS:" --- with the option --add-comments. Specifying an empty string (--add-comments='') instructs xgettext-tt2 to copy all comments into the .pot file.

If your templates contain characters outside of US-ASCII, you should specify the character set of the template files with the option --from-code=CODESET.

The option --force-po instructs xgettext-tt2 to write an output file even if no translatable strings were found. But this is a matter of taste. Omit the option if you prefer.

xgettext-tt2 has a lot more options. They are mostly compatible with the ones of xgettext from GNU gettext for C, Perl, and a lot more languages. See the documentation for GNU Gettext's xgettext and the documentation for Locale::XGettext for more information.

By the way, why is the output file a .pot file and not a .po file? It is the template for the .po files for the individual languages. You never edit that file, but re-generate it whenever the source files have changed. Hence, it only contains strings in the original, in the base language.

Creating Translation Files

For each supported language (except for the base language) you should create a file LL.po, where LL is the two-letter language code for that language, for example fr.po, de.po, or it.po. You can also specify the combination of language and country like in de_DE.po or pt_BR.po.

One option for that is to simply copy the .pot file and edit the header accordingly. It is normally easier to do that with the program msginit:

$ msginit --input=com.mydomain.www.pot --locale=fr

Replace com.mydomain.www.pot with the name of your .pot file, and fr with the language in question. This will prefill a lot of fields in the .po file.

Compiling Translation Files

The translated .po files are compiled with the program msgfmt:

$ msgfmt --check --statistics --verbose -o fr.mo fr.po
fr.po: 212 translated messages, 1 fuzzy translation, 3 untranslated messages.

This will compile the translation file fr.po into a binary file fr.mo. It also checks the translations for formal errors and prints statistics about the number of translated and untranslated strings.
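
With several languages, the same command is usually run once per .po file. A minimal sketch, assuming that all LL.po files sit in one directory; the function name compile_catalogs is made up for this example:

```shell
# Hypothetical helper: compile every LL.po in the directory given as $1
# into LL.mo in the same directory, using the msgfmt options shown above.
compile_catalogs() {
    for po in "$1"/*.po; do
        # Without any .po file the pattern stays unexpanded; skip it.
        [ -e "$po" ] || continue
        msgfmt --check --statistics --verbose -o "${po%.po}.mo" "$po" ||
            return 1
    done
}

# Example:
# compile_catalogs po
```

The loop stops at the first catalog that msgfmt rejects, so a broken translation does not go unnoticed.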

Installing Translation Files

The plugin does not use .po files for looking up translations but the binary .mo files. But it has to find them.

You have to decide on one of the directories that Template::Plugin::Gettext searches for translations. The default order is:

  • @INC/LocaleData

  • /usr/share/locale

  • /usr/local/share/locale

The first line means that every directory LocaleData inside Perl's include directories is searched for translation files. Keep in mind that for security reasons the current directory (.) is nowadays often not in Perl's @INC.

Let's assume that /var/www/lib is in Perl's @INC. You would then install the French translation file fr.mo as /var/www/lib/LocaleData/fr/LC_MESSAGES/com.mydomain.www.mo. Here, com.mydomain.www is the textdomain you have selected (and LC_MESSAGES is not a placeholder but a real directory name).

That is good except for the fact that /var/www/lib is usually not in Perl's @INC. But you can change that where you invoke the template processor:

BEGIN {
    # Let the plug-in find /var/www/lib/LocaleData.
    unshift @INC, '/var/www/lib';
}

use Template;

my $tt = Template->new;
$tt->process('template.html', $data) or die $tt->error;

You can completely override the default search order in the templates:


[% USE gtx = Gettext('com.mydomain.www', lang, 'utf-8',
                     '/var/www/locale', '/srv/www/locale') %]

Now, the French translation would be searched in /var/www/locale/fr/LC_MESSAGES/com.mydomain.www.mo and /srv/www/locale/fr/LC_MESSAGES/com.mydomain.www.mo.
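
Whichever base directory you pick, installing a catalog boils down to copying the compiled file to BASEDIR/LANG/LC_MESSAGES/TEXTDOMAIN.mo. The following sketch spells that out; the function name install_mo and its argument order are assumptions made for this example:

```shell
# Hypothetical helper: install a compiled catalog.
#   $1  base directory, for example /var/www/lib/LocaleData
#   $2  language code, for example fr
#   $3  textdomain, for example com.mydomain.www
#   $4  compiled catalog to install, for example fr.mo
install_mo() {
    target_dir="$1/$2/LC_MESSAGES"
    mkdir -p "$target_dir" &&
        cp "$4" "$target_dir/$3.mo"
}

# Example:
# install_mo /var/www/lib/LocaleData fr com.mydomain.www fr.mo
```

Note that the installed file is named after the textdomain, not after the language. The language only appears as a directory name.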

Updating Translation Files

Translations may become obsolete, when the source templates change. In this case, you have to merge the new set of translatable strings into the existing translation files. Fortunately, GNU Gettext makes this easy:

$ xgettext-tt2 --files-from=POTFILES \
    --output=com.mydomain.www.pot \
    --add-comments=TRANSLATORS: --from-code=utf-8 \
    --force-po
$ cp fr.po fr.old.po
$ msgmerge fr.old.po com.mydomain.www.pot -o fr.po
....... done

You first update the .pot file with xgettext-tt2 so that it contains the current set of translatable strings. You then make a backup of each .po file and then invoke the program msgmerge for merging the current translations from fr.old.po with the new set of strings from com.mydomain.www.pot into the updated translation file fr.po.

The file fr.po will now contain the new strings as untranslated entries. Strings that have only slightly changed will retain their translations but they will be marked as "fuzzy", so that they can be reviewed by a translator. Entries for strings that are no longer present in the sources are obsoleted.
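
For more than one language, the backup and merge steps can be wrapped in a loop. This is only a sketch: the function name update_catalogs is made up, and the backup name LL.po.bak is an arbitrary choice that keeps the backups from matching *.po on the next run.

```shell
# Hypothetical helper: merge the .pot file given as $1 into every LL.po
# in the directory given as $2, keeping a backup of each catalog.
update_catalogs() {
    for po in "$2"/*.po; do
        # Without any .po file the pattern stays unexpanded; skip it.
        [ -e "$po" ] || continue
        cp "$po" "$po.bak" &&
            msgmerge "$po.bak" "$1" -o "$po" || return 1
    done
}

# Example:
# update_catalogs com.mydomain.www.pot po
```

Run it after the .pot file has been regenerated, so that the catalogs are merged with the current set of strings.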

Integrating With Other Programming Languages

The GNU Gettext framework is available for a lot of programming languages and it is not uncommon that two or more of these languages are mixed in a project. It is beneficial in these cases to use a common translation base for all used technologies.

xgettext-tt2 is based on Locale-XGettext and therefore not only understands Template Toolkit templates but also .po and .pot files as input. GNU Gettext's xgettext has the same feature.

Accumulating all translatable strings from the different technologies is therefore very easy. If you have a project that uses Template Toolkit for rendering web pages and Perl for the business logic you first extract strings from your Perl files --- as usual --- with xgettext from GNU gettext into a temporary file, for example plfiles.pot. Then you extract the strings from the templates with xgettext-tt2 from this library, but you specify plfiles.pot as an additional input file. And now the output file of xgettext-tt2 contains all the strings from the template files plus those from the Perl files in plfiles.pot.
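
The two steps just described might look like the following sketch. The function name extract_all and the file name PLFILES (a list of the Perl source files, analogous to POTFILES) are assumptions made for this example. The template files are passed as arguments here, so that plfiles.pot can simply be appended as one more input file.

```shell
# Hypothetical helper: collect the strings from Perl sources and
# templates in one .pot file.
#   $1       textdomain, used as the basename of the resulting .pot file
#   $2 ...   template files
# The Perl source files are read from the list in the file PLFILES.
extract_all() {
    textdomain=$1
    shift
    # Step 1: the strings from the Perl sources go into a temporary file.
    xgettext --language=Perl --from-code=utf-8 \
        --files-from=PLFILES --output=plfiles.pot --force-po || return 1
    # Step 2: the templates plus plfiles.pot go into the final .pot file.
    xgettext-tt2 --from-code=utf-8 --output="$textdomain.pot" \
        --force-po "$@" plfiles.pot
}

# Example:
# extract_all com.mydomain.www templates/*.html
```

Because of --force-po, plfiles.pot is written even when the Perl sources contain no translatable strings, so the second step always finds its additional input file.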

Of course, you can also do it the other way round, extract with xgettext-tt2 into ttfiles.pot, and then feed that as an additional input file to GNU Gettext's xgettext.

You can use the seed project Template-Plugin-Gettext-Seed as a fully functional starting point for such setups.

Bugs

Please report bugs at https://github.com/gflohr/Template-Plugin-Gettext/issues

Author

Template-Plugin-Gettext was written by Guido Flohr.