Conventions used in this manual

  • Every non-absolute path is relative to the root directory of the source code.

Adding a transliteration

If you want to add a new transliteration to Lingua::Translit, just...

  • write an XML file (the "transliteration table")

  • build a development version containing your table

  • write and run some tests to check that your transliteration works as expected

  • integrate your table into the set of upstream tables and consider contributing it

Writing a transliteration table

Each XML file consists of metadata and a set of transliteration rules.

The metadata tags specify the name of the transliteration, a short description, and whether the transliteration can be used in both directions. For example:

  <name>DIN 1460</name>
  <desc>DIN 1460: Cyrillic to Latin</desc>
  <reverse>true</reverse>

The rules can be simple one-to-one mappings:

  <rule>
    <from>X</from>
    <to>Y</to>
  </rule>

...but you can also specify a context that restricts where the rule is applied:

  <rule>
    <from>A</from>
    <to>B</to>
    <context>
      <after>x</after>
      <before>y</before>
    </context>
  </rule>

To get started quickly, copy the file xml/template.xml, rename it to suit your table, and edit it right away.

Template: template.xml
Complete example: Common_DEU.xml

Although editing an XML file is technically quite easy, a few points have to be observed. Most importantly, the rules are applied in sequence, one after another. The order of the rules therefore matters whenever you specify a context or transliterate multiple characters.

Unicode notation

To specify a non-ASCII character, use an entity that gives its Unicode code point in hexadecimal notation, and add a comment that names the character.

  <rule>
    <from>&#x0410;</from>         <!-- CYRILLIC CAPITAL LETTER A --> 
    <to>A</to>
  </rule>

This ensures that the correct character is transliterated, and that the character can still be identified exactly if it is not displayed correctly.
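
When you write the comment for a rule, it can help to double-check that the entity really denotes the character you have in mind. The following is a minimal Python sketch that uses only the standard library. It is not part of Lingua::Translit, and the helper name describe_entity is made up for this illustration:

```python
import html
import unicodedata


def describe_entity(entity: str) -> str:
    """Resolve a numeric entity such as '&#x0410;' and return its code point and Unicode name."""
    char = html.unescape(entity)
    if len(char) != 1:
        # html.unescape() leaves malformed entities untouched, so the length is not 1.
        raise ValueError(f"not a single character entity: {entity!r}")
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


print(describe_entity("&#x0410;"))  # U+0410 CYRILLIC CAPITAL LETTER A
```

The printed name can be copied into the comment next to the <from> tag.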

Specifying a context

The context is evaluated as a Perl regular expression, so you can specify it using literal ASCII characters, entities, or regular expression metacharacters.

If a character has two mappings depending on the context, the context-sensitive rule must come first, followed by the context-free rule. Otherwise the context-free rule replaces every occurrence of the character, and the context-sensitive rule never matches.

Rule 1:

  <rule>
    <from>&#x0393;&#x03BA;</from> <!-- GREEK CAPITAL LETTER GAMMA &
                                       GREEK SMALL LETTER KAPPA -->
    <to>Gk</to>
    <context>
      <after>\b</after>           <!-- word initial -->
    </context>
  </rule>

Rule 2:

  <rule>
    <from>&#x0393;&#x03BA;</from> <!-- GREEK CAPITAL LETTER GAMMA &
                                       GREEK SMALL LETTER KAPPA -->
    <to>Nk</to>
  </rule>

The following pattern matching contexts are available:

<after>

if the transliteration rule should only be applied after a certain character (corresponds to Perl's lookbehind)

<before>

if the rule should only be applied before a certain character (corresponds to Perl's lookahead)

<after> & <before>

if the rule should only be applied when the character is located between two specific characters
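
To make the lookbehind and lookahead analogy concrete, here is a rough Python sketch of how a single rule with a context could be turned into a regular expression. It only illustrates the idea. It is not Lingua::Translit's actual implementation, and the function apply_rule and its parameters are invented for this example:

```python
import re


def apply_rule(text, src, dst, after=None, before=None):
    """Apply one rule; `after` and `before` are regex fragments, like the <context> tags."""
    pattern = re.escape(src)              # <from> is treated as literal text
    if after is not None:
        pattern = f"(?<={after})" + pattern   # <after> becomes a lookbehind
    if before is not None:
        pattern = pattern + f"(?={before})"   # <before> becomes a lookahead
    # A function is used as replacement so that `dst` is inserted literally.
    return re.sub(pattern, lambda match: dst, text)


# The rule with both contexts from "Writing a transliteration table":
print(apply_rule("xAy xAz", "A", "B", after="x", before="y"))   # xBy xAz

# The two Greek rules above, context-sensitive rule first:
text = "\u0393\u03ba x\u0393\u03ba"
text = apply_rule(text, "\u0393\u03ba", "Gk", after=r"\b")      # word initial only
text = apply_rule(text, "\u0393\u03ba", "Nk")                   # all remaining ones
print(text)                                                     # Gk xNk
```

If the two Greek rules are applied in the opposite order, the context-free rule consumes both occurrences and the result is "Nk xNk".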

Multiple characters

Because all rules are applied in sequence, every rule that matches multiple characters must precede the single-character rules.

Rule 1:

  <rule>
    <from>&#x03B1;&#x03C5;</from> <!-- GREEK SMALL LETTER ALPHA &
                                       GREEK SMALL LETTER UPSILON -->
    <to>au</to>
  </rule>

Rule 2:

  <rule>
    <from>&#x03B1;</from>         <!-- GREEK SMALL LETTER ALPHA -->
    <to>a</to>
  </rule>

If you switched the order of the rules in the above example, every single alpha would be transliterated first and the digraph rule would never match.
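
The effect of the rule order can be reproduced with a few lines of Python. This is a simplified model that assumes context-free rules only and plain substring replacement. The function transliterate is a stand-in for this illustration, not the module's code:

```python
ALPHA = "\u03b1"     # GREEK SMALL LETTER ALPHA
UPSILON = "\u03c5"   # GREEK SMALL LETTER UPSILON
TAU = "\u03c4"       # GREEK SMALL LETTER TAU


def transliterate(text, rules):
    """Apply (from, to) pairs strictly in sequence, one after another."""
    for src, dst in rules:
        text = text.replace(src, dst)
    return text


digraph_first = [(ALPHA + UPSILON, "au"), (ALPHA, "a")]
single_first = [(ALPHA, "a"), (ALPHA + UPSILON, "au")]

word = ALPHA + UPSILON + TAU + ALPHA

# Correct order: the digraph is handled before the single alpha.
print(transliterate(word, digraph_first) == "au" + TAU + "a")   # True

# Wrong order: alpha is replaced first, so the upsilon is left behind untouched.
print(transliterate(word, single_first) == "a" + UPSILON + TAU + "a")   # True
```

The tau stays as it is in both cases because this toy rule set has no rule for it.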

Building a development version

Your new transliteration table has to be converted to a Perl data structure and stored in xml/tables.dump before it can be used and tested as a development version of Lingua::Translit.

xml2dump.pl is a tool that processes XML transliteration table definitions and converts them to Perl data structures. Normally, all stable transliteration tables are processed once, stored in xml/tables.dump, and included in the Lingua::Translit::Tables module at build time.

Using xml2dump.pl

The xml2dump.pl tool handles this conversion:

  alinke$ ./xml2dump.pl -v -o tables.dump mytable.xml 
  Parsing mytable.xml... (MyTable: rules=2, contexts=1)
  1 transliteration table(s) dumped to tables.dump.

It reads an XML definition, processes it and dumps the resulting data structure to a given file (-o switch).

Your transliteration table is now ready to be included by Lingua::Translit::Tables so it can be tested and evaluated.

Building a temporary Lingua::Translit

Use the standard toolchain to build a temporary development version of Lingua::Translit that contains nothing but your new transliteration table.

  alinke$ perl Makefile.PL && make

Given the resulting development version, it's time to test the transliteration table for completeness and correct functionality.

Testing the transliteration table

To verify that your set of transliteration rules works correctly, you need to write some tests using your favorite Perl test framework. For a simple and complete example that uses the Test::More framework, have a look at:

t/11_tr_Common_DEU.t

Lingua::Translit comes with a ready-to-use test template that you can use as a starting point and adapt to your transliteration's specific needs. It is located at t/xx_tr_template.t.pl. To follow Lingua::Translit's naming convention, rename it to NN_tr_NAME.t.

Online version of the template: t/xx_tr_template.t.pl

Hints on what to test

  • If your transliteration is straightforward (only "1:1" mappings), just test a small text and have a look at the result. If everything is correct, you are done.

  • If the transliteration is reversible, check that both directions transliterate correctly.

  • All context-sensitive and multi-character transliterations should be tested explicitly to ensure that the error-prone mappings also work as expected.
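
The idea behind testing both directions can be sketched as a round-trip check. The Python snippet below uses a toy 1:1 table and assumes, purely for illustration, that the reverse direction simply swaps source and target. The table and all function names are hypothetical, and a real test would call your development version of Lingua::Translit instead:

```python
# Toy 1:1 table (hypothetical and far from complete).
TABLE = {
    "\u0410": "A",   # CYRILLIC CAPITAL LETTER A
    "\u0411": "B",   # CYRILLIC CAPITAL LETTER BE
    "\u0412": "V",   # CYRILLIC CAPITAL LETTER VE
}


def translit(text, table):
    """Replace each character that has a mapping; keep all others unchanged."""
    return "".join(table.get(char, char) for char in text)


def reverse_table(table):
    """Swap source and target; fail if two sources share the same target."""
    swapped = {dst: src for src, dst in table.items()}
    if len(swapped) != len(table):
        raise ValueError("table is not reversible")
    return swapped


def failed_round_trips(samples, table):
    """Return the samples that do not survive forward plus reverse transliteration."""
    back = reverse_table(table)
    return [s for s in samples if translit(translit(s, table), back) != s]


print(failed_round_trips(["\u0410\u0411\u0412", "\u0412\u0410"], TABLE))   # []
```

An empty list means that every sample survived the round trip. Any sample that is returned points to a mapping that deserves an explicit test of its own.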

Running the tests

While testing, it is convenient to define the environment variable PERL5LIB (have a look at perlrun(1)) so that the Perl interpreter knows where your development version of Lingua::Translit is located. The following example session assumes that you are using bash(1) or a similar shell:

  alinke$ export PERL5LIB="blib/lib"
  alinke$ perl t/66_tr_mytest.t
  1..2
  ok 1 - MyTable: not reversible
  ok 2 - MyTable: transliteration

If all tests pass and your transliteration table is therefore ready for use, clean up your shell's environment and prepare to integrate your table into the existing set of transliteration tables:

  alinke$ unset PERL5LIB

Integrating the new table

Change to the xml/ directory and let make(1) call xml2dump.pl in order to build a data structure ("tables.dump") from all available XML transliteration tables, including yours:

  alinke$ make all-tables

Now, clean up the old files from the development version you used to write your tests. Change into the source directory's root and run:

  alinke$ make distclean && perl Makefile.PL && make

The result is a complete version of Lingua::Translit that contains all upstream tables, as well as your own addition.

  alinke$ make test

...ensures that everything is all right and ready for installation or packaging. Congratulations!

Contributing your table

If you would like to contribute your transliteration table under the terms of the GPL/Artistic License, it can be included in the official upstream version. To do so, create a patch of your changes and send it, along with a description and comments, to perl@lingua-systems.com so that it can become part of the next release.

© 2008 Alex Linke & Rona Linke
© 2009 Lingua-Systems Software GmbH