Changes for version 1.35 - 2013-08-13

  • improve conversion of certain composed characters to MARC8
    • Some characters should not be fully decomposed before converting them to MARC8. This patch adds a table of such characters, based on Annex A of http://www.loc.gov/marc/marbi/2006/2006-04.html and on some sample records provided by Jason Stephenson of MVLC.
  • recognize G0 and G1 characters properly
    • When converting from MARC8 to UTF8, MARC::Charset now properly recognizes whether a (single-byte) MARC8 character falls in G0 or G1.
    • This is part of the fix for RT#63271 (converting characters in the Extended Cyrillic character set), but should also fix similar issues with converting characters in the extended Arabic set.
    • This commit also means that all MARC8 character sets that support both G0 and G1 will be properly converted, regardless of whether they're currently set as the G0 or G1 character set. For example, it is now possible to convert Extended Latin as G0 or Basic Latin as G1.
    • This fixes RT#63271
  • have MARC::Charset::Code->marc_value() handle G0/G1 conversion
    • Since there is at present no need to, for example, make ANSEL the G0 character set when converting from UTF8 to MARC8, this commit centralizes the logic for deciding whether to return the G0 or G1 MARC8 representation of a character.
    • Also add MARC::Charset::Code->g0_marc_value(), which returns the G0 representation of the character for use by the character DB.
  • add new test cases for converting Vietnamese and Extended Cyrillic text
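
As a rough illustration of the G0/G1 distinction described in the entries above, the following minimal sketch classifies a single MARC8 byte by its code range. It is written in Python for brevity and is not MARC::Charset's actual code; the function names marc8_graphic_area and to_g0_value are hypothetical. It assumes the usual 8-bit code layout for 94-character sets, where G0 graphic characters occupy 0x21-0x7E and G1 graphic characters occupy 0xA1-0xFE, so that the G1 form of a character is its G0 form with the high bit set.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part
# of MARC::Charset. They assume 94-character sets in an 8-bit environment.

def marc8_graphic_area(byte):
    """Return 'G0' or 'G1' for a graphic MARC8 byte, or None otherwise."""
    if 0x21 <= byte <= 0x7E:
        return "G0"
    if 0xA1 <= byte <= 0xFE:
        return "G1"
    # Control codes, space, delete, 0xA0 and 0xFF fall outside both areas.
    return None


def to_g0_value(byte):
    """Return the G0 form of a graphic byte by clearing the high bit."""
    if marc8_graphic_area(byte) is None:
        raise ValueError("not a graphic MARC8 byte: 0x%02X" % byte)
    return byte & 0x7F


if __name__ == "__main__":
    for b in (0x41, 0xE1, 0x20):
        print("0x%02X -> %s" % (b, marc8_graphic_area(b)))
```

Normalizing a byte to its G0 form in this way is one plausible reason a character database could store a single representation per character and still serve lookups for either area; whether MARC::Charset does exactly this is an assumption here.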

Documentation

compile the LoC mapping table
print the marc8 conversion table as HTML

Modules

convert MARC-8 encoded strings to UTF-8
represents a MARC-8/UTF-8 mapping
compile XML mapping rules from LoC
constants for MARC::Charset
character mapping db