NAME

Lingua::ZH::HanConvert - convert between Traditional and Simplified Chinese characters

SYNOPSIS

    #!perl -lw
    use Lingua::ZH::HanConvert qw(simple trad);
    use utf8;

    my $t = "國"; # Traditional symbol for "country", unicode 22283
    # or: my $t = v22283;

    print simple($t); # Simplified "country", 国 (unicode 22269)

    my $s = "鱼"; # Simplified symbol for "fish", unicode 40060
    # or: my $s = v40060;

    print trad($s); # Traditional "fish", 魚 (unicode 39770)

REQUIRES

Perl 5.6

DESCRIPTION

In the 1950s, the Chinese government simplified over 2000 Chinese characters. Taiwan and Hong Kong still use the traditional characters. The simplified characters are hard to read if you only know the traditional ones, and vice versa.

This module attempts to convert Chinese text between the two forms, using character-by-character transliteration.

Note that this module only handles Unicode text in the UTF-8 encoding. If you need to convert to or from the Big5 and GB character sets, then please look at Text::Iconv.
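As a rough illustration of the re-encoding step mentioned above, the sketch below uses Perl's core Encode module (bundled with Perl 5.8 and later) rather than the iconv binding the author recommends. The encoding names "big5-eten" and "euc-cn" are Encode's names for Big5 and GB2312; the sample bytes are generated inside the script only so that it is self-contained. Whether simple and trad expect decoded character strings or raw UTF-8 bytes on a given Perl version is not stated here, so check that against the module before wiring the two together.

```perl
#!perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Stand-in for bytes read from a Big5-encoded file
# (built here only to keep the sketch self-contained).
my $big5_bytes = encode("big5-eten", "國");

# Big5 bytes -> Perl character string -> UTF-8 bytes.
my $text       = decode("big5-eten", $big5_bytes);
my $utf8_bytes = encode("UTF-8", $text);

# The same idea in the other direction, for GB2312 output.
my $gb_bytes = encode("euc-cn", "国");

printf "Big5: %d bytes, UTF-8: %d bytes, GB: %d bytes\n",
    length($big5_bytes), length($utf8_bytes), length($gb_bytes);
```

Any conversion between traditional and simplified forms would happen on the decoded text, between the decode and encode steps.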

simple takes a string, converts any traditional Chinese characters (such as 國, unicode U+570B, meaning "country") to the corresponding simplified characters (like 国, unicode U+56FD, also meaning "country"), and returns the result. Characters which are not traditional Chinese do not change.

trad does the reverse; it converts any simplified Chinese characters to the corresponding traditional characters. Characters which are not simplified Chinese do not change.

If a simplified character has two or more corresponding traditional characters, then it will be replaced by all of them, enclosed in square brackets. To use different characters instead of the square brackets, give them as the second and third arguments to trad. The same applies where a traditional character has two or more corresponding simplified forms, but this happens much more rarely.
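The behaviour described above can be pictured with a small self-contained sketch. It is not the module's implementation: the table %TOY_TRAD and the function toy_trad are hypothetical names with a three-entry mapping, and the sketch assumes the alternatives are joined with no separator between the delimiters.

```perl
#!perl
use strict;
use warnings;
use utf8;

# Hypothetical toy table: simplified character => list of traditional forms.
# The real module's tables are far larger and are not reproduced here.
my %TOY_TRAD = (
    "国" => ["國"],          # one-to-one
    "鱼" => ["魚"],          # one-to-one
    "发" => ["發", "髮"],    # one simplified form, two traditional forms
);

# Convert character by character; wrap ambiguous results in delimiters.
sub toy_trad {
    my ($text, $open, $close) = @_;
    $open  = "[" unless defined $open;
    $close = "]" unless defined $close;

    my $out = "";
    for my $ch (split //, $text) {
        my $forms = $TOY_TRAD{$ch};
        if    (!$forms)      { $out .= $ch }           # not in the table: unchanged
        elsif (@$forms == 1) { $out .= $forms->[0] }   # unambiguous mapping
        else                 { $out .= $open . join("", @$forms) . $close }
    }
    return $out;
}

binmode STDOUT, ":encoding(UTF-8)";
print toy_trad("国a发"), "\n";            # 國a[發髮]
print toy_trad("国a发", "<", ">"), "\n";  # 國a<發髮>
```

With the real module, the equivalent call for custom delimiters would be trad($s, "<", ">"), following the argument order described above.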

BUGS, LIMITATIONS

There may be mistakes in the transliterations. A number of data sources were used to build the transliteration tables, including dictionaries and the Unicode consortium's Unihan database, but some mappings may be incorrect or missing.

Some characters which are simplified forms are also traditional forms. For example, 面, Unicode U+9762, is the simplified form of 麵, Unicode U+9EB5, meaning "noodles"; but it is also the character for "face" in both traditional and simplified writing. Most references about simplified characters are designed for humans, so they do not mention the latter type of mapping: a human who came across such a character could use common sense to understand it. To provide this module with that extra information, it has been assumed that any simplified form which appears in the Big5 character set is also a traditional form. In some cases this assumption may be incorrect or insufficient (i.e. there may be simplified forms which are also traditional forms but do not appear in Big5).

The transliteration mappings could be improved. Ideally, I'd like to see the module performing word-by-word transliteration, if suitable data sources were available. See http://www.basistech.com/articles/C2C.html for a discussion of transliteration issues.

The conversions are slow, which may be a problem if you need to process a lot of text. Please let me know if the module is too slow for your purposes; I can probably speed it up if this would be useful.

The characters in this documentation may not display correctly unless the program you are reading it with is Unicode-aware.

ACKNOWLEDGEMENTS

The data used by this module is taken from the Unicode consortium's Unihan database, available from ftp://ftp.unicode.org. Thanks to them for compiling the data.

AUTHOR

David Chan <david@sheetmusic.org.uk>

COPYRIGHT

Copyright (C) 2001, David Chan. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
