NAME
Text::GuessEncoding - convert text from almost any encoding to ASCII or UTF-8
VERSION
Version 0.01
SYNOPSIS
Text::GuessEncoding searches a string for non-ASCII content and rewrites it using ASCII replacements. For example, the German a-umlaut character is replaced by "ae". The input string may or may not have its utf8 flag set correctly; the flag is ignored. The returned string always has the utf8 flag off and contains no characters above codepoint 127, so it lies entirely within the ASCII character set.
If called in list context, to_ascii() returns the mapping table as a second value. This mapping table is a hash whose keys are the recognized encodings. (Any well-formed string should have only one encoding, but one can never be sure.) The value for each encoding is an array ref listing all the codepoints in the following form:
[ [ $codepoint, $replacement_bytecount, [ $offset, ... ] ], ... ]
Offset positions refer to the output string, where byte counts are identical to character counts.
Example:
my $guess = Text::GuessEncoding->new();
($ascii, $map) = $guess->to_ascii("J\x{fc}rgen \x{c3}\x{bc}\n");
# $ascii = 'Juergen ue';
# $map = { 'utf8' => [252, 2, [8]], 'latin1' => [252, 2, [1]] };
The input string contains both a UTF-8 encoded u-umlaut (two bytes) and a plain Latin-1 u-umlaut (one byte). The output string is never flagged as utf8.
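The mapping table can be walked with ordinary nested loops. The following is a minimal sketch, assuming the nested array-of-arrays form given in the format description above; the sample data and the helper name describe_map() are hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical sample data in the documented nested form:
# { encoding => [ [ $codepoint, $replacement_bytecount, [ @offsets ] ], ... ] }
my $ascii = "Juergen ue";
my $map   = {
    utf8   => [ [ 252, 2, [8] ] ],
    latin1 => [ [ 252, 2, [1] ] ],
};

# Return one record per replaced character, ordered by its offset in the
# ASCII output. (Hypothetical helper; call it in list context.)
sub describe_map {
    my ($ascii, $map) = @_;
    my @found;
    for my $encoding (sort keys %$map) {
        for my $entry (@{ $map->{$encoding} }) {
            my ($codepoint, $bytecount, $offsets) = @$entry;
            for my $offset (@$offsets) {
                push @found, {
                    encoding    => $encoding,
                    codepoint   => $codepoint,
                    offset      => $offset,
                    # Offsets index the output string, so substr() finds
                    # the replacement text directly.
                    replacement => substr($ascii, $offset, $bytecount),
                };
            }
        }
    }
    return sort { $a->{offset} <=> $b->{offset} } @found;
}

for my $hit (describe_map($ascii, $map)) {
    printf "U+%04X (%s) -> '%s' at offset %d\n",
        @{$hit}{qw(codepoint encoding replacement offset)};
}
```

Because offsets refer to the output string, no knowledge of the original byte layout is needed to locate each replacement.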
($utf8, $map) = $guess->to_utf8("J\x{fc}rgen \x{c3}\x{bc}\n");
# $utf8 = 'J\N{U+fc}rgen \N{U+fc}';
# $map = { 'utf8' => [7], 'latin1' => [1] };
to_utf8() returns a simpler mapping table, as the string preserves more information. Note that the offsets differ from those of to_ascii(), as no multi-character rewriting takes place. The output string is always flagged as utf8.
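To illustrate how mixed input like the one above might be told apart, here is a rough sketch. It is not the module's implementation: the function name guess_to_utf8() is hypothetical, and it assumes that only two-byte UTF-8 sequences need to be recognized, with every other high byte treated as Latin-1.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a byte string that may mix UTF-8 sequences
# and Latin-1 bytes, recording the output character offset of each hit.
sub guess_to_utf8 {
    my ($bytes) = @_;
    my $out = '';
    my %map;
    while (length $bytes) {
        if ($bytes =~ s/\A([\x00-\x7f]+)//) {
            $out .= $1;                      # plain ASCII run, copied as-is
        }
        elsif ($bytes =~ s/\A([\xc2-\xdf])([\x80-\xbf])//) {
            # Well-formed two-byte UTF-8 sequence: combine the payload bits.
            push @{ $map{utf8} }, length $out;
            $out .= chr(((ord($1) & 0x1f) << 6) | (ord($2) & 0x3f));
        }
        else {
            # Assumption: any other byte is a single Latin-1 character.
            $bytes =~ s/\A(.)//s;
            push @{ $map{latin1} }, length $out;
            $out .= chr(ord $1);
        }
    }
    utf8::upgrade($out);                     # flag the result as utf8
    return ($out, \%map);
}

my ($utf8, $map) = guess_to_utf8("J\x{fc}rgen \x{c3}\x{bc}\n");
```

A real implementation would also have to handle three- and four-byte sequences and other legacy encodings; the sketch only shows why the offsets in the to_utf8() map count characters of the output string.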
use Text::GuessEncoding;
my $asciitext = Text::GuessEncoding::to_ascii($enctext);
my ($asciitext,$mapping) = Text::GuessEncoding::to_ascii($enctext);
EXPORT
to_ascii() - create plain text in 7-bit ASCII encoding.
to_utf8() - return UTF-8 encoded text.
SUBROUTINES/METHODS
to_ascii
to_ascii() is implemented in Perl code as a post-processor of to_utf8(). It examines charnames::viacode($_) for each non-ASCII character and constructs useful ASCII replacements from the character names. A number of frequently used codepoint values can be precompiled for speed.
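The name-based approach can be sketched as follows. This is only an illustration under stated assumptions: the helper ascii_replacement(), the %precompiled table, and the '?' fallback are hypothetical, and the rule "diaeresis becomes base letter plus e" follows the German convention used in the examples above.

```perl
use strict;
use warnings;
use charnames ();

# Hypothetical table of precompiled replacements for frequent codepoints.
my %precompiled = (
    0xDF => 'ss',    # LATIN SMALL LETTER SHARP S
);

# Hypothetical helper: derive an ASCII replacement from the Unicode name.
sub ascii_replacement {
    my ($codepoint) = @_;
    return $precompiled{$codepoint} if exists $precompiled{$codepoint};

    # e.g. "LATIN SMALL LETTER U WITH DIAERESIS"
    my $name = charnames::viacode($codepoint) // '';
    if ($name =~ /\ALATIN (SMALL|CAPITAL) LETTER ([A-Z])(?: WITH (\w+).*)?\z/) {
        my ($case, $letter, $mark) = ($1, $2, $3 // '');
        $letter = lc $letter if $case eq 'SMALL';
        # Assumption: a diaeresis is always an umlaut (base letter + "e");
        # all other marks are simply dropped.
        $letter .= 'e' if $mark eq 'DIAERESIS';
        return $letter;
    }
    return '?';      # placeholder for names this sketch cannot interpret
}

printf "%s %s %s\n", map { ascii_replacement($_) } 0xFC, 0xDC, 0xE9;
```

Looking the codepoint up in the precompiled table first avoids the comparatively slow name lookup for common characters.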
AUTHOR
Juergen Weigert, <jw at suse.de>
BUGS
Please report any bugs or feature requests to bug-text-toascii at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-GuessEncoding. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
SUPPORT
You can find documentation for this module with the perldoc command.
perldoc Text::GuessEncoding
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
CPAN Ratings
Search CPAN
ACKNOWLEDGEMENTS
LICENSE AND COPYRIGHT
Copyright 2010 Juergen Weigert.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.