NAME
CCCP::Encode - Perl extension for converting UTF-8 text to any Cyrillic encoding (koi8-r, windows-1251, etc.)
Version 0.03
SYNOPSIS
use CCCP::Encode;
$CCCP::Encode::ToText = 0; # default
$CCCP::Encode::Entities = 'xml'; # default
my $str = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
print CCCP::Encode->utf2cyrillic($str,'koi8-r');
# output in koi8-r:
# если в слове 'хлеб' поменять 4 буквы, то получится &#x2014; ПИВО
$str = "Иероглифы: 牡 マ キ グ ナ ル フ";
print CCCP::Encode->utf2cyrillic($str,'windows-1251');
# output in windows-1251:
# Иероглифы: &#x7261; &#x30DE; &#x30AD; &#x30B0; &#x30CA; &#x30EB; &#x30D5;
# --------------------------
$CCCP::Encode::ToText = 0; # default
$CCCP::Encode::Entities = 'html';
$str = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
print CCCP::Encode->utf2cyrillic($str,'koi8-r');
# output in koi8-r:
# если в слове 'хлеб' поменять 4 буквы, то получится &mdash; ПИВО
$str = "Иероглифы: 牡 マ キ グ ナ ル フ";
print CCCP::Encode->utf2cyrillic($str,'windows-1251');
# output in windows-1251:
# Иероглифы: &#x7261; &#x30DE; &#x30AD; &#x30B0; &#x30CA; &#x30EB; &#x30D5;
# --------------------------
$CCCP::Encode::ToText = 1;
$str = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
print CCCP::Encode->utf2cyrillic($str,'koi8-r');
# output in koi8-r:
# если в слове 'хлеб' поменять 4 буквы, то получится -- ПИВО
$CCCP::Encode::CharMap = {"\x{2014}" => '-'};
print CCCP::Encode->utf2cyrillic($str,'koi8-r');
# output in koi8-r:
# если в слове 'хлеб' поменять 4 буквы, то получится - ПИВО
DESCRIPTION
This module converts a UTF-8 string to a Cyrillic encoding in one of two modes:
convert to a Cyrillic string with HTML entities,
convert to a Cyrillic string containing only plain-text characters.
By default, unknown characters are replaced using HTML::Entities in entity mode and Text::Unidecode in plain-text mode. You can override the replacement for any character, and you can override the regular expression that selects which characters are replaced.
INTRODUCTION
Ajax libraries on the frontend send data in UTF-8. If your backend works in koi8-r, windows-1251, etc., you have a problem:
use Encode;
...
my $data = $post->param('any');
# $data = "если в слове 'хлеб' поменять 4 буквы, то получится — ПИВО";
Encode::from_to($data,'utf-8','koi8-r');
print $data;
# output:
# если в слове 'хлеб' поменять 4 буквы, то получится ? ПИВО
The from_to function from the Encode module replaces every unknown character with '?'. This data is then saved in your database, and you end up writing guano-magic code to fix the problem. Every developer whose database is not in UTF-8 knows this problem.
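The substitution described above can be reproduced with the core Encode module alone. The snippet below is a minimal sketch, not CCCP::Encode code: the sample string is made up for illustration, and the second call uses Encode's own FB_XMLCREF fallback flag only to contrast the lossy default with a reversible replacement.

```perl
use strict;
use warnings;
use Encode ();

# Character string: Cyrillic "пиво", an em dash (U+2014), and ASCII text.
my $text = "\x{43F}\x{438}\x{432}\x{43E} \x{2014} beer";

# Default substitution: U+2014 has no koi8-r mapping, so it becomes '?'.
my $lossy = Encode::encode('koi8-r', $text);

# Encode's FB_XMLCREF fallback emits a numeric character reference instead.
# LEAVE_SRC keeps encode() from modifying $text in place.
my $with_refs = Encode::encode('koi8-r', $text,
    Encode::FB_XMLCREF() | Encode::LEAVE_SRC());

# koi8-r is single-byte, so the ASCII tail starts after the 4 Cyrillic bytes.
print substr($lossy, 4), "\n";       # prints " ? beer"
print substr($with_refs, 4), "\n";   # prints " &#x2014; beer"
```

Once the '?' has been written to the database, the original character cannot be recovered, which is why the replacement has to happen at conversion time.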
Another case: fetching data from RSS feeds in UTF-8 and saving it in a Cyrillic database (for example, MySQL with default charset koi8-r or windows-1251).
CCCP::Encode fixes this problem.
METHODS
utf2cyrillic($str,$to)
$str
the string to convert (in UTF-8).
$to
the target encoding name, the same as $to in Encode::from_to($str,'utf-8',$to).
PACKAGE VARIABLES
$CCCP::Encode::Entities
Ignored if $CCCP::Encode::ToText is true. The default value is 'xml'. In 'xml' mode, every character that is unknown in the target charset is replaced with a valid XML numeric entity (e.g. &#x2014;). In 'html' mode, every such character is replaced with an HTML entity (e.g. &mdash;).
$CCCP::Encode::ToText
Default is false.
If $CCCP::Encode::ToText is false, utf2cyrillic returns the converted string with each unknown character replaced by your own definition (see $CCCP::Encode::CharMap) or by an entity from HTML::Entities.
If $CCCP::Encode::ToText is true, utf2cyrillic returns the converted string in plain-text form, with each unknown character replaced by your own definition (see $CCCP::Encode::CharMap) or by the output of Text::Unidecode.
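The two modes can be illustrated with a small sketch built only on the core Encode module, which accepts a code reference as its fallback argument. This is an assumption-laden illustration, not the module's implementation: convert_sketch and %ascii_for are hypothetical names, and the two-entry table merely stands in for what Text::Unidecode would provide.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for a transliteration table such as the one
# Text::Unidecode provides; only two entries, for illustration.
my %ascii_for = ("\x{2014}" => '--', "\x{2026}" => '...');

# Sketch: encode $text to $to. Characters missing from the target charset
# go through a fallback chosen by the $to_text flag.
sub convert_sketch {
    my ($text, $to, $to_text) = @_;
    my $fallback = $to_text
        ? sub { my $char = chr shift; $ascii_for{$char} // '?' }  # plain text
        : sub { sprintf '&#x%X;', shift };                        # numeric entity
    return Encode::encode($to, $text, $fallback);
}

my $text = "A \x{2014} B\x{2026}";
print convert_sketch($text, 'koi8-r', 0), "\n";   # prints "A &#x2014; B&#x2026;"
print convert_sketch($text, 'koi8-r', 1), "\n";   # prints "A -- B..."
```

The flag plays the role of $CCCP::Encode::ToText: the target encoding is the same in both calls, only the replacement for unmappable characters differs.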
$CCCP::Encode::CharMap
Default is empty hashref.
You can define a custom mapping for any character. This is very flexible if you need replacements different from those of HTML::Entities or Text::Unidecode. Example:
$CCCP::Encode::CharMap = {
"\x{2014}" => '-',
"\x{2015}" => 'foo'
};
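How a map like this could take precedence over the default replacement is sketched below. The helper replace_sketch is hypothetical and the numeric-entity fallback is an assumption made for the illustration; the module's internals may differ.

```perl
use strict;
use warnings;

# Override map, same shape as the $CCCP::Encode::CharMap example.
my %char_map = ("\x{2014}" => '-', "\x{2015}" => 'foo');

# The documented default pattern for characters that need replacing.
my $needs_replacing = qr/[^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}]/;

# Hypothetical helper: map entries win; anything else becomes a numeric entity.
sub replace_sketch {
    my ($text) = @_;
    $text =~ s{($needs_replacing)}
              { exists $char_map{$1} ? $char_map{$1} : sprintf('&#x%X;', ord $1) }ge;
    return $text;
}

print replace_sketch("A \x{2014} B \x{2015} C \x{2026}"), "\n";
# prints "A - B foo C &#x2026;"
```

Only U+2026 falls through to the default here, because the other two characters have explicit map entries.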
$CCCP::Encode::Regexp
The default value is [^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}]: any character that is outside the Cyrillic and Latin sets is replaced. You can override this expression.
For more on Unicode properties in regular expressions, see http://www.regular-expressions.info/unicode.html
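To see which characters the default pattern selects, it can be run against a mixed string. The sample text below is made up for illustration; the pattern is copied from the default value as documented.

```perl
use strict;
use warnings;

# The default pattern exactly as documented.
my $default_re = qr/[^\p{Cyrillic}|\p{IsLatin}|\p{InBasic_Latin}]/;

# Cyrillic "пиво", em dash, an ASCII word, and a CJK ideograph.
my $text = "\x{43F}\x{438}\x{432}\x{43E} \x{2014} beer \x{7261}";

# Collect the code point of every character the pattern matches.
my @matched;
while ($text =~ /($default_re)/g) {
    push @matched, sprintf 'U+%04X', ord $1;
}
print "@matched\n";   # prints "U+2014 U+7261"
```

Cyrillic letters, ASCII letters, and spaces pass through untouched; only the em dash and the ideograph are selected for replacement.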
OVERHEAD
CCCP::Encode with $CCCP::Encode::Entities eq "html":
2 wallclock secs ( 1.63 usr + 0.01 sys = 1.64 CPU) @ 60975.61/s (n=100000)
CCCP::Encode with $CCCP::Encode::Entities eq "xml":
3 wallclock secs ( 2.49 usr + 0.00 sys = 2.49 CPU) @ 40160.64/s (n=100000)
CCCP::Encode with $CCCP::Encode::ToText eq "1":
4 wallclock secs ( 3.85 usr + 0.02 sys = 3.87 CPU) @ 25839.79/s (n=100000)
Encode::from_to(...) :
2 wallclock secs ( 1.93 usr + 0.01 sys = 1.94 CPU) @ 51546.39/s (n=100000)
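Timing lines in this format are what the core Benchmark module prints. The sketch below shows one way such a figure could be obtained for the Encode::from_to baseline; it does not reproduce the author's harness, and the sample data and the reduced iteration count are assumptions chosen so the example finishes quickly.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);
use Encode ();

# UTF-8 octets for a short Cyrillic phrase containing an em dash.
my $octets = Encode::encode('UTF-8',
    "\x{43F}\x{438}\x{432}\x{43E} \x{2014} \x{445}\x{43B}\x{435}\x{431}");

# Smaller than the n=100000 used in the figures, to keep the run short.
my $count = 10_000;

my $timing = timeit($count, sub {
    my $copy = $octets;                         # from_to converts in place
    Encode::from_to($copy, 'utf-8', 'koi8-r');
});

print 'Encode::from_to(...): ', timestr($timing), "\n";
```

Absolute numbers depend on the hardware and the Perl version, so only the ratio between the variants is meaningful.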
SEE ALSO
Encode
Text::Unidecode
AUTHOR
Ivan Sivirinov