NAME

UTF8::R2 - makes UTF-8 scripting easy for enterprise use or LTS

SYNOPSIS

use UTF8::R2;
use UTF8::R2 ver.sion;            # match or die
use UTF8::R2 qw( RFC3629 );       # m/./ matches RFC3629 codepoint (default)
use UTF8::R2 qw( RFC2279 );       # m/./ matches RFC2279 codepoint
use UTF8::R2 qw( WTF8 );          # m/./ matches WTF-8 codepoint
use UTF8::R2 qw( RFC3629.ja_JP ); # optimized RFC3629 for ja_JP
use UTF8::R2 qw( WTF8.ja_JP );    # optimized WTF-8 for ja_JP
use UTF8::R2 qw( %mb );           # multibyte regex by %mb

  $result = UTF8::R2::chop(@_)
  $result = UTF8::R2::chr($utf8octet_not_unicode)
  $result = UTF8::R2::getc(FILEHANDLE)
  $result = UTF8::R2::index($_, 'ABC', 5)
  $result = UTF8::R2::lc($_)
  $result = UTF8::R2::lcfirst($_)
  $result = UTF8::R2::length($_)
  $result = UTF8::R2::ord($_)
  $result = UTF8::R2::qr(qr/$utf8regex/imsxo) # no /gc
  @result = UTF8::R2::reverse(@_)
  $result = UTF8::R2::reverse(@_)
  $result = UTF8::R2::reverse()
  $result = UTF8::R2::rindex($_, 'ABC', 5)
  @result = UTF8::R2::split(qr/$utf8regex/, $_, 3)
  $result = UTF8::R2::substr($_, 0, 5)
  $result = UTF8::R2::tr($_, 'A-C', 'X-Z', 'cdsr')
  $result = UTF8::R2::uc($_)
  $result = UTF8::R2::ucfirst($_)

  use UTF8::R2 qw(%mb);
  $result = $_ =~ $mb{qr/$utf8regex/imsxo}
  $result = $_ =~ m<\G$mb{qr/$utf8regex/imsxo}>gc
  $result = $_ =~ s<$mb{qr/before/imsxo}><after>egr

Octet-Semantics Functions vs. Codepoint-Semantics Subroutines

This software adds the ability to handle UTF-8 code points to bare Perl; it does not provide the ability to handle characters or graphemes in UTF-8. (Time is on our side, so let's all be excited for the day when code points represent graphemes.) Because this module overrides nothing, the built-in functions of bare Perl continue to provide octet semantics. UTF-8 codepoint semantics are provided by subroutines with new names.
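The difference between the two semantics can be seen with core Perl alone. The following is a minimal sketch, not the module's implementation: `$utf8_codepoint` is a hypothetical hand-written pattern for one well-formed RFC 3629 sequence, used here only to count codepoints next to the octet count that the built-in length() reports.

```perl
use strict;
use warnings;

# UTF-8 octets of three hiragana (U+3042, U+3044, U+3046), written as
# escapes so the example does not depend on this file's encoding.
my $text = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86";

# Hypothetical helper pattern (an assumption for illustration, not part of
# UTF8::R2): one well-formed RFC 3629 UTF-8 sequence.
my $utf8_codepoint = qr{
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
}x;

# Octet semantics: the built-in length() counts octets.
my $octet_count = length $text;

# Codepoint semantics: count consecutive well-formed sequences from the start.
my $codepoint_count = () = $text =~ /\G$utf8_codepoint/g;

print "octets: $octet_count, codepoints: $codepoint_count\n";
```

With the module installed, UTF8::R2::length($text) is the documented way to get the codepoint count; the pattern above only illustrates the idea.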

------------------------------------------------------------------------------------------------------------------------------------------
Octet-semantics         UTF-8 Codepoint-semantics
by traditional name     by new name                                Note and Limitations
------------------------------------------------------------------------------------------------------------------------------------------
chop                    UTF8::R2::chop(@_)                         usually chomp() is useful
------------------------------------------------------------------------------------------------------------------------------------------
chr                     UTF8::R2::chr($_)                          returns the octets of a UTF-8 codepoint, given its UTF-8 hex number (not its Unicode number)
------------------------------------------------------------------------------------------------------------------------------------------
getc                    UTF8::R2::getc(FILEHANDLE)                 get UTF-8 codepoint octets
------------------------------------------------------------------------------------------------------------------------------------------
index                   UTF8::R2::index($_, 'ABC', 5)              index() is compatible and usually useful
------------------------------------------------------------------------------------------------------------------------------------------
lc                      UTF8::R2::lc($_)                           works as tr/A-Z/a-z/, universally
------------------------------------------------------------------------------------------------------------------------------------------
lcfirst                 UTF8::R2::lcfirst($_)                      see UTF8::R2::lc()
------------------------------------------------------------------------------------------------------------------------------------------
length                  UTF8::R2::length($_)                       length() is compatible and usually useful
------------------------------------------------------------------------------------------------------------------------------------------
// or m// or qr//       UTF8::R2::qr(qr/$utf8regex/imsxo)          does not support metasymbol \X, which matches a grapheme
                        m<@{[UTF8::R2::qr(qr/$utf8regex/imsxo)]}>gc
                          or                                       does not support named characters (such as \N{GREEK SMALL LETTER EPSILON}, \N{greek:epsilon}, or \N{epsilon})
                        use UTF8::R2 qw(%mb);                      does not support character properties (like \p{PROP} and \P{PROP})
                        $mb{qr/$utf8regex/imsxo}                   modifiers i, m, s, x, o work at compile time
                        m<\G$mb{qr/$utf8regex/imsxo}>gc            modifiers g, c work at run time

                        Special Escapes in Regex                   Support Perl Version
                        --------------------------------------------------------------------------------------------------
                        $mb{qr/ \x{UTF8hex} /}                     since perl 5.005
                        $mb{qr/ [\x{UTF8hex}] /}                   since perl 5.005
                        $mb{qr/ [[:POSIX:]] /}                     since perl 5.005
                        $mb{qr/ [[:^POSIX:]] /}                    since perl 5.005
                        $mb{qr/ [^ ... ] /}                        ** CAUTION ** perl 5.006 cannot do this
                        $mb{qr/ [\x{UTF8hex}-\x{UTF8hex}] /}       since perl 5.008
                        $mb{qr/ \h /}                              since perl 5.010
                        $mb{qr/ \v /}                              since perl 5.010
                        $mb{qr/ \H /}                              since perl 5.010
                        $mb{qr/ \V /}                              since perl 5.010
                        $mb{qr/ \R /}                              since perl 5.010
                        $mb{qr/ \N /}                              since perl 5.012
                        (max \x{UTF8hex} is \x{7FFFFFFF}, so 4-octet codepoints cannot be written this way; pardon me, please!)
------------------------------------------------------------------------------------------------------------------------------------------
?? or m??                 (nothing)
------------------------------------------------------------------------------------------------------------------------------------------
ord                     UTF8::R2::ord($_)                          returns the UTF-8 number (not the Unicode number) of the UTF-8 codepoint octets
------------------------------------------------------------------------------------------------------------------------------------------
pos                       (nothing)
------------------------------------------------------------------------------------------------------------------------------------------
reverse                 UTF8::R2::reverse(@_)
------------------------------------------------------------------------------------------------------------------------------------------
rindex                  UTF8::R2::rindex($_, 'ABC', 5)             rindex() is compatible and usually useful
------------------------------------------------------------------------------------------------------------------------------------------
s/before/after/imsxoegr s<@{[UTF8::R2::qr(qr/before/imsxo)]}><after>egr
                          or
                        use UTF8::R2 qw(%mb);
                        s<$mb{qr/before/imsxo}><after>egr
------------------------------------------------------------------------------------------------------------------------------------------
split//                 UTF8::R2::split(qr/$utf8regex/imsxo, $_, 3)  *CAUTION* pass qr/re/, not /re/: UTF8::R2::split(/re/,$_,3) means UTF8::R2::split($_ =~ /re/,$_,3)
------------------------------------------------------------------------------------------------------------------------------------------
sprintf                   (nothing)
------------------------------------------------------------------------------------------------------------------------------------------
substr                  UTF8::R2::substr($_, 0, 5)                 substr() is compatible and usually useful
                                                                   :lvalue feature needs perl 5.014 or later
------------------------------------------------------------------------------------------------------------------------------------------
tr/// or y///           UTF8::R2::tr($_, 'A-C', 'X-Z', 'cdsr')     codepoint ranges written with a hyphen support ASCII only
------------------------------------------------------------------------------------------------------------------------------------------
uc                      UTF8::R2::uc($_)                           works as tr/a-z/A-Z/, universally
------------------------------------------------------------------------------------------------------------------------------------------
ucfirst                 UTF8::R2::ucfirst($_)                      see UTF8::R2::uc()
------------------------------------------------------------------------------------------------------------------------------------------
write                     (nothing)
------------------------------------------------------------------------------------------------------------------------------------------
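The notes for chr and ord above speak of a "UTF-8 number" rather than a Unicode number. As a minimal sketch of that distinction, assuming the UTF-8 number is simply the codepoint's octets read as one big-endian integer (my reading of the table notes, not a statement about the module's internals), the hypothetical helpers `utf8_number` and `unicode_number_3octet` below compute both values with core Perl.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of UTF8::R2: read the octets of one UTF-8
# encoded codepoint as a single big-endian integer.
sub utf8_number {
    my ($octets) = @_;
    my $number = 0;
    $number = ($number << 8) | $_ for unpack 'C*', $octets;
    return $number;
}

# Hypothetical helper for contrast: decode a 3-octet UTF-8 sequence by hand
# into its Unicode number (payload bits only).
sub unicode_number_3octet {
    my ($octets) = @_;
    my ($b1, $b2, $b3) = unpack 'C3', $octets;
    return (($b1 & 0x0F) << 12) | (($b2 & 0x3F) << 6) | ($b3 & 0x3F);
}

my $hiragana_a = "\xE3\x81\x82";   # UTF-8 octets of U+3042

printf "UTF-8 number:       %X\n", utf8_number($hiragana_a);            # E38182
printf "Unicode number:     %X\n", unicode_number_3octet($hiragana_a);  # 3042
printf "built-in ord (octet): %X\n", ord $hiragana_a;                   # E3
```

The built-in ord() sees only the first octet, which is the octet semantics the left column of the table refers to.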

UTF8 Flag Considered Harmful, and Our Goals

See p. 401, Chapter 15: Unicode, of Programming Perl, Third Edition (ISBN 0-596-00027-8).

Before the introduction of Unicode support in perl, the eq operator just compared the byte-strings represented by two scalars. Beginning with perl 5.8, eq compares two byte-strings with simultaneous consideration of the UTF8 flag.

-- we have been taught so for a long time.

Perl is a powerful language for everyone, but the UTF8 flag is a barrier for ordinary beginners, because a person can do only one task at a time. So calling Encode::encode() and Encode::decode() in an application program is not the better way. Making two separate scripts, one for information processing and one for encoding conversion, may be better. Please trust me.

/*
 * You are not expected to understand this.
 */

 Information processing model beginning with perl 5.8

   +----------------------+---------------------+
   |     Text strings     |                     |
   +----------+-----------|    Binary strings   |
   |  UTF-8   |  Latin-1  |                     |
   +----------+-----------+---------------------+
   | UTF8     |            Not UTF8             |
   | Flagged  |            Flagged              |
   +--------------------------------------------+
   http://perl-users.jp/articles/advent-calendar/2010/casual/4

 The confusion in Perl's string model comes from the double meaning of
 "Binary string."
 "Binary string" can mean
 1. Non-Text string
 2. Digital octet string

 Let's draw it again using those terms.

   +----------------------+---------------------+
   |     Text strings     |                     |
   +----------+-----------|   Non-Text strings  |
   |  UTF-8   |  Latin-1  |                     |
   +----------+-----------+---------------------+
   | UTF8     |            Not UTF8             |
   | Flagged  |            Flagged              |
   +--------------------------------------------+
   |            Digital octet string            |
   +--------------------------------------------+
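The flagged and not-flagged halves of the diagrams can be observed with core Perl. This is a minimal sketch using only the core Encode module and utf8::is_utf8(); the variable names are illustrative. It shows the same character held once as a digital octet string and once as a UTF8-flagged text string, and that eq does not treat the two as equal.

```perl
use strict;
use warnings;
use Encode ();

my $octets  = "\xC3\xA9";                         # UTF-8 octets of U+00E9
my $decoded = Encode::decode('UTF-8', $octets);   # UTF8-flagged text string

printf "octets:  length %d, flagged %s\n",
    length($octets),  (utf8::is_utf8($octets)  ? 'yes' : 'no');   # 2, no
printf "decoded: length %d, flagged %s\n",
    length($decoded), (utf8::is_utf8($decoded) ? 'yes' : 'no');   # 1, yes

# eq compares the two scalars as different strings.
my $same = ($octets eq $decoded) ? 1 : 0;
print "eq says: ", ($same ? "equal" : "not equal"), "\n";          # not equal

# Encoding the text string again gives back the original octets.
my $round_trip = Encode::encode('UTF-8', $decoded);
print "round trip matches\n" if $round_trip eq $octets;
```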

There are people who don't agree with the change to the character string processing model in Perl 5.8. It is impossible to get the majority of Perl users, who hardly ever use Perl, to agree with it. To see how to solve this by returning to the original method, let's drag out page 402 of Programming Perl, 3rd ed., again.

Information processing model of UNIX/C-ism, beginning with perl3,
or with this software.

  +--------------------------------------------+
  |    Text string as Digital octet string     |
  |    Digital octet string as Text string     |
  +--------------------------------------------+
  |       Not UTF8 Flagged, No MOJIBAKE        |
  +--------------------------------------------+

In UNIX Everything is a File
- In UNIX everything is a stream of bytes
- In UNIX the filesystem is used as a universal name space

Ideally, we'd like to achieve these five goals:

  • Goal #1:

    Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on.

    This goal has been achieved because this software is additional code for perl, like the utf8 pragma. Perl should work the same as past Perl if nothing is added.

  • Goal #2:

    Old byte-oriented programs should magically start working on the new character-oriented data when appropriate.

    Not "magically." You must decide and write octet semantics or UTF-8 codepoint semantics yourself, case by case. Perhaps almost all regular expressions should have UTF-8 codepoint semantics, and everything else should have octet semantics.

  • Goal #3:

    Programs should run just as fast in the new character-oriented mode as in the old byte-oriented mode.

    It is almost possible, because UTF-8 encoding doesn't need multibyte anchoring in regular expressions.

  • Goal #4:

    Perl should remain one language, rather than forking into a byte-oriented Perl and a character-oriented Perl.

    The UTF8::R2 module keeps Perl one language and one interpreter by providing codepoint-semantics subroutines.

  • Goal #5:

    UTF8::R2 module users will be able to maintain it in Perl.

    May the UTF8::R2 be with you, always.
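The remark under Goal #3 that UTF-8 needs no multibyte anchoring can be illustrated with core Perl. This is a minimal sketch; the comparison with Shift_JIS is my own choice of contrast and is not taken from this document. In Shift_JIS the second octet of a character can equal an ASCII octet, so an unanchored search can hit the middle of a character. In UTF-8 every octet of a multi-octet codepoint is 0x80 or above, and lead octets never overlap with continuation octets, so a match for a whole codepoint always starts at a codepoint boundary.

```perl
use strict;
use warnings;

my $sjis_so = "\x83\x5C";       # Shift_JIS octets of katakana SO (U+30BD)
my $utf8_so = "\xE3\x82\xBD";   # UTF-8 octets of the same character

# Searching for an ASCII backslash (0x5C) with octet semantics:
my $sjis_hit = index($sjis_so, "\\");   # 1: false hit inside the character
my $utf8_hit = index($utf8_so, "\\");   # -1: no octet below 0x80 to hit

# Searching for a whole UTF-8 codepoint lands on a codepoint boundary.
my $text   = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86";   # U+3042 U+3044 U+3046
my $offset = index($text, "\xE3\x81\x84");             # octet offset 3

print "Shift_JIS hit at $sjis_hit, UTF-8 hit at $utf8_hit\n";
print "U+3044 found at octet offset $offset\n";
```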

Back when Programming Perl, 3rd ed. was written, the UTF8 flag had not been born and Perl was designed to make the easy jobs easy. This software provides a programming environment like the one at that time.

Perl's Motto

  Some computer scientists (the reductionists, in particular) would
 like to deny it, but people have funny-shaped minds. Mental geography
 is not linear, and cannot be mapped onto a flat surface without
 severe distortion. But for the last score years or so, computer
 reductionists have been first bowing down at the Temple of Orthogonality,
 then rising up to preach their ideas of ascetic rectitude to any who
 would listen.

  Their fervent but misguided desire was simply to squash your mind to
 fit their mindset, to smush your patterns of thought into some sort of
 Hyperdimensional Flatland. It's a joyless existence, being smushed.
 --- Learning Perl on Win32 Systems

 If you think this is a big headache, you're right. No one likes
 this situation, but Perl does the best it can with the input and
 encodings it has to deal with. If only we could reset history and
 not make so many mistakes next time.
 --- Learning Perl 6th Edition

  The most important thing for most people to know about handling
 Unicode data in Perl, however, is that if you don't ever use any
 Unicode data -- if none of your files are marked as UTF-8 and you don't
 use UTF-8 locales -- then you can happily pretend that you're back in
 Perl 5.005_03 land; the Unicode features will in no way interfere with
 your code unless you're explicitly using them. Sometimes the twin
 goals of embracing Unicode but not disturbing old-style byte-oriented
 scripts has led to compromise and confusion, but it's the Perl way to
 silently do the right thing, which is what Perl ends up doing.
 --- Advanced Perl Programming, 2nd Edition

AUTHOR

INABA Hitoshi <ina@cpan.org>

This project was originated by INABA Hitoshi.

LICENSE and COPYRIGHT

This software is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See the LICENSE file for details.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.