NAME

Class::DBI::utf8 - A Class::DBI subclass that knows about UTF-8

SYNOPSIS

This module is a Class::DBI plugin:

package Foo;
use base qw( Class::DBI );
use Class::DBI::utf8;

...
__PACKAGE__->columns( All => qw( id text other ) );

# the text column contains utf8-encoded character data
__PACKAGE__->utf8_columns(qw( text ));
...

# create an object with a nasty character.
my $foo = Foo->create({
  text => "a \x{2264} b for some a",
});

# search for utf8 chars.
Foo->search( text => "a \x{2264} b for some a" );

DESCRIPTION

Rather than have to think about things like character sets, I prefer to have my objects just Do The Right Thing. I also want utf-8 encoded byte strings in the database whenever possible. Using this subclass of Class::DBI, I can just put perl strings into the properties of an object, and the right thing will always go into the database and come out again.

For example, without Class::DBI::utf8,

MyObject->create({ id => 1, text => "\x{2264}" }); # a less-than-or-equal-to symbol

...will create a row in the database containing (probably) the utf-8 byte encoding of the less-than-or-equal-to symbol. But when trying to retrieve the object again...

my $broken = MyObject->retrieve( 1 );
my $text = $broken->text;

... $text will (probably) contain 3 characters and look nothing like a less-than-or-equal-to symbol. Likewise, you will be unable to search properly for strings containing non-ascii characters.
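The effect is easy to reproduce without a database. The following is a minimal sketch, assuming a driver that stores and returns raw bytes; it uses only the core Encode module, and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

my $text  = "\x{2264}";           # one character: less-than-or-equal-to
my $bytes = encode_utf8($text);   # what a byte-oriented driver would store

# Read back without decoding, every byte is treated as a character.
my $broken_length = length $bytes;

# Decoding the bytes recovers the original single character.
my $fixed = decode_utf8($bytes);

printf "undecoded length: %d, decoded length: %d\n",
    $broken_length, length $fixed;
```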

Creating objects with simpler non-ascii characters from the latin-1 range will lead to even stranger behaviours:

my $e_acute = "\x{e9}"; # an e-acute
MyObject->create({ text => $e_acute });

utf8::upgrade($e_acute); # still the same letter, but with a different
                         # internal representation
MyObject->create({ text => $e_acute });

This will create two rows in the database - the first containing the latin-1 encoded bytes of an e-acute character (or the database may refuse to let you create the row, if it's been set up to require utf-8), the second containing the utf-8 encoded bytes of an e-acute. In the second case you won't get an e-acute back out again if you retrieve the row; you'll get a string containing two characters, one for each byte of the utf-8 encoding.
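The two internal representations can be inspected directly. This is a small sketch that uses the bytes pragma purely for illustration, to show what a driver that reads perl's internal buffer would see; it is not something to do in production code.

```perl
use strict;
use warnings;

my $plain    = "\x{e9}";    # e-acute, held internally as one latin-1 byte
my $upgraded = $plain;
utf8::upgrade($upgraded);   # same character, held internally as utf-8

# Perl considers the two strings equal...
my $same = $plain eq $upgraded ? 1 : 0;

# ...but the underlying bytes differ.
my ( $plain_bytes, $upgraded_bytes );
{
    use bytes;              # makes length() count bytes inside this block
    $plain_bytes    = length $plain;
    $upgraded_bytes = length $upgraded;
}

print "equal: $same, internal bytes: $plain_bytes vs $upgraded_bytes\n";
```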

Because of this, if you're handling data from an outside source, you won't really have any clear idea of what will be going into the database at all.

Fortunately, simply adding the lines:

use Class::DBI::utf8;
__PACKAGE__->utf8_columns("text");

will make all these operations work much more as expected - the database will always contain utf-8 bytes, you will always get back the characters you put in, and you will instantly become the most popular person at work.
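Conceptually, the goal is to encode character strings on the way into the database and decode them on the way out. The sketch below illustrates that idea with two hypothetical helpers, to_db and from_db, which are not part of this module; the module's real internals work differently (see CAVEATS).

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 decode_utf8 );

# Hypothetical helpers, for illustration only.
sub to_db   { my ($chars) = @_; return encode_utf8($chars) }
sub from_db { my ($bytes) = @_; return decode_utf8($bytes) }

my $plain    = "\x{e9}";
my $upgraded = $plain;
utf8::upgrade($upgraded);

# Both internal representations produce the same utf-8 bytes.
my $same_bytes = to_db($plain) eq to_db($upgraded) ? 1 : 0;

# Whatever characters go in come back out unchanged.
my $original   = "a \x{2264} b for some a";
my $round_trip = from_db( to_db($original) ) eq $original ? 1 : 0;

print "same bytes: $same_bytes, round trip ok: $round_trip\n";
```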

This module assumes that the underlying database and driver don't know anything about character sets, and just store bytes. Some databases, for instance PostgreSQL and later versions of MySQL, allow you to create tables with utf-8 character sets, but the Perl DB drivers don't respect this: they still require you to pass utf-8 bytes and they return utf-8 bytes, and hence still need special handling with Class::DBI.

Class::DBI::utf8 will do the right thing in both cases, and I would suggest you tell the database to use utf-8 encoding as well as using Class::DBI::utf8 where possible.

CAVEATS

This module requires perl 5.8.0 or later - if you're still using 5.6, and you want to use unicode, I suggest you don't. It's not nice.

Be aware that utf-8 encoded strings will commonly have a byte length greater than their character length - this is because non-ascii characters such as e-acute will encode to two bytes, and other characters can be encoded to other numbers of bytes, although 2 or 3 bytes are typical. If your database has an underlying data type of a limited length, for instance a CHAR(10), you may not be able to store 10 characters in it.
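A quick way to check whether a value will fit is to compare its encoded byte length against the column size. This sketch assumes the column limit is counted in bytes, which depends on the database; the $limit value is just the CHAR(10) example from above.

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

my $limit = 10;               # assumed byte limit of a CHAR(10) column
my $text  = "\x{e9}" x 10;    # ten e-acute characters

my $chars = length $text;
my $bytes = length( encode_utf8($text) );

print "$chars characters, $bytes bytes\n";
print "too long for the column\n" if $bytes > $limit;
```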

Internally, the module is futzing with the _utf8_on and _utf8_off methods. If you don't know why doing that is probably a bad idea, you should read up on it before you start trying to do this sort of thing yourself. I'd prefer to use encode_utf8 and decode_utf8, but I have my reasons for doing it this way - mostly, it's so that we can allow for DBD drivers that do know about character sets.
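For readers unfamiliar with the difference, here is a rough sketch of flag flipping versus decoding, using the Encode module's own functions. It is not a copy of this module's code; it only shows why the flag approach needs care.

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "\xe2\x89\xa4";    # the utf-8 bytes for U+2264

# Flag flipping: no validation, the same buffer is reinterpreted in place.
my $flagged = $bytes;
Encode::_utf8_on($flagged);

# Decoding: the input is checked and a new character string is returned.
my $decoded = Encode::decode( 'UTF-8', $bytes );

# Strict decoding rejects bytes that are not valid utf-8. _utf8_on would
# accept the same bytes without complaint and leave a malformed string.
my $bad = "\xe9";
my $ok  = eval { Encode::decode( 'UTF-8', $bad, Encode::FB_CROAK() ); 1 };

printf "flagged length: %d, decoded length: %d, bad input accepted: %s\n",
    length $flagged, length $decoded, $ok ? "yes" : "no";
```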

Finally, the database may have some internal string-handling functions, for instance LOWER(), UPPER(), various sorting functions, etc. If the database is properly utf-8 aware, it may do the right thing to the utf-8 encoded strings in the database if you use these functions. But I've never seen a database do the right thing. Likewise, there are all sorts of nasty normalisation considerations when performing searches that are outside of the scope of these docs to discuss, but which can really ruin your day.
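As one small illustration of the normalisation problem, the same visible character can be represented by different code point sequences, and a plain equality search will not match across them. A sketch using the core Unicode::Normalize module:

```perl
use strict;
use warnings;
use Unicode::Normalize qw( NFC );

my $composed   = "\x{e9}";      # e-acute as a single code point
my $decomposed = "e\x{301}";    # 'e' followed by a combining acute accent

# The raw strings differ, so an exact-match search would miss one of them.
my $raw_match = $composed eq $decomposed ? 1 : 0;

# Normalising both sides to the same form makes them comparable.
my $nfc_match = NFC($composed) eq NFC($decomposed) ? 1 : 0;

print "raw match: $raw_match, normalised match: $nfc_match\n";
```

Normalising values before they are stored and before they are used as search terms is one possible approach; whether it is appropriate depends on the application.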

BUGS

I've attempted to make the module keep doing the Right Thing even when the DBD driver for the database knows what it's doing, i.e. if you give it sensible perl strings it'll store the right thing in the database and recover the right thing from the database. However, I've been forced to assume that, in this eventuality, the database driver will hand back strings that already have the utf-8 flag set. If they don't, things will break. On the bright side, they'll break really fast. I also find it extremely unlikely that anyone would bother reducing strings to latin1 internally.

Also, I've been forced to override the _do_search method to make searching for utf8 strings work, so if you override it locally as well, bad things will happen. Sorry.

Incredible popularity and fame gained through understanding of utf-8 may not actually be real.

SEE ALSO

Class::DBI

AUTHOR

Tom Insam <tinsam@fotango.com>

Copyright Fotango 2005. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.