NAME
tconv - iconv-like interface with automatic charset detection
SYNOPSIS
#include <tconv.h>
size_t tconv(tconv_t cd,
char **inbuf, size_t *inbytesleft,
char **outbuf, size_t *outbytesleft);
DESCRIPTION
tconv is like iconv, but does not require the caller to know the input charset. Existing iconv callers can switch to tconv with macros, e.g.
#define iconv_t tconv_t
#define iconv_open(tocode, fromcode) tconv_open(tocode, fromcode)
#define iconv(cd, ipp, ilp, opp, olp) tconv(cd, ipp, ilp, opp, olp)
#define iconv_close(cd) tconv_close(cd)
When calling tconv_open:
tconv_open(const char *tocode, const char *fromcode)
it is legal to pass NULL for fromcode. In this case the first chunk of input is used for charset detection, so it is recommended to supply enough bytes at the very beginning. If fromcode is not NULL, no charset detection occurs and tconv behaves like iconv(3), modulo the engine being used (see below). If tocode is NULL, it defaults to fromcode.
Testing whether the return value is equal to (size_t) -1, together with the errno value when it is, as documented for iconv, is the only reliable check. In theory the return value is the number of characters converted in a non-reversible way, and this is what happens when iconv is the engine running behind tconv. With another conversion engine, the return value depends on that engine's capabilities, or on how the corresponding plugin is implemented.
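The check described above can be sketched as follows. This is a minimal illustration, not tconv's own code: it calls plain iconv(3) so that it builds without tconv installed, on the assumption that the same pattern carries over once tconv_open, tconv and tconv_close are substituted. The helper name convert_once is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of tconv): converts one in-memory buffer
 * with a single call. Returns 0 on success, or the errno value reported
 * by the conversion call. The count of non-reversible conversions is
 * deliberately ignored: only the comparison with (size_t)-1 is treated
 * as meaningful, since the count depends on the engine. */
static int convert_once(const char *tocode, const char *fromcode,
                        const char *in, size_t inlen,
                        char *out, size_t outcap, size_t *outlen)
{
    assert(in != NULL && out != NULL && outlen != NULL);

    iconv_t cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1) {
        return errno;
    }

    char *inp = (char *)in;   /* the API takes char **, the bytes are not modified */
    char *outp = out;
    size_t inleft = inlen;
    size_t outleft = outcap;
    int rc = 0;

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        rc = errno;           /* E2BIG, EILSEQ or EINVAL, as documented for iconv */
    }
    *outlen = outcap - outleft;

    iconv_close(cd);
    return rc;
}
```

A caller would treat a return of 0 as success and inspect the returned errno value otherwise, without relying on the non-reversible count.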
When the number of bytes left in the input is 0, the return value is (size_t) -1, and errno is E2BIG, you should not rely on the *ilp position: the conversion engine may have an internal staging array that has consumed all the input bytes but is still waiting for more space to produce the output bytes. This happens, for instance:
- with the ICU convert engine
-
Regardless of whether you use the //TRANSLIT option, the ICU convert engine always does two conversions internally: one from the input encoding to UTF-16, then one from UTF-16 to the output encoding. This means that it always consumes the input bytes entirely into an internal staging area.
- with the ICONV convert engine
-
When the input and output encodings are of the same family, iconv is turned into a validation mode and does two conversions internally, like the ICU plugin. This also uses an internal staging area that always consumes all input bytes before converting them. If the input family is the same as "UTF-8", then the internal staging area is of type "UTF-32", else the internal staging area is of type "UTF-8".
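One way to cope with the E2BIG behaviour described above is to drain the output buffer and call again, treating the conversion as finished only when the call stops returning (size_t) -1, not when the input count reaches 0. The sketch below is an assumption-laden illustration: it uses plain iconv(3) so that it builds without tconv, the names convert_all and CHUNK are hypothetical, and the chunk is deliberately tiny to force E2BIG.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

enum { CHUNK = 4 };  /* hypothetical, deliberately tiny so that E2BIG occurs */

/* Hypothetical helper (not part of tconv): converts src into dst through a
 * small intermediate chunk. Returns the number of bytes written to dst, or
 * (size_t)-1 with errno set on failure. */
static size_t convert_all(iconv_t cd, const char *src, size_t srclen,
                          char *dst, size_t dstcap)
{
    assert(src != NULL && dst != NULL);

    char chunk[CHUNK];
    char *inp = (char *)src;  /* the API takes char **, the bytes are not modified */
    size_t inleft = srclen;
    size_t total = 0;

    for (;;) {
        char *outp = chunk;
        size_t outleft = sizeof chunk;
        size_t r = iconv(cd, &inp, &inleft, &outp, &outleft);
        int err = (r == (size_t)-1) ? errno : 0;
        size_t produced = sizeof chunk - outleft;

        if (produced > dstcap - total) {
            errno = ERANGE;   /* destination too small for this sketch */
            return (size_t)-1;
        }
        memcpy(dst + total, chunk, produced);
        total += produced;

        if (r != (size_t)-1) {
            return total;     /* the engine itself reports completion */
        }
        if (err != E2BIG || produced == 0) {
            errno = err;      /* real error, or a chunk too small to make progress */
            return (size_t)-1;
        }
        /* E2BIG: call again with an empty chunk, even if inleft is already 0,
         * because an engine may still hold staged bytes. */
    }
}
```

The loop never inspects inleft to decide whether it is done, which is the point of the note above: an engine with a staging area may report 0 input bytes left while output is still pending.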
ENGINES
tconv supports two engine types: one for charset detection, one for character conversion. Please refer to the tconv_open_ext documentation for technical details. Engines, whatever their type, are supposed to have three entry points: new, run and free. They can be:
- external
-
The application already has the new, run and free entry points.
- plugin
-
The application gives the path of a shared library, and tconv will look for the entry points in it.
- built-in
-
Python's cchardet charset detection engine, bundled with tconv, is always available. If tconv is compiled with ICU support, then ICU charset and conversion engines will be available. If tconv is compiled with ICONV support, then ICONV conversion engine will be available.
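The three-entry-point shape described above can be pictured with a small sketch. Everything here is hypothetical: the struct, the function names and the signatures are invented for illustration and are not tconv's actual interface, which is described in the tconv_open_ext documentation. The toy engine simply copies bytes through.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical engine interface: one function pointer per entry point. */
typedef struct engine {
    void  *(*new_fn)(const char *tocode, const char *fromcode);
    size_t (*run_fn)(void *ctx, char **inbuf, size_t *inleft,
                     char **outbuf, size_t *outleft);
    void   (*free_fn)(void *ctx);
} engine_t;

/* Per-conversion state of the toy engine. */
typedef struct identity_ctx {
    size_t copied;  /* total bytes passed through so far */
} identity_ctx_t;

static void *identity_new(const char *tocode, const char *fromcode)
{
    (void)tocode;
    (void)fromcode;
    return calloc(1, sizeof(identity_ctx_t));
}

/* Copies as many bytes as fit; reports E2BIG iconv-style when input remains. */
static size_t identity_run(void *ctx, char **inbuf, size_t *inleft,
                           char **outbuf, size_t *outleft)
{
    assert(ctx != NULL);
    identity_ctx_t *c = ctx;
    size_t n = (*inleft < *outleft) ? *inleft : *outleft;

    memcpy(*outbuf, *inbuf, n);
    *inbuf += n;
    *inleft -= n;
    *outbuf += n;
    *outleft -= n;
    c->copied += n;

    if (*inleft > 0) {
        errno = E2BIG;
        return (size_t)-1;
    }
    return 0;
}

static void identity_free(void *ctx)
{
    free(ctx);
}

/* An "external" engine in the sense above: the application already
 * holds the three entry points. */
static const engine_t identity_engine = {
    identity_new, identity_run, identity_free
};
```

A plugin engine would presumably expose the same three functions from a shared library rather than from the application itself.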
DEFAULTS
- charset detection
-
The default charset detection engine is cchardet, bundled statically with tconv.
- character conversion
-
The default character conversion engine is ICU, if tconv has been compiled with ICU support, else ICONV if compiled with ICONV support, else none.
NOTES
- Windows platform
-
On Windows, an ICONV-like conversion engine is always available, via the win-iconv package, bundled with tconv.
- iconv compliance
-
- semantics
-
tconv() only guarantees that its plug-ins support the //TRANSLIT and //IGNORE iconv notation.
- output
-
It is guaranteed that tconv() will behave exactly like iconv() if the character conversion engine is ICONV on a UNIX platform, since in this case tconv() calls iconv() internally. In any other case, the plug-ins have a best-effort policy to behave like iconv.
- POSIX compliance
-
By POSIX compliance, we mean that, when the output buffer is too small, iconv should stop updating the input and output pointers before the limit is reached. When the character conversion engine is ICONV on a UNIX platform, the behaviour is that of the platform's iconv. In any other case, the plug-ins guarantee at least that the input and output pointers are left in a state such that a subsequent call will correctly continue the conversion.
- Iconv plugin
-
The Iconv plugin is a direct interface to the iconv library available when tconv is compiled. However, iconv implementations are not uniform, so tconv applies the following:
- //TRANSLIT support
-
If the native iconv does not support //TRANSLIT, this option is silently removed. Only the uppercase //TRANSLIT is checked.
- //IGNORE support
-
If the native iconv does not support //IGNORE, this option is silently removed. Only the uppercase //IGNORE is checked.
- Identical charsets
-
If the charsets appear to be identical (case-insensitive comparison), tconv trusts the user and no call to iconv happens: data is transferred as-is.
Because of this non-uniform implementation of iconv, it is recommended to have at least ICU available when you build tconv.
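The option removal and the identical-charset comparison described above can be pictured with two small helpers. This is only a sketch under stated assumptions: strip_option and same_charset are hypothetical names, not tconv functions, and the order in which tconv applies these steps is not specified here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: removes every occurrence of an option such as
 * "//TRANSLIT" from a charset specification, in place. The match is
 * case-sensitive, mirroring "only the uppercase form is checked". */
static void strip_option(char *spec, const char *option)
{
    assert(spec != NULL && option != NULL && option[0] != '\0');

    size_t optlen = strlen(option);
    char *hit;

    while ((hit = strstr(spec, option)) != NULL) {
        /* Shift the tail, including the terminating NUL, over the option. */
        memmove(hit, hit + optlen, strlen(hit + optlen) + 1);
    }
}

/* Hypothetical helper: case-insensitive equality of two charset names.
 * Returns 1 when they match, 0 otherwise. */
static int same_charset(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b)) {
            return 0;
        }
        a++;
        b++;
    }
    return *a == *b;  /* both strings must end at the same position */
}
```

For example, stripping "//TRANSLIT" from "ASCII//TRANSLIT//IGNORE" leaves "ASCII//IGNORE", while a lowercase "//translit" is left untouched.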