TITLE
sh2sh - convert Shoebox data to Unicode
SYNOPSIS
sh2sh -s settings_dir [-c codepage] [-e encs] infile [outfile]
Converts a Shoebox database to a new Shoebox database, converting the data to Unicode as it goes.
OPTIONS
-b Delete empty fields
-c codepage Set default codepage conversion, otherwise none
-e enc,enc Add Encode:: subsets in Perl 5.8.1
-f type Force database type
-n normalform Normalize Unicode text to form D, C, KD or KC
-s dir Directory to find .typ files in [.]
-t type Generate Toolbox database of given type
If outfile is missing, it is created as the input file with extension replaced by .db1. This allows a user to drop a data file on a shortcut.
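The default output name can be sketched in Python as follows. This is only an illustration of the rule stated above, not the code sh2sh itself uses; the function name `default_outfile` and the sample file names are hypothetical, and the behaviour for an input file with no extension (here, `.db1` is simply appended) is an assumption.

```python
import os


def default_outfile(infile: str) -> str:
    """Return the input path with its extension replaced by .db1."""
    # splitext separates "lexicon.txt" into ("lexicon", ".txt")
    root, _ext = os.path.splitext(infile)
    # Assumption: a file with no extension just gains ".db1"
    return root + ".db1"


if __name__ == "__main__":
    print(default_outfile("lexicon.txt"))
```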
DESCRIPTION
sh2sh converts a Shoebox (or Toolbox) database to Unicode. In particular, it:
- Converts strings according to which field they are in and the corresponding language.
- Lays out interlinear text so that it remains interlinear text when the underlying strings have changed length.
Using sh2sh involves two aspects: preparing for conversion by supplying information about encoding conversion, and running the program, which means knowing what each command line option does.
Running sh2sh
Here we list the command line options and give further details on each.
- -b
-
Any empty fields in the input file will be deleted.
- -c
-
Specifies the default codepage to be used when converting data. In effect it specifies that sh2sh should act as though it were running on a system with the given default codepage. This means that data in languages with no given encoding conversion will be converted using this codepage.
- -e
-
Perl has internal support for a large number of industry standard encodings. This option specifies which sets to pull in apart from the default set. Values include
Byte - standard ISO 8859 type single byte encodings
CN - Continental China encodings including cp 936, GB 12345 and GB 2312
JP - Japanese encodings including cp 932 and ISO 2022
KR - Korean encodings including cp 949
TW - Taiwanese encodings including cp 950
HanExtra - more Chinese encodings including GB 18030
JIS2K - more Japanese encodings
Ebcdic - surely not!
Symbols - various symbol encodings
See man Encode::Supported or the corresponding module documentation for details of what is supported on your Perl installation.
- -f
-
Rather than analysing the data in the file using the database type specified in the database, it is possible to specify that a different one should be used.
- -n
-
Particularly for Roman script languages that use letters with diacritics, there are two ways in which such letters can be stored. They can be stored as a single code (if one exists in Unicode), in which case the form to ask for is C (composed), or they can be stored using separate codes for base and diacritic, in which case the normal form is D (decomposed). There are other normal forms, which should only be used if you really know what you are doing (and then you will know why they shouldn't be used).
- -s
-
sh2sh requires access to information about the structure of the database and language information. This is held in files in the same directory as the .prj project file used when running Shoebox/Toolbox.
- -t
-
Gives the name of the database type to assign to the output file. Since the encoding has changed, the old database type is no longer appropriate for the output data. Use this option when a new database type has already been created that refers to the appropriate Unicode-based languages. To include the old database type name as part of the new name, use %T: all occurrences of the string %T in the -t option will be replaced with the old database type name.
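Two of the behaviours described above, the C and D normal forms selected by -n and the %T substitution performed by -t, can be illustrated with a short Python sketch. It uses Python's standard `unicodedata` module rather than the Perl code in sh2sh, and the function names and the type name `Lexicon` are hypothetical.

```python
import unicodedata


def normalize(text: str, form: str) -> str:
    """Normalize text to form C, D, KC or KD (the values -n accepts)."""
    return unicodedata.normalize("NF" + form, text)


def expand_type_name(template: str, old_type: str) -> str:
    """Replace every %T in the -t value with the old database type name."""
    return template.replace("%T", old_type)


if __name__ == "__main__":
    # "e" followed by a combining acute accent becomes one code in form C
    composed = normalize("e\u0301", "C")
    # the single code U+00E9 becomes base letter plus diacritic in form D
    decomposed = normalize("\u00e9", "D")
    print(len(composed), len(decomposed))

    # "Lexicon" is a made-up old database type name
    print(expand_type_name("%T-Unicode", "Lexicon"))
```

The composed string holds one code and the decomposed string holds two, which is the difference between the C and D forms described under -n.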
Preparing for Conversion
The basic need is to be able to specify how to convert text in a particular language into Unicode. This can be done by specifying a conversion mapping in each language file. Shoebox and Toolbox do not have a UI for specifying such conversion information, so we add information to the options/description field. The codepage specification takes the form:
\codepage = value
The specification needs to be on a line on its own. The value can take a number of forms.
- name
-
A mapping name either from the set of names supported by the Perl Encode module, or specified in an SIL Converters repository.
- filename.tec
-
The path and filename of a TECkit binary mapping file. The path is relative to the settings directory.
- none
-
No mapping should be done. The data is assumed to be in UTF-8 encoding.
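The three forms of value can be told apart with a small classifier, sketched here in Python. This is not how sh2sh parses the field; the exact whitespace rules, the case-sensitive match on `none`, and the names `classify_codepage` and `CODEPAGE_LINE` are all assumptions made for illustration, as are the sample values.

```python
import re

# Assumption: optional whitespace is allowed around "=" and at line end
CODEPAGE_LINE = re.compile(r"^\\codepage\s*=\s*(\S.*?)\s*$")


def classify_codepage(line: str):
    """Return (kind, value) for a \\codepage line, or None for other lines."""
    match = CODEPAGE_LINE.match(line)
    if match is None:
        return None
    value = match.group(1)
    if value == "none":
        # no mapping; the data is assumed to be UTF-8 already
        return ("none", None)
    if value.lower().endswith(".tec"):
        # TECkit binary mapping file, relative to the settings directory
        return ("teckit", value)
    # otherwise treat the value as a mapping name
    return ("name", value)


if __name__ == "__main__":
    for sample in (r"\codepage = cp1252",
                   r"\codepage = maps/mylang.tec",
                   r"\codepage = none",
                   "some other description text"):
        print(classify_codepage(sample))
```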