Integrator Unicode Data Support
Integrator includes support for Unicode characters so that it can process text written in non-European languages such as Chinese, Thai, Hebrew, and Japanese. Earlier versions of Integrator, prior to 2.2(20), assumed that data was written in the standard ISO-8859-1 character set, commonly referred to as Latin1. This is a 256-character set: the first 128 characters match the ASCII character set, and the last 128 define accented characters common in European languages. It is also commonly, although not strictly accurately, referred to as the ANSI code page.
Unicode provides support for all characters in known human languages, defining more than 128,000 characters. These values can no longer be stored as one byte per character, so each character is encoded using multiple bytes. The two most popular encoding methods are UTF-8 and UCS-2: UTF-8 is commonly used on UNIX systems and for Internet data; UCS-2 is the common encoding for Microsoft applications. Integrator can read or write double-byte character data using either the UTF-8 or UCS-2 encoding; internally, it processes data as UTF-8. Integrator also supports handling and writing out Unicode supplementary characters (4-byte UTF-8 characters).
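For example, the character é (e with an acute accent, Unicode U+00E9) occupies one byte in Latin1 but two bytes in either Unicode encoding:
- Latin1: E9
- UTF-8: C3 A9
- UCS-2: 00 E9 (big-endian) or E9 00 (little-endian)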
To maintain compatibility with older scripts that process Latin1 data, Integrator defaults to treating input and output data as Latin1 characters, so a script written to process Latin1 data continues to work without changes. To handle Unicode data, input objects and output objects have an encoding attribute that defines the encoding used to read or write data, as follows (see the example after this list):
- If this attribute has the value "ascii" or "latin1", the data is treated as Latin1 characters.
- If the value is a Unicode encoding ("unicode" for UCS-2 data, "utf-8" for UTF-8 data), the data is processed as Unicode characters.
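For example, the following fragments (the object names and filenames are hypothetical) show one input object that reads Latin1 data and one that reads UCS-2 Unicode data; output objects accept the same attribute:

object 'INPT' "latin_in" {
    input_type = "filein",
    filename = "latin1_data.txt",
    encoding = "latin1",
};
object 'INPT' "unicode_in" {
    input_type = "filein",
    filename = "ucs2_data.txt",
    encoding = "unicode",
};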
To avoid having to set the encoding for every input or output object in a task, Integrator detects whether any object in a task processes Unicode characters or whether the Integrator script itself is declared to be written as UTF-8 (that is, charset 1208 is set). If so, the default encoding for that task becomes UTF-8, and any object that does not define an encoding assumes the data is UTF-8. With this approach, an Integrator developer can set the encoding in the output object to be UTF-8, and all input objects for that task will default to reading UTF-8 data. This default can be overridden by setting an encoding attribute in the individual input objects.
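As a minimal sketch (hypothetical object and file names), declaring UTF-8 only on the output object is enough; the input object, which declares no encoding, defaults to reading UTF-8 data for this task:

object 'INPT' "in" {
    input_type = "filein",
    filename = "data.txt",
};
object 'OUTP' "out" {
    output_type = "fileout",
    input = "in",
    filename = "data_utf8.txt",
    encoding = "utf-8",
};

Adding encoding = "latin1" to the input object would override the UTF-8 default for that object only.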
Some objects, such as the List input object, read and process data contained in the Integrator script itself, and column names are frequently defined and referenced in various process objects. If a task processes Unicode characters, all strings contained in the object language, including filenames, should be entered as UTF-8. To force an object language file (script) to be interpreted as UTF-8, define the character set of the file as 1208, which is the IBM code page number for UTF-8. The character set is defined using the charset header, which appears immediately after the version header. For example:
version "1"; charset 1208; object 'TSKL' "main" {
In Visual Integrator, this is set under "Main". This ensures that the strings defined in the object language file are interpreted as UTF-8 Unicode characters. Any object in that language file defaults to a Unicode encoding of its strings, and the default encoding "auto" for that object means "UTF-8".
A Unicode license is not needed to create Unicode-encoded Models, but a license is required for DiveLine or Diver to display these characters properly. Without a license, garbage characters are displayed. A workaround is to save the input text file with ANSI encoding and build; the Model keeps the accents in the data and displays correctly in domestic Diver.
If you are using the Unicode versions of DiveLine/ProDiver, we recommend that you start encoding your scripts in UTF-8, and declare them as such by using the "charset 1208;" directive. Any Latin1 (ANSI) inputs will have to be declared in their respective input objects by setting the encoding attribute to "latin1".
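For example, a script declared as UTF-8 that still reads a Latin1 (ANSI) extract might contain the following (the object and file names are hypothetical, and the task and output objects are omitted here):

version "1";
charset 1208;
object 'INPT' "legacy_in" {
    input_type = "filein",
    filename = "legacy_ansi.txt",
    encoding = "latin1",
};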
The Unicode standard defines a Byte Order Mark (BOM) that identifies a file as containing Unicode data. This is a sequence of bytes at the beginning of the file that is not processed as data. It is commonly referred to as a "Unicode signature", and it is also used to distinguish the byte order of a UCS-2 file. By default, the filein and ftp input objects automatically detect the presence of a Unicode signature and process the data accordingly.
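The signature byte sequences, as they appear at the start of the file, are:
- UTF-8: EF BB BF
- UCS-2, little-endian: FF FE
- UCS-2, big-endian: FE FF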
When inputting UTF-8, you have two choices:
- Use a Unicode Builder/DiveLine/ProDiver, so the Model you create and use will contain UTF-8.
- Transcode the UTF-8 to a single-byte encoding (that is, Latin1) and use the standard Builder/DiveLine/ProDiver.
- Use the second option only if every Unicode character in your data has a single-byte equivalent, for example, if you are incorporating Swedish characters but not Chinese characters (a sketch follows this list).
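The following sketch of the second option (with hypothetical file names) is modeled on the conversion script shown later in this section, but it reads UTF-8 data and writes it back out as Latin1. Any character without a Latin1 equivalent does not survive the conversion (see the note about question marks at the end of this section).

version "1";
object 'TSKL' "Main" {
    { "main" }
};
object 'TASK' "main" {
    inputs = { "in" },
    output = "out"
};
object 'INPT' "in" {
    input_type = "filein",
    file_type = "column_headers",
    filename = "utf8_input.txt",
    encoding = "utf-8",
};
object 'OUTP' "out" {
    output_type = "fileout",
    input = "in",
    filename = "latin1_output.txt",
    file_type = "column_headers",
    encoding = "latin1",
};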
NOTE:
- Do not use non-Unicode DI clients with a Unicode DiveLine server, or vice versa.
- Non-Unicode versions of DI clients such as ProDiver, NetDiver, DivePort, and DIAL can properly handle accented and other high-bit characters in cBases, both when such characters appear in the data and when they are used as column names and labels.
The following Integrator script converts data from a standard tab-delimited UCS-2 Unicode file to one that is encoded in UTF-8.
version "1"; object 'TSKL' "Main" { { "main" } }; object 'TASK' "main" { inputs = { "in" }, output = "out" }; object 'INPT' "in" { input_type = "Filein", file_type = "column_headers filename = "input.txt", encoding = "unicode", }; object 'OUTP' "out" { output_type = "fileout", input = "in", filename = "output.txt", file_type = "column_headers", encoding = "utf-8", };
NOTE: Integrator objects and their attributes must be in English. Only string values (enclosed in double quotes) can accept an extended character set.
If the encoding is set incorrectly, the output file contains question marks (????) in place of unrecognized characters.
See also: Working with Unicode.