Integrator Unicode Data Support
Integrator includes support for Unicode characters so that it can process text written in non-European languages such as Chinese, Thai, Hebrew, and Japanese. Earlier versions of Integrator, prior to 2.2(20), assumed that data was written in the standard ISO-8859-1 character set, commonly referred to as Latin1. This is a 256-character set: the first 128 characters match the ASCII character set, and the last 128 define accented characters common in European languages. It is also commonly, although not strictly accurately, referred to as the ANSI code page.
Unicode provides support for all characters in known human languages, defining more than 128,000 characters. These values can no longer be stored as one byte per character, so each character is encoded using multiple bytes. The two most popular encoding methods are UTF-8 and UCS-2: UTF-8 is commonly used on UNIX systems and for Internet data; UCS-2 is the common encoding for Microsoft applications. Integrator can read or write double-byte character data using either the UTF-8 or UCS-2 encoding; internally, it processes data as UTF-8. Integrator also supports handling and writing out Unicode supplementary characters (4-byte UTF-8 characters).
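For example, the character é (e with an acute accent, Unicode U+00E9) occupies one byte in Latin1 but two bytes in either Unicode encoding:
- Latin1: E9
- UTF-8: C3 A9
- UCS-2: 00 E9 (big-endian) or E9 00 (little-endian)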
To maintain compatibility with older scripts that process Latin1 data, Integrator defaults to treating input and output data as Latin1 characters, so a script written to process Latin1 data continues to work without changes. To handle Unicode data, input objects and output objects have an encoding attribute that defines the encoding used to read or write data, as follows (see the example after this list):
- If this attribute has the value "ascii" or "latin1", the data is treated as Latin1 characters.
- If the value is a Unicode encoding ("unicode" for UCS-2 data, "utf-8" for UTF-8 data), the data is processed as Unicode characters.
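For example, the following fragments (the object names and filenames are hypothetical) show one input object that reads Latin1 data and one that reads UCS-2 Unicode data; output objects accept the same attribute:

object 'INPT' "latin_in" {
    input_type = "filein",
    filename = "latin1_data.txt",
    encoding = "latin1",
};
object 'INPT' "unicode_in" {
    input_type = "filein",
    filename = "ucs2_data.txt",
    encoding = "unicode",
};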
To avoid having to set the encoding for every input or output object in a task, Integrator detects whether any object in a task processes Unicode characters or whether the Integrator script itself is declared to be written as UTF-8 (that is, charset 1208 is set). If so, the default encoding for that task becomes UTF-8, and any object that does not define an encoding assumes the data is UTF-8. With this approach, an Integrator developer can set the encoding in the output object to be UTF-8, and all input objects for that task will default to reading UTF-8 data. This default can be overridden by setting an encoding attribute in the individual input objects.
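As a minimal sketch (hypothetical object and file names), declaring UTF-8 only on the output object is enough; the input object, which declares no encoding, defaults to reading UTF-8 data for this task:

object 'INPT' "in" {
    input_type = "filein",
    filename = "data.txt",
};
object 'OUTP' "out" {
    output_type = "fileout",
    input = "in",
    filename = "data_utf8.txt",
    encoding = "utf-8",
};

Adding encoding = "latin1" to the input object would override the UTF-8 default for that object only.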
Some objects, such as the List input object, read and process data contained in the Integrator script itself, and column names are frequently defined and referenced in various process objects. If a task processes Unicode characters, all strings contained in the object language, including filenames, should be entered as UTF-8. To force an object language file (script) to be interpreted as UTF-8, define the character set of the file as 1208, which is the IBM code page number for UTF-8. The character set is defined using the charset header, which appears immediately after the version header. For example:
version "1"; charset 1208; object 'TSKL' "main" {
In Visual Integrator, this is set under "Main". This ensures that the strings defined in the object language file are interpreted as UTF-8 Unicode characters. Any object in that language file defaults to a Unicode encoding of its strings, and the default encoding "auto" for that object means "UTF-8".
A Unicode license is not needed to create Unicode-encoded Models, but a license is required for DiveLine or Diver to display these characters properly. Without a license, garbage characters are displayed. A workaround is to save the input text file with ANSI encoding and build; the Model keeps the accents in the data and displays correctly in domestic Diver.
If you are using the Unicode versions of DiveLine/ProDiver, we recommend that you start encoding your scripts in UTF-8, and declare them as such by using the "charset 1208;" directive. Any Latin1 (ANSI) inputs will have to be declared in their respective input objects by setting the encoding attribute to "latin1".
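For example, a script declared as UTF-8 that still reads a Latin1 (ANSI) extract might contain the following (the object and file names are hypothetical, and the task and output objects are omitted here):

version "1";
charset 1208;
object 'INPT' "legacy_in" {
    input_type = "filein",
    filename = "legacy_ansi.txt",
    encoding = "latin1",
};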
The Unicode standard defines a Byte Order Mark (BOM) that identifies a file as containing Unicode data. This is a sequence of bytes at the beginning of the file that is not processed as data. It is commonly referred to as a "Unicode signature", and it is also used to distinguish the byte order of a UCS-2 file. By default, the filein and ftp input objects automatically detect the presence of a Unicode signature and process the data accordingly.
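The signature byte sequences, as they appear at the start of the file, are:
- UTF-8: EF BB BF
- UCS-2, little-endian: FF FE
- UCS-2, big-endian: FE FF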
When inputting UTF-8, you have two choices:
- Use a Unicode Builder/DiveLine/ProDiver, so the Model you create and use will contain UTF-8.
- Transcode the UTF-8 to a single-byte encoding (that is, Latin1) and use the standard Builder/DiveLine/ProDiver.
- Use the second option only if every Unicode character in your data has a single-byte equivalent, for example, if you are incorporating Swedish characters but not Chinese characters (a sketch follows this list).
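The following sketch of the second option (with hypothetical file names) is modeled on the conversion script shown later in this section, but it reads UTF-8 data and writes it back out as Latin1. Any character without a Latin1 equivalent does not survive the conversion (see the note about question marks at the end of this section).

version "1";
object 'TSKL' "Main" {
    { "main" }
};
object 'TASK' "main" {
    inputs = { "in" },
    output = "out"
};
object 'INPT' "in" {
    input_type = "filein",
    file_type = "column_headers",
    filename = "utf8_input.txt",
    encoding = "utf-8",
};
object 'OUTP' "out" {
    output_type = "fileout",
    input = "in",
    filename = "latin1_output.txt",
    file_type = "column_headers",
    encoding = "latin1",
};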
NOTE:
- Do not use non-Unicode DI clients with a Unicode DiveLine server, or vice versa.
- Non-Unicode versions of DI clients such as ProDiver, NetDiver, DivePort, and DIAL can properly handle accented and other high-bit characters in cBases, both when such characters appear in the data and when they are used as column names and labels.
The following Integrator script converts data from a standard tab-delimited UCS-2 Unicode file to one that is encoded in UTF-8.
version "1"; object 'TSKL' "Main" { { "main" } }; object 'TASK' "main" { inputs = { "in" }, output = "out" }; object 'INPT' "in" { input_type = "Filein", file_type = "column_headers filename = "input.txt", encoding = "unicode", }; object 'OUTP' "out" { output_type = "fileout", input = "in", filename = "output.txt", file_type = "column_headers", encoding = "utf-8", };
NOTE: Integrator objects and their attributes must be in English. Only string values (enclosed in double quotes) can accept an extended character set.
If the encoding is set incorrectly, the output file contains question marks (????) in place of unrecognized characters.
See also: Working with Unicode.