XML Input Object

The Integrator XML input object allows Integrator to extract tabular data from XML (Extensible Markup Language) documents. An XML document describes a tree of elements and attributes. Since Integrator processes tabular data, the XML input object extracts data from this tree in the form of rows and columns. XML is defined by the World Wide Web Consortium (W3C) through the recommendation at http://www.w3.org/TR/xml.

Integrator uses the XPath query language to identify a set of nodes that will become rows, and uses XPath expressions to identify the column values relative to these nodes. This technique allows a large number of different XML documents to be accessed and converted into tabular format. XPath is also defined as a W3C recommendation: http://www.w3.org/TR/xpath20.

Although XPath is a fairly powerful expression language, the simplest expressions are usually enough to extract most data from an XML document, and the abbreviated syntax is sufficient for most purposes.

The XML input object reads the entire XML file into memory in order to search and access the nodes quickly. When processing large XML input files, you may want to stage them into tab-delimited files before continuing with other processing.

XML Attributes

Attribute Type Description
input_type (required) String Identifies the object as an XML input object. The value of this string is "XML".
input_type = `xml`,
filename String Defines the file name for the XML input file.
filename = `..\\DI_Projects\\xml\\invoice.xml`,
match String

Defines an XPath expression to match the nodes that will be returned as rows of input. Each matching node will result in a row being returned for that node. The column values are defined using the columns attribute described below.

The match string can be thought of as specifying a hierarchy of elements reaching down to the element that should be represented as a row in the output flow. For example, if you have an XML document with elements "Company", "Department" and "Employee", you can request a table of Employees by specifying an XPath expression as follows:

match = `/Company/Department/Employee`,

You can also request a table of Employees by skipping levels and directly specifying the desired element as follows:

match = `//Employee`,
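
The difference between the two match forms can be illustrated outside Integrator. The following is a minimal sketch in Python using the standard library's xml.etree.ElementTree, not Integrator's own implementation; the sample document and its contents are invented for illustration. ElementTree evaluates paths relative to the root element, so /Company/Department/Employee is written as ./Department/Employee here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = """
<Company>
  <Department>
    <Employee><Last_Name>Smith</Last_Name></Employee>
    <Employee><Last_Name>Jones</Last_Name></Employee>
  </Department>
  <Department>
    <Employee><Last_Name>Lee</Last_Name></Employee>
  </Department>
</Company>
"""

root = ET.fromstring(SAMPLE)  # root is the <Company> element

# Equivalent of /Company/Department/Employee: spell out each level.
explicit = root.findall("./Department/Employee")

# Equivalent of //Employee: match Employee elements at any depth.
anywhere = root.findall(".//Employee")

# Each matched node corresponds to one row of input.
explicit_names = [e.findtext("Last_Name") for e in explicit]
anywhere_names = [e.findtext("Last_Name") for e in anywhere]
```

In this sample both forms select the same three nodes. They would differ if Employee elements also appeared elsewhere in the tree, since the second form matches them at any depth.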

If the document you are reading defines a default namespace, you cannot refer to elements using just a simple reference such as /Company or /Employee. A prefix is required. See XML Namespaces for more information.
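
The effect of a default namespace can be seen in a small sketch, again using Python's xml.etree.ElementTree rather than Integrator itself. The namespace URI and the prefix "c" are invented for illustration; how Integrator binds prefixes is covered in XML Namespaces.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that declares a default namespace.
SAMPLE = """
<Company xmlns="http://example.com/company">
  <Department>
    <Employee>Smith</Employee>
  </Department>
</Company>
"""

root = ET.fromstring(SAMPLE)

# An unprefixed name looks for elements in no namespace, so nothing matches.
unprefixed = root.findall(".//Employee")

# Binding a prefix (here "c", an arbitrary choice) to the namespace URI
# lets the expression name the namespaced elements.
ns = {"c": "http://example.com/company"}
prefixed = root.findall(".//c:Employee", ns)
```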

columns Array of Strings or Arrays

Defines a set of XPath expressions and names to be used as column values and column names. Each element in this array is either a string (which defines an XPath expression), or a 2-element array defining an XPath expression and the corresponding column name. If only a string is specified, the column name is taken from the last identifier contained in the XPath expression.

For example, if the XPath expression is:

"./Last_Name"

the column name will be:

"Last_Name"

If the XPath expression is:

"../../@Company"

the column name will be:

"Company"

In many cases, the final identifier in an expression should be unique and descriptive enough to act as a column name. In the case of duplicate element names in the schema, or complex XPath expressions, the column name can be specified. If the XPath expression does not end in an identifier (for example, "." or ".."), the column name must be specified.
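
The default naming rule can be sketched as a small function. This is an illustration in Python of the rule as described above, not Integrator's actual code; the function name default_column_name is hypothetical, and it handles only simple path expressions.

```python
import re
from typing import Optional


def default_column_name(xpath: str) -> Optional[str]:
    """Return the last identifier in a simple XPath, or None if there is none."""
    last_step = xpath.strip().split("/")[-1]  # final step of the path
    last_step = last_step.lstrip("@")         # attribute steps drop the "@"
    # "." and ".." (or an empty step) are not identifiers.
    if re.fullmatch(r"[A-Za-z_][\w.\-]*", last_step):
        return last_step
    return None
```

A result of None corresponds to the case where the column name must be specified explicitly.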

The XPath expression for the column is evaluated relative to the node matched by the match query that defines rows. If a column XPath expression matches multiple nodes, only the first one is used as the column value.

The abbreviated XPath syntax is very similar to file system relative paths. Using the Employee example above, the following XPath expressions will refer to elements relative to the Employee node:

"./Last_Name"

refers to a <Last_Name> element under the <Employee> node.

"../Department_Name"

refers to a <Department_Name> element under the <Department> node.

"./@active"

refers to an attribute named "active" on the <Employee> element.

An example of a columns array would be:

columns = { "./Last_Name",
            "../Department_Name",
            "@active",
            { "../../Name", "Company Name" }
          },
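
To show how the columns array above turns matched nodes into rows, here is a rough Python sketch using xml.etree.ElementTree. It is an approximation under stated assumptions, not Integrator's implementation: the sample document is invented, the helper column_value is hypothetical, and it supports only leading ../ steps followed by a child path or an @attribute. ElementTree elements do not know their parents, so the sketch builds a parent map first.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
SAMPLE = """
<Company>
  <Name>Acme</Name>
  <Department>
    <Department_Name>Sales</Department_Name>
    <Employee active="yes"><Last_Name>Smith</Last_Name></Employee>
    <Employee active="no"><Last_Name>Jones</Last_Name></Employee>
  </Department>
</Company>
"""

root = ET.fromstring(SAMPLE)
# ElementTree has no parent pointers, so record each element's parent.
parents = {child: parent for parent in root.iter() for child in parent}


def column_value(node, xpath):
    """Evaluate a simple relative path against a row node (first match only)."""
    steps = [s for s in xpath.split("/") if s != "."]
    while steps and steps[0] == "..":        # climb one level per ".." step
        node = parents[node]
        steps.pop(0)
    if not steps:                            # path was "." or only ".." steps
        return node.text
    if steps[-1].startswith("@"):            # final step names an attribute
        owner = node.find("/".join(steps[:-1])) if len(steps) > 1 else node
        return None if owner is None else owner.get(steps[-1][1:])
    return node.findtext("/".join(steps))    # text of the first matching child


# (XPath expression, column name) pairs mirroring the columns array above.
columns = [("./Last_Name", "Last_Name"), ("../Department_Name", "Department_Name"),
           ("@active", "active"), ("../../Name", "Company Name")]

# One row per node matched by the equivalent of match = `//Employee`.
rows = [{name: column_value(emp, path) for path, name in columns}
        for emp in root.findall(".//Employee")]
```

Each Employee node yields one row, and values found on the parent Department or Company nodes repeat on every row beneath them.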

warn_on_empty Boolean

When "true", Integrator displays a warning if no nodes are matched by the match XPath expression. When "false", these warnings are suppressed. This attribute defaults to "true".

NOTE: This attribute is Warn_On_Empty in Visual Integrator.

aliases Array of Strings

Defines new column names for the columns already defined in the input data. The format is "oldname=newname". Blanks before or after the column names are ignored. Spaces within a column name are acceptable. If newname is blank, the given column is deleted from the output flow.

NOTE: This attribute is Alias Lines in Visual Integrator.
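
The alias format can be sketched as follows. This Python function is an illustrative model of the rules described above, not Integrator's code. It assumes that a blank new name is what deletes a column; the name apply_aliases and the sample column names are hypothetical.

```python
def apply_aliases(columns, aliases):
    """Rename or drop columns according to "oldname=newname" alias strings."""
    mapping = {}
    for alias in aliases:
        old, _, new = alias.partition("=")
        mapping[old.strip()] = new.strip()  # blanks around names are ignored

    result = []
    for name in columns:
        if name not in mapping:
            result.append(name)           # not aliased: keep as-is
        elif mapping[name]:
            result.append(mapping[name])  # aliased: use the new name
        # aliased to a blank name: column is dropped from the output
    return result


renamed = apply_aliases(
    ["Last_Name", "active", "Company Name"],
    [" Last_Name = Surname ", "active="],
)
```

Here Last_Name becomes Surname, active is removed, and Company Name passes through unchanged.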

prefix String Defines a prefix that is prepended to all columns in the flow that are not aliased using the aliases array. If you want a space between the prefix and the column name, include that space in the prefix string definition.
first Integer Specifies the number of records to be read from the input file. This limit is particularly useful for script testing on a small number of input records. If used, Integrator reads up to the specified number of records. If not used, all rows are returned.
keep_columns Array of Strings Defines a list of columns to be kept by the input object. If this attribute is not used, all columns are kept. The output flow of the object is limited to those columns that are listed, and no excluded columns are available to subsequent process objects. Column names in the keep_columns array should be given after they are aliased or prepended with the prefix string.
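
The interaction between prefix and keep_columns can be modeled with a short Python sketch. It is a simplified illustration of the behavior described above, not Integrator's code; the function names and sample columns are hypothetical.

```python
def apply_prefix(columns, aliased_names, prefix):
    """Prepend prefix to every column that was not renamed by an alias."""
    return [c if c in aliased_names else prefix + c for c in columns]


def keep_only(columns, keep_columns=None):
    """Keep only the listed columns; None (attribute not used) keeps all."""
    if keep_columns is None:
        return list(columns)
    return [c for c in columns if c in keep_columns]


# "Surname" came from an alias, so it is left unprefixed.
# The prefix string includes its own trailing space.
after_prefix = apply_prefix(["Surname", "Company Name"], {"Surname"}, "emp ")

# keep_columns must use the final names, including the prefix.
kept = keep_only(after_prefix, ["emp Company Name"])
```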
rename_duplicates Boolean Creates new column names for duplicate column names that appear in the input flow for this object. Subsequent columns that have the same name as an earlier column are given the names name_2 ... name_(n) based on their positional order in the input. If a column in the input flow already has one of these names, that number is skipped.

For example, if the input flow already has a column named "DESC_2", the object will name the duplicate column DESC as "DESC_3". The duplicate naming process occurs before the attributes defining aliases, prefixes, or the columns to keep are applied, so these generated column names can be aliased to another name.

NOTE: This attribute is Rename Duplicates in Visual Integrator.
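
The duplicate naming rule can be sketched in Python as follows. This is a model of the behavior described above, not Integrator's implementation; the function name and the sample column list are hypothetical, and the handling of a third duplicate is an assumption based on the name_2 ... name_(n) pattern.

```python
def rename_duplicates(columns):
    """Give repeated column names a numeric suffix, skipping names already in use."""
    taken = set(columns)   # every name present in the input flow
    seen = set()
    result = []
    for name in columns:
        if name not in seen:
            seen.add(name)
            result.append(name)  # first occurrence keeps its name
            continue
        n = 2
        while f"{name}_{n}" in taken:
            n += 1               # skip suffixes that would collide
        new_name = f"{name}_{n}"
        taken.add(new_name)
        result.append(new_name)
    return result


renamed = rename_duplicates(["DESC", "DESC_2", "DESC", "QTY", "DESC"])
```

Because DESC_2 already exists in the input, the first duplicate DESC becomes DESC_3 and the next becomes DESC_4.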

trace_after Sub-object

Traces data flows leaving the specified object, which makes debugging scripts easier. This is equivalent to adding a Trace process object immediately after the current object.

See Embedded Trace Object for more on using trace sub-objects.

NOTE: When using the XML input object, the XML library returns data as utf-8, even if the input encoding is latin1. Once the data flow becomes utf-8, that is how it will be passed along to the rest of the task. Be sure to perform a conversion on the output object if you need to move back to latin1.
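
Why the conversion matters can be seen in a short Python sketch, independent of Integrator. The sample text is invented for illustration.

```python
# "é" is a single byte in latin1 but two bytes in UTF-8.
text = "Café"
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# Reading UTF-8 bytes as if they were latin1 garbles the accented character.
misread = utf8_bytes.decode("latin-1")

# Converting explicitly restores the expected latin1 bytes.
converted = utf8_bytes.decode("utf-8").encode("latin-1")
```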