XML Input Object
The Integrator XML input object allows Integrator to extract tabular data from XML (Extensible Markup Language) documents. An XML document describes a tree of elements and attributes. Since Integrator processes tabular data, the XML input object extracts data from this tree in the form of rows and columns. XML is defined by the World Wide Web Consortium (W3C) through the recommendation at http://www.w3.org/TR/xml.
Integrator uses the XPath query language to identify a set of nodes that will become rows, and uses XPath expressions to identify the column values relative to these nodes. This technique allows a large number of different XML documents to be accessed and converted into tabular format. XPath is also defined as a W3C recommendation: http://www.w3.org/TR/xpath20.
Although XPath is a fairly powerful expression language, simple expressions are enough to extract most data from an XML document, and the abbreviated syntax is sufficient for most purposes.
The XML input object reads the entire XML file into memory in order to search and access the nodes quickly. When processing large XML input files, you may want to stage them into tab-delimited files before continuing with other processing.
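
As a rough illustration of the row-and-column extraction described above, and of staging the result as tab-delimited text, here is a minimal Python sketch. It is not Integrator's implementation. The sample document, the helper names (`column_value`, `default_name`, `extract`, `to_tsv`) and the small XPath subset it handles are assumptions made for this example, and Python's `xml.etree.ElementTree` is used only because it is widely available.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this sketch.
DOC = """\
<Company>
  <Name>Acme</Name>
  <Department>
    <Department_Name>Sales</Department_Name>
    <Employee active="yes"><Last_Name>Smith</Last_Name></Employee>
    <Employee active="no"><Last_Name>Jones</Last_Name></Employee>
  </Department>
</Company>
"""


def column_value(node, expr, parents):
    """Evaluate a tiny subset of abbreviated XPath relative to `node`.

    Handles only ".", "..", child element names and a final "@attribute".
    """
    current = node
    for step in expr.split("/"):
        if current is None:
            return None
        if step in ("", "."):
            continue
        if step == "..":
            current = parents.get(current)
        elif step.startswith("@"):
            return current.get(step[1:])
        else:
            current = current.find(step)  # first matching child only
    return None if current is None else (current.text or "").strip()


def default_name(expr):
    """Take the column name from the last identifier in the expression."""
    return expr.split("/")[-1].lstrip("@")


def extract(xml_text, match, columns):
    """Return (column names, rows) for every node matched by `match`."""
    root = ET.fromstring(xml_text)  # the whole document is held in memory
    parents = {child: parent for parent in root.iter() for child in parent}
    specs = [(c, default_name(c)) if isinstance(c, str) else tuple(c)
             for c in columns]
    names = [name for _, name in specs]
    rows = [[column_value(node, expr, parents) for expr, _ in specs]
            for node in root.findall(match)]
    return names, rows


def to_tsv(names, rows):
    """Serialize a header line and the rows as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return buf.getvalue()


# ElementTree needs the relative form ".//Employee" where full XPath
# would accept "//Employee".
names, rows = extract(
    DOC,
    ".//Employee",
    ["./Last_Name", "../Department_Name", "@active",
     ("../../Name", "Company Name")],
)
print(to_tsv(names, rows))
```

Each matched `<Employee>` node becomes one row, and each column expression is resolved relative to that node, which is the behavior the match and columns attributes below describe.
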
XML Attributes
| Attribute | Type | Description |
|---|---|---|
| input_type (required) | String | Identifies the object as an XML input object. The value of this string is "XML". input_type = `xml`, |
| filename | String | Defines the file name for the XML input file. filename = `..\\DI_Projects\\xml\\invoice.xml`, |
| match | String | Defines an XPath expression to match the nodes that will be returned as rows of input. Each matching node results in one row. The column values are defined using the columns attribute described below. The match string can be thought of as specifying a hierarchy of elements reaching down to the element that should be represented as a row in the output flow. For example, if you have an XML document with elements "Company", "Department" and "Employee", you can request a table of Employees by specifying an XPath expression as follows: match = `/Company/Department/Employee`, You can also request a table of Employees by skipping levels and directly specifying the desired element as follows: match = `//Employee`, If the document you are reading defines a default namespace, you cannot refer to elements using just a simple reference such as /Company or /Employee. A prefix is required. See XML Namespaces for more information. |
| columns | Array of Strings or Arrays | Defines a set of XPath expressions and names to be used as column values and column names. Each element in this array is either a string (which defines an XPath expression), or a 2-element array defining an XPath expression and the corresponding column name. If only a string is specified, the column name is taken from the last identifier contained in the XPath expression. For example, if the XPath expression is "./Last Name", the column name will be "Last Name". If the XPath expression is "../../@Company", the column name will be "Company". In many cases, the final identifier in an expression is unique and descriptive enough to act as a column name. In the case of duplicate element names in the schema, or complex XPath expressions, the column name can be specified explicitly. If the XPath expression does not end in an identifier (for example, "." or ".."), the column name must be specified. The XPath expression for the column is evaluated relative to the node matched by the match query that defines rows. If a column XPath expression matches multiple nodes, only the first one is used as the column value. The abbreviated XPath syntax is very similar to file system relative paths. Using the Employee example above, the following XPath expressions refer to elements relative to the Employee node: "./Last_Name" refers to a <Last_Name> element under the <Employee> node; "../Department_Name" refers to a <Department_Name> element under the <Department> node; "./@active" refers to an attribute named "active" in the <Employee> node. An example of a columns array would be: columns = { "./Last_Name", "../Department_Name", "@active", {"../../Name", "Company Name" } }, |
| warn_on_empty | Boolean | When "true", Integrator displays a warning if no nodes are matched by the match XPath expression. When "false", these warnings are suppressed. This attribute defaults to "true". NOTE: This attribute is Warn_On_Empty in Visual Integrator. |
| aliases | Array of Strings | Defines new column names for the columns already defined in the input data. Format is "oldname=newname". Blanks before or after the column names are ignored. Spaces within a column name are acceptable. If newname is blank, then the given column is deleted from the output flow. NOTE: This attribute is Alias Lines in Visual Integrator. |
| prefix | String | Defines a prefix that is prepended to all columns in the flow that are not aliased using the aliases array. If you want a space between the prefix and the column name, include that space in the prefix string definition. |
| first | Integer | Specifies the number of records to be read from the input file. This limit is particularly useful for script testing on a small number of input records. If used, Integrator reads up to the specified number of records. If not used, all rows are returned. |
| keep_columns | Array of Strings | Defines a list of columns to be kept by the input object. If this attribute is not used, all columns are kept. The output flow of the object is limited to those columns that are listed, and no excluded columns are available to subsequent process objects. Column names in the keep_columns array should be given after they are aliased or prepended with the prefix string. |
| rename_duplicates | Boolean | Creates new column names for duplicate column names that appear in the input flow for this object. Each subsequent column that has the same name as an earlier column is given the name name_2 ... name_(n), based on positional order in the input. If a column in the input flow already has that name, that number is skipped. For example, if the input flow already has a column named "DESC_2", the object names the duplicate DESC column "DESC_3". Duplicate renaming occurs before the attributes defining aliases, prefixes or the columns to keep are applied, so these generated column names can be aliased to another name. NOTE: This attribute is Rename Duplicates in Visual Integrator. |
| trace_after | Sub-object | Traces data flows leaving the specified object, which makes debugging scripts easier. This is equivalent to adding a Trace process object immediately after the current object. See Embedded Trace Object for more on using trace sub-objects. |
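
The numbering rule described for rename_duplicates in the table above can be sketched in a few lines of Python. This is only an interpretation of the description, not Integrator's code: the function name `rename_duplicates` is reused for readability, and the sketch assumes that "already has this name" means any column name present anywhere in the input, or already generated.

```python
def rename_duplicates(names):
    """Rename repeated column names to name_2, name_3, ... in input order.

    A suffix number is skipped when that name is already taken, either by
    a column in the input or by a name generated earlier (an assumption).
    """
    taken = set(names)   # every name present in the input flow
    seen = set()         # names whose first occurrence has been kept
    result = []
    for name in names:
        if name not in seen:
            seen.add(name)
            result.append(name)
            continue
        n = 2
        while f"{name}_{n}" in taken:
            n += 1
        new_name = f"{name}_{n}"
        taken.add(new_name)
        result.append(new_name)
    return result


print(rename_duplicates(["DESC", "DESC_2", "DESC"]))
```

With the input above, the second DESC column cannot become DESC_2 because that name is already in use, so it is named DESC_3, matching the example given for the attribute.
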
XML Namespaces
XML namespaces can affect how the XML input object and its XPath expressions behave. XML uses namespaces to identify elements uniquely across different object definitions. If the document you are reading defines a default namespace with the xmlns attribute, you need to take that into account when writing your XPath expressions.
XML elements may define namespaces with the xmlns attribute in one of the following formats:
<root xmlns:h="http://www.w3.org/TR/html4/">
<root xmlns="http://www.w3.org/TR/html4/">
In the first format, the namespace "http://www.w3.org/TR/html4/" is defined with the prefix "h". In the second format, the namespace "http://www.w3.org/TR/html4/" is defined as the default namespace.
If a document does not have a default namespace, any element without a prefix (for example, <Employee>) is considered to have no namespace. XPath can refer to these elements using a simple reference, such as "//Employee".
If the document you are reading defines a default namespace, you cannot refer to elements using just a simple reference (/Company, /Employee, etc.). They must have a prefix in an XPath expression to identify them as part of that namespace, even though the original namespace does not have a prefix.
To accommodate this situation, the XML input object checks the root node of the document to see whether it defines a default namespace. If it does, the object defines a prefix "_" that refers to this namespace for use in XPath names. In this situation, element names should be preceded by "_:" in the XPath expressions. For example:
match = "//_:Employee"
and
columns = {"./_:Last_Name",
"../_:Department_Name"
}
This should be sufficient to return the proper data.
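
The effect of a default namespace on unprefixed XPath names can be reproduced outside Integrator. The Python sketch below uses the standard `xml.etree.ElementTree` module and binds the prefix "_" by hand to mimic the behavior described above. The sample document and the namespace URI `http://example.com/hr` are made up for the illustration, and nothing here is taken from Integrator's own code.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose root declares a default namespace.
DOC = """\
<Company xmlns="http://example.com/hr">
  <Department>
    <Department_Name>Sales</Department_Name>
    <Employee><Last_Name>Smith</Last_Name></Employee>
  </Department>
</Company>
"""

root = ET.fromstring(DOC)

# An unprefixed name means "no namespace", so nothing matches here.
plain = root.findall(".//Employee")

# ElementTree stores the namespace in the tag as "{uri}Company".
uri = root.tag[1:].split("}", 1)[0] if root.tag.startswith("{") else ""

# Bind the prefix "_" to the default namespace and use it in the paths.
ns = {"_": uri}
prefixed = root.findall(".//_:Employee", ns)
last_names = [e.findtext("./_:Last_Name", namespaces=ns) for e in prefixed]

print(len(plain), len(prefixed), last_names)
```

The unprefixed query returns no nodes, while the "_:" form finds the Employee element, which is why the prefix is needed whenever the root declares a default namespace.
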
NOTE: When using the XML input object, the XML library returns data as utf-8, even if the input encoding is latin1. Once the data flow becomes utf-8, that is how it will be passed along to the rest of the task. Be sure to perform a conversion on the output object if you need to move back to latin1.
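
To show what the conversion back to latin1 involves at the byte level, here is a small Python sketch. It does not use any Integrator feature. The helper name `to_latin1` and the sample strings are assumptions made for this example, and in a real task the conversion would be configured on the output object as the note says.

```python
def to_latin1(value: str, errors: str = "strict") -> bytes:
    """Encode text as latin1 (ISO-8859-1).

    With errors="strict", characters outside latin1 raise
    UnicodeEncodeError; with errors="replace" they become "?".
    """
    return value.encode("latin-1", errors=errors)


# The same text has different byte representations in the two encodings.
utf8_bytes = "Müller".encode("utf-8")     # two bytes for "ü"
text = utf8_bytes.decode("utf-8")
latin1_bytes = to_latin1(text)            # one byte for "ü"

print(utf8_bytes, latin1_bytes)
```

Characters that have no latin1 equivalent cannot be converted without loss, so it is worth deciding ahead of time whether such values should fail or be replaced.
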