Build Code Lookup Samples
The lookup operation for a Build script can be refined in many ways. The basic pattern is:
lookup { <input> <keys> <columns> <duplicates> }
The <input> can be any input type, including cbase-input for cBases, text-input for text files, and dive-input for Dives. The input itself can have operations inside, for example: sort, filter, remove, and lookup. You can even lookup inside a lookup's input. Lookup is itself an operation, so you can also do a remove before or after the lookup.
Here is the extended pattern:
<operations> lookup { <cbase-input or text-input or dive-input> { <operations> } <keys> <columns> <duplicates> } <operations>
Here are some example scenarios.
This lookup connects as "keys" any columns whose names match, and then brings all other columns in as lookups.
lookup { cbase-input "other.cbase" }
After removing "A" from other.cbase, this lookup connects as "keys" any columns whose names match, and then brings all other columns in as lookups.
lookup { cbase-input "other.cbase" { remove "A" } }
This lookup adds columns "B" and "C", using "A" as the key.
lookup { cbase-input "other.cbase" key "A" "A" column "B" column "C" }
This lookup flags the columns that are brought into the build as Info Fields for ProDiver. This is done by including the attribute keys-are-required-dimensions in the lookup block. For example:
build { text-input "file1.txt" { column "Customer" type="string" column "Product" type="string" column "Sales" type="integer" column "CustGrp" type="string" required-dimension="Customer" } text-input "file2.txt" { column "Customer" type="string" column "Product" type="string" column "Budget" type="integer" column "CustGrp" type="string" required-dimension="Customer" } lookup { text-input "CustomerGroup.txt" { column "Customer" type="string" column "Customer Group" } key "Customer" "Customer" keys-are-required-dimensions } output "qa-test.cbase" }
Here you can see that ProDiver has set the info from the lookup as a dynamic dimension.
This lookup adds column "B", using "A" as the key. Removing column "C" in the input block is the preferred use of the syntax.
lookup { cbase-input "other.cbase" { remove "C" } key "A" "A" column "B" }
This lookup flags column B as optional, that is, not a suggested dimension.
lookup { text-input "lookup.txt" key "A" "A" column "B" suggested-dimension=false }
This lookup flags the columns other than B as optional. The columns can still be promoted for diving, but they are not dimcounted for the console.
lookup { text-input "lookup.txt" { include-other-columns suggested-dimension=false } key "A" "A" column "B" }
Suppose you have a column A in your input, and you want to bring in a possibly updated version of A from a lookup table. You could approach it as follows:
- Perform the lookup, but bring in the column under a new name such as A_new.
- Replace column A using the coalesce() function to combine the new and old value. Note that coalesce() returns the first non-null value it sees.
- Remove column A_new.
build { text-input "data.txt" { column "Key" type="string" column "A" type="string" } lookup { text-input "lookup.txt" { column "Lookup Key" type="string" column "A" type="string" rename "A" "A_new" } key "Key" "Lookup Key" column "A_new" } replace "A" `coalesce(value("A_new"), value("A"))` remove "A_new" output "data.cbase" }
Here's why this works.
When running the lookup:
- A given key might have no match. In this case, A_new is null, and so coalesce() returns A and has no effect.
- A given key does have a match, and it is not null. Then coalesce() returns this new value, A_new.
- A given key does have a match, but it is null. Then coalesce() returns the original A, and has no effect.
That is how a Lookup of join type update works when using the Data Integrator.
You can even specify alternate null values to treat as null, the same as the Integrator Lookup object does when the attribute update_null_value is used:
text-input "lookup.txt" { column "Lookup Key" column "A" type="string" { nulls "XXX" } rename "A" "A_new" }
The input can be a simple or a more complicated text file. The text file needs to start with a row of column headers.
lookup { text-input "lookup-color.txt" key "Color" "Color" column "Wavelength" column "Hex" }
A text file can include a range of values to use as part of the lookup process. For a lookup row to match, the main table's key value must be between the begin-range and end-range values, inclusively. The begin-range and end-range values can be null, in which case the range is open on either side (or both sides). Range lookup keys can be any type except string.
lookup { text-input "/data_sources/key_lookup_department_code.txt"{ limit-similar-values-warnings 0 column "Value" type="string" rename "Value" "Department Code" remove "Column Name" remove "Swim Lane" column "Eff Start Date" type="date" format="YYYY-MM-DD" column "Eff End Date" type="date" format="YYYY-MM-DD" column "Special Nursery Dept Key" type="string" suggested-dimension=false include-other-columns } range-key "Posting Date" "Eff Start Date" "Eff End Date" key "Department Code" "Department Code" }
Results of a dive can be used to build the cBase, while the results of another dive can be used for lookup purposes.
build { dive-input "/client-files/Monthly-Dive.dive" { } lookup { dive-input "/client-files/Master-Dive.dive" { } key "Drug" "Drug" } output "/cbases/Monthly-Data-with-drugs.cbase" }
The following sample Build script shows an example of a dive within a lookup. The first lookup is a regular lookup to get info values. The second lookup dives into another cBase to bring in summarized data. For example:
build { cbase-input "accounts_initial.cbase" lookup { text-input "days_since_last_discharge.txt" { column "Account ID" type="string" column "Last Discharge Date" type="date" format="YYYY-MM-DD" column "Days Since Last Discharge" type="integer" } key "Account ID" "Account ID" } lookup { dive-input { dive { cplan { cbase-input "charge_detail_initial.cbase" } window { dimension "Account ID" column "Charge Quantity" column "Charge Amount" } } } key "Account ID" "Account ID" } output "accounts.cbase" }
The accounts and charges data sets have "Account ID" in common, but the charges data set has more detail, with one record per charge for the account. In order to bring charge information into the accounts data set, you need to summarize it—which is what the dive does.
The dive can be extracted and run separately.
This example sorts the data in order to identify duplicates, filters them out, and then adds a column to flag new items.
lookup { text-input "/data/patient-data-*.dat" { column "Patient URN" type="string" sort { column "Patient URN" } add "dupe" ``` if(on_previous_row(value("Patient URN")) !=value("Patient URN"),"N","Y")``` filter `value("dupe")="N"` remove "dupe" add "New File Flag" `"Y"` } key "Patient URN" "Patient URN" }
Sometimes lookup tables have multiple rows with the same key.
Consider this small data set listing 5 presidential names. You want to map the president's name to the number of his presidency.
The lookup source could be like the following table (with two entries for Cleveland):
You can add the tag resolve-duplicates to return either value (any), the first value, or the last value. For example, this Dive would return 22 for president Cleveland:
dive { cplan { text-input "president-names.txt" lookup { text-input "presidents.csv" key "Name" "Name" column "Presidency" resolve-duplicates "first" } } window { dimension "Name" column "Presidency" } }
The proper way to nest lookups is to put the inner lookup (and here the `add` it depends on) under the input tag of the outer lookup. For example:
build { text-input "input.txt" { column "A" type="string" column "B" type="string" } lookup { text-input "lookup_1.txt" { column "A_lookup" type="string" column "C" type="string" add "D" `"D"` lookup { text-input "lookup_2.txt" { column "D_lookup" type="string" column "E" type="string" } key "D" "D_lookup" } } key "A" "A_lookup" } output "my-output.cbase" }
You can merge or concatenate two sources of the same type for a lookup using a consolidate block.
lookup { consolidate { text-input "lookup-A1.txt" text-input "lookup-A2.txt" } key "A" "A" column "B" }
lookup { consolidate { cbase-input "lookup-pre.cbase" cbase-input "lookup-post.cbase" } key "A" "A" column "B" }
NOTE:
- Lookups are case sensitive.
- For lookups, null is a valid key.
- If <unknown> appears as a key for a row, then <unknown> is used for all lookup columns for that row. Best practice is to not have key columns in the base table that contain the value <unknown>.
- By default, when a build encounters duplicate lookup keys with identical values, that value is used. Otherwise, a warning is issued due to the ambiguity, and <unknown> is used instead. There is only one warning issued per lookup. Use the resolve-duplicates tag to change this behavior. See General Build Tags.
See also: