Spectre Build Warnings

When you first run a Spectre build using only the input file, certain warning messages are common because Spectre determines the data types automatically. The build can deduce the types of most columns, but it cannot always determine a data type definitively. Therefore, when building cBases, review all warning messages to decide whether you need to modify your build script.

Informational messages versus warnings

Some build messages are purely informative, while others indicate issues that can affect build performance and the data in the cBase. For example, a message such as "Patient Name O'Brien is similar to Patient Name OBrien" lets you know that Spectre noted similar names. On the other hand, the following messages indicate that the Spectre build is attempting to determine the data type automatically and might not be operating as efficiently as possible:

  • "One or more values have leading decimal"
  • "One or more values have trailing zeros"
  • "A column has more than 1000 non-canonical numbers"

When a build uses a lookup and encounters a duplicate key with different values, it selects one of the values at random and generates a "Duplicate lookup keys" warning.
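
Because the selected value is random, it can help to check a lookup source for conflicting keys before the build. The following Python sketch is only an illustration of that check and is not part of Spectre; the function name and the sample rows are hypothetical.

```python
from collections import defaultdict

def find_conflicting_keys(rows):
    """Return {key: sorted distinct values} for keys that map to more than one value."""
    values_by_key = defaultdict(set)
    for key, value in rows:
        values_by_key[key].add(value)
    return {k: sorted(v) for k, v in values_by_key.items() if len(v) > 1}

# Hypothetical lookup rows as (key, value) pairs.
lookup_rows = [
    ("A100", "Cardiology"),
    ("A200", "Oncology"),
    ("A100", "Radiology"),  # same key, different value: a conflict
    ("A200", "Oncology"),   # exact repeat: same key, same value
]

conflicts = find_conflicting_keys(lookup_rows)
print(conflicts)
```

Only "A100" is reported, because it is the only key paired with two different values.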

Some build warnings are issued only when an input has more than one row.

String, numeric, and date data types

Some columns can plausibly be more than one data type. For instance, postal code columns should be stored as strings, but the build treats them as integers. A build warns about numeric columns with unusual formatting, such as leading or trailing zeros, so you need to determine the cause of the warnings. When parsing numeric columns, the Spectre build treats commas (',') and decimal points ('.') as equivalent. For example, in a "fixed100" type column, Spectre parses the values "1,23" and "1.23" the same way, regardless of the system locale.
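
The separator rule can be pictured with a short Python sketch. This is an illustration under stated assumptions, not Spectre's parser: the function name is hypothetical, and the sketch assumes that a fixed100 value carries two decimal places and that the input contains no thousands separators.

```python
from decimal import Decimal

def parse_fixed100(text):
    """Parse a two-decimal value, accepting ',' or '.' as the decimal separator.

    Assumption: no thousands separators, so "1,234.56" is not handled here.
    """
    normalized = text.strip().replace(",", ".")
    return Decimal(normalized).quantize(Decimal("0.01"))

# Both spellings produce the same value, whatever the system locale is.
print(parse_fixed100("1,23") == parse_fixed100("1.23"))
```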

Date and period columns are similar: the build interprets them as strings, but it warns about columns that look like they might contain certain date or period formats, and it even suggests syntax to process them as such.

How Spectre processes columns with undefined data types

You can eliminate certain warnings and improve the efficiency of the build by specifying the data types of the columns. The following examples give more detail about the processing that Spectre must do to handle columns with undefined data types:

  • If Spectre needs to determine the column data type, it must be able to handle the possibility that later in the file it encounters a value that does not fit the current assumption. Imagine that the data column has included "1", "2.5", and "3.50", so Spectre assigns a data type of fixed100 for that column. But what if the next value is "1.2.3", which must be a string? Spectre needs to be ready to turn all of the previously read values into strings.
  • The values "1" and "2.5" are easy to convert to strings because the numbers, written back out, are identical to the text that was read: "1" and "2.5". Spectre calls these string representations "canonical." For other representations such as "3.50", ".1", and "04", Spectre stores the exact input, so that the values are accurately represented if it turns out that the column is string data. Storing the original strings takes more memory.
  • The "trailing zeros", "leading zeros", "trailing decimal", and "leading decimal" warnings all indicate that Spectre is using memory to store the original values in case the column is later determined to be string data. Correctly specifying the column data type reduces this memory usage.
  • The "more than 1000 non-canonical" warning indicates that a large list of original string values is being stored. If the build script specifies that the column data is of type fixed100, Spectre does not need to store the non-canonical representations, saving time and memory.
  • Note that most of the time these warnings do not indicate that Spectre is losing data. However, if the column is supposed to be type string but Spectre does not correctly determine the data type, then it might drop significant digits. For example, suppose that the column is a Product ID string with distinct values "12", "012", and "000012". If Spectre instead determines that the column is an integer, then Spectre drops the leading zeros and the three different values all become "12".
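
The distinction between canonical and non-canonical values, and the Product ID example, can be sketched in Python. This is a rough model of the idea described above, not Spectre's implementation; the function names and the pattern used to recognize numbers are assumptions made for the illustration.

```python
import re
from decimal import Decimal

# Assumed pattern for a plain decimal number: optional sign, digits, optional point.
NUMERIC = re.compile(r"-?(\d+\.?\d*|\.\d+)")

def canonical_form(text):
    """Return the text a parsed number would be written back as, or None if not numeric."""
    if not NUMERIC.fullmatch(text):
        return None
    return format(Decimal(text).normalize(), "f")

def classify(text):
    """Label a value as 'canonical', 'non-canonical', or 'string'."""
    canon = canonical_form(text)
    if canon is None:
        return "string"
    return "canonical" if canon == text else "non-canonical"

samples = ["1", "2.5", "3.50", ".1", "04", "1.2.3"]
labels = {s: classify(s) for s in samples}
for value, label in labels.items():
    print(f"{value!r}: {label}")

# Product ID example: three distinct strings collapse to one integer.
product_ids = ["12", "012", "000012"]
as_integers = {int(p) for p in product_ids}
print(len(set(product_ids)), "distinct strings ->", len(as_integers), "distinct integer")
```

In this model, "1" and "2.5" round-trip unchanged, "3.50", ".1", and "04" would need their original text kept, and "1.2.3" forces the whole column to be treated as strings.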
