Summary Functions

Spectre summary functions operate on columns of data, rather than rows, to produce a single value. They can be used when doing a dive.

any

Evaluate the specified expression on a randomly selected row, and return that value as the result.

any(expression : any) : any

The order of rows is explicitly undefined. You cannot rely on the function to return the same result, even if it appears to work in small tests. When this function runs in parallel, it might pick a different row than when not run in parallel. The function might pick different rows on different runs. See also info.

For example:

`any(value("Address"))`returns the value of the "Address" column in a randomly selected row

argmax

Evaluate the specified expression and return the dimension value that is in the same row as the maximum result of the given expression.

argmax(expression : any, dimension : string) : any

For example:

`argmax(calc("Population"), "State")` returns California

argmin

Evaluate the specified expression and return the dimension value that is in the same row as the minimum result of the given expression.

argmin(expression : any, dimension : string) : any

For example:

`argmin(calc("Population"), "State")` returns Wyoming

average

Evaluate the specified expression on all rows, and return the average. The average function ignore nulls.

average(expression : numeric) : double

For example:

`average(value("Price"))` returns the average for the "Price" column

const

Evaluate the specified expression on a randomly selected row, and return that value as the result. This is exactly like any except that it does not get sanitized out of totals.

const(expression : any) : any

Use this if you have a column that you know is constant and just want its only value. It does not verify that the column is in fact constant; it just returns a value and is done.

For example:

`const(value("Extract Date"))` returns the value from one row in the "Extract Date" column

count

Return the number of rows in the cBase or cPlan. This function does not apply to multilevel contexts.

count() : integer

For example:

`count()` returns the number of rows in the working set

`filter(count(), value("State") = "Ohio")` returns the number of rows where "State" is "Ohio"

first

Evaluate the specified expression on the first record in the working set.

first(expression : any) : any

For example:

`first(value("Charge"))` returns the first charge encountered for the dimension value in the data set

geomean

Evaluate the specified expression on all rows and return the geometric mean.

geomean(expression : numeric) : double

For example:

`geomean(value("Price"))` returns the geometric mean value of the price column

harmean

Evaluate the specified expression on all rows and return the harmonic mean.

harmean(expression : numeric) : double

For example:

In a table where column "X" has values 1, 2, 3, 4, 5, `harmean(value("X"))` returns 2.189781

hyperloglog

Return an estimate of the number of unique combinations of v1, v2, and so on.

hyperloglog(precision : integer, expression1 : any, expression2 : any, ... : any) : integer

The precision argument must be between 4 and 16. Higher precision means the results are closer to the actual dimcount, but the calculation uses more memory.

For example:

calc "Customer Count" `hyperloglog(8, value("Customer Name"))` returns an estimate of unique customers

calc "Cust Product Count" `hyperloglog(8, value("Customer Name"),value("Product Name"))` returns an estimate of the unique combinations of customer and product

info

Determine if the specified expression has the same result on every row and return that value; otherwise, return an error.

info(expression : any) : any

Evaluate the specified expression on a randomly selected row (which one is not defined, as with any) and return the value. Then, evaluate the expression on all rows and log errors for those rows where the value is different than the picked value.

For example:

`info(value("Address"))` returns the value found in the "Address" column if all values are the same. If all values are not the same, returns an error.

intercept

Calculate the point at which a line will intersect the y-axis by using existing y-values and x-values. The intercept point is based on a best-fit regression line plotted through the known y-values and known x-values.

intercept(y-value: numeric, x-value: numeric) : double

For example:

`intercept(value("Y"), value("X"))` returns 1.0 with the indicated Y and X values.

Sample Y and X values for intercept()

`intercept(value("Y"), value("X"))` returns 0.0 with the indicated Y and X values.

data values

`intercept(value("Y"), value("X"))` returns -9.0 with the indicated Y and X values.

data values

kahan_sum

Evaluate the specified expression on every row and add the results. This function is the same as sum, but with countermeasures against floating point imprecision.

kahan_sum(expression : double) : double

If you sum millions of floating point values, the imprecision inherent in that calculation can become noticeable. If that happens, use the kahan_sum function instead,which is slower than sum but reduces the occurrence of this problem.

For example:

This is mostly the same as sum(), with very rare exceptions, but in a table where column "X" has values 1.001, 2.002, 3.003, 4.004, 5.005, `kahan_sum(value("X"))` returns 15.015.

kurtosis

Evaluate the specified expression on all rows and return the sample excess kurtosis, a measure of whether the data are peaked or flat relative to a normal distribution.

kurtosis(expression : numeric) : double

For example:

In a table where column "X" has values 1, 2, 3, 4, 5, `kurtosis(value("X"))` returns -1.2

last

Evaluate the specified expression on the last record in the working set.

last(expression : any) : any

For example:

`last(value("Direction"))` returns the last direction encountered for the dimension value in the data set

max

Two variants exist: one for a row, which is a non-summary math expression, and another for a column, which is a summary expression. The parameters determine which variant applies.

Return the largest value of a list of numbers.

max(value1 : any, value2 : any, ... : any) : any

For example:

`max(100, 300, 500)`returns 500

Evaluate the specified expression on all rows in the current working set and return the maximum value.

max(expression : any) : any

For example:

`max(value("Price"))`returns the maximum value in the "Price" column

`max(value("Ship Date"))` returns the maximum Ship Date for the current working set. If a dive is on a particular customer, this expression returns the last ship date for that customer only.

NOTE: See also the max Math function.

median

Evaluate the specified expression on all rows and return the median, which is the middle number in a sorted list of numbers.

median(expression : numeric) : double

For example:

`median(value("Salary"))`returns the median value from the Salary column

min

Two variants exist: one for a row, which is a non-summary math expression, and another for a column, which is a summary expression. The parameters determine which variant applies.

Return the smallest value of a list of numbers.

min(value1 : any, value2 : any, ... : any) : any

For example:

`min(100, 300, 500)`returns 100

Evaluate the specified expression on all rows, and return the minimum value.

min(expression : any) : any

For example:

`min(value("Price"))`returns the minimum value in the price column

NOTE: See also the min Math function.

mode

Evaluate the specified expression on all rows, and return the value that occurs most often.

mode(expression : any) : any

The mode() function returns null if all values occur exactly once. Also, if there are nulls in the data and null occurs most often, the function returns null.

For example:

Sample Dive using mode()

Using this data:

Data for mode() example

Yields these results:

Sample Result using mode()

mul

Evaluate the specified expression on every row and multiply the results.

mul(expression : any) : any

For example:

In a table where column "X" has values 1, 2, 3, 4, 5, `mul(value("X"))` returns 120

percentile

Evaluate the specified expression on all rows and return the specified percentile. To calculate the percentile, Spectre uses an algorithm that does linear interpolation of closest ranks.

percentile(expression : numeric, percentile : numeric, predicate : boolean) : double

For example:

`percentile(value("Units), 75)` looks at all of the Units values in the working set, and returns the value at the 75th percentile. If you sorted the Units in the working set from highest to lowest, the expression returns the value three-fourths of the way down the list.

`percentile(value("Units"), 50)` is equivalent to `median(value("Units"))`

`filter(percentile(value("Units"), 75), value("Units") != 0)` filters out zero units from the working set before calculating the 75th percentile

skewness

Evaluate the specified expression on all rows and return the sample skewness, a measure of symmetry or lack thereof.

skewness(expression : numeric) : double

slope

Compute the slope of the linear regression line through data points in known y-values and known x-values. The slope is the vertical distance divided by the horizontal distance between any two points on the line, which is the rate of change along the regression line.

slope(y-value: numeric, x-value: numeric) : double

For example:

`slope(value("Y"), value("X")` gives 0.428571429 with the indicated Y and X values.

Sample Y and X values for slope()

`slope(value("Y"), value("X")` gives 0.6 with the indicated Y and X values.

data values

`slope(value("Y"), value("X")` gives 1.0 with the indicated Y and X values.

data values

squash

Eliminate records by summarizing values. The summary expression is run on all records in the window. You usually call the squash function twice. The first call breaks the working set out by a dimension, and the second call determines the final value.

squash(expression : any, ... :any) : any

The first argument is the summary expression to run on the side table. The second argument is a column or dimension name to multitab the side table on.

For example:

`squash(max(squash(calc("X"), "D")))` computes calc("X") for each value of "D" (as if you made a window on "D"), and then returns the max of those "X" values. This is the equivalent of DimMax[<dimension>,<summary>] in ProDiver.

Squashing on a filter

calc "Solid Doses Sedative" ```squash(sum(squash(filter(calc("Solid Quantity"),

value("BJA Class") = "Sedative"

and dimcount("Prescriber Name-DEA") >= 5

and dimcount("Pharmacy Name-DEA") >= 5),

"ID")))``` returns the number of solid dose sedatives that were prescribed to patients with more than 5 prescribers and more than 5 pharmacies.

When used in a cPlan, this calc says:

From here, dive on "ID" (the inner squash(... "ID"))
For each "ID", compute "Solid Quantity" (the calc(...)) for each record where "BJA Class" is "Sedative" and where that "ID" has at least 5 of the two dimcounts (the filter(...))
Sum all of those computed "Solid Quantity" values (squash(sum(...)))

Squashing on a conditional

Suppose you have a dimension ID, and a dimension Source, and every ID should come from a single Source. You want to compute a single number that tells you if every ID has a single Source value associated with it. You could write a dimensionless dive that computes the number you want, using squash() to dive off each ID and check the dimcount of Source. For example:

dive {
  cplan {
    text-input "squashdata.txt"
  }
  
  window {
    column "Conflicted Count" `squash(sum(squash(if(dimcount("Source") > 1, 1, 0), "ID")))`
  }
}

This gives the number of IDs which have more than one Source.

squash_right

Eliminate records by summarizing values from the right. The summary expression is run on all records in the window.

squash_right(expression.: any, number : integer) : any

For example:

`squash_right(sum(value("Rev")), 2)` squashes records by eliminating values from the right. The "2" is the number of dimensions to roll up and eliminate from the right in the multitab. This value defaults to 1.

stddev

Evaluate the specified expression on all rows and return the Sample Standard Deviation.

stddev(expression : numeric) : double

For example:

`stddev(value("Value"))` returns the standard deviation for the sample values

stddevp

Evaluate the specified expression on all rows and return the Population Standard Deviation.

stddevp(expression : numeric) : double

For example:

`stddevp(value("Value"))` returns the standard deviation for the population values

string_agg

Available in Spectre 7.2(7) and later. Aggregates all values of a column that have a matching dimension. Returns one row of aggregated values for each unique dimension value. The delimiter and last delimiter parameters are optional and default to a comma. The last delimiter parameter defines the delimiter that separates the last two elements of the aggregated values. The include nulls parameter is optional and defaults to false. When the include nulls parameter is true, any null values are aggregated along with non-null values.

string_agg(expression : string, delimiter : string, last delimiter : string, include nulls : boolean)

For example:

column "Aggregated" `string_agg(value("Child"), ",", " and ")` using this data:

yields these results in the new column Aggregated.

sum

Evaluate the specified expression on every row and add the results.

sum(expression : numeric) : numeric

For example:

`filter(sum(value("Revenue")), value("State") = "Ohio")`returns the sum of "Revenue" where "State" is "Ohio"

`sum(value("Revenue"))` returns the total for the Revenue column

`sum(value("Price") * value("Units"))` returns the total of the calculation on each row of the data at run-time

ucount

Evaluate the specified expression on all rows and count the number of different values.

ucount(expression : any) : integer

For example:

`ucount(month(value("Order Date")))`returns the number of months found in the "Order Date" column

`ucount(value("Customer"))` is equivalent to `dimcount("Customer")`

`ucount(period("gregorian month", value("Sold Date")))` returns the number of months represented by dates in the "Sold Date" column

NOTE: The ucount function can be used with any expression, while dimcount can only be used with dimensions.

weighted_median

Evaluate the specified summary on all rows and return the given weighted median.

weighted_median(expression: numeric, weight: numeric) : double

For example:

column "Weighted Median" `weighted_median(value("Rent"), value("Number of Units"))` using this data:

Weighted Median Data Set

yields these results by the Bedroom Type:

Weighted Median Results

weighted_percentile

Evaluate the specified expression on all rows and return the specified weighted percentile.

weighted_percentile(expression: numeric, weight: numeric, percentile : numeric) : double

For example:

column "Weighted Percentile" `weighted_percentile(value("Rent"), value("Number of Units"), 80)` returns the following using the weighted median example:

Weighted Percentile example