Skip to content

[SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL#56808

Open
miland-db wants to merge 3 commits into
apache:masterfrom
miland-db:milan-dankovic_data/pivot-fix
Open

[SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL#56808
miland-db wants to merge 3 commits into
apache:masterfrom
miland-db:milan-dankovic_data/pivot-fix

Conversation

@miland-db

@miland-db miland-db commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

PIVOT (count(val) FOR cat IN (...)) returned NULL instead of 0 for pivot categories with no matching rows. This differs from standard SQL behavior and from systems such as DuckDB and SQL Server, which return 0.

Root cause: the fast path (PivotFirst + ExtractValue) leaves the slot of an empty pivot bucket unset, so ExtractValue returns NULL. The slow path rewrites each aggregate as aggFunc(if(pivotCol <=> value, child, NULL)) and is already correct: count(NULL) is 0.

Why are the changes needed?

This is a correctness bug, present since pivot was introduced.

Does this PR introduce any user-facing change?

Yes. PIVOT with count now returns the aggregate's empty-input value (e.g. 0) instead of NULL for pivot categories with no matching rows.

How was this patch tested?

New tests and golden files.

Was this patch authored or co-authored using generative AI tooling?

No.

@miland-db miland-db changed the title Fix PIVOT [SQL] PIVOT count() empty buckets should return 0 instead of NULL Jun 26, 2026
@miland-db miland-db changed the title [SQL] PIVOT count() empty buckets should return 0 instead of NULL [SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL Jun 26, 2026
@miland-db

Copy link
Copy Markdown
Contributor Author

Adding @cloud-fan

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 blocking, 2 non-blocking, 0 nits. Sound fix for the bare-count case; one correctness gap in the composite branch plus a dead-Coalesce nit.

Correctness (2)

  • The fast-path Coalesce(extractedValue, default) assumes "fast-path NULL ⟺ empty bucket", but PivotFirst.update stores nothing for null values and keeps no presence bit, so a present-but-NULL composite bucket is indistinguishable from an empty one. Composites like nullif(count(v), 5) or if(count(v1) > 0, sum(v2), 0) then return the empty-default instead of NULL — a silent wrong result (flag on by default), plus a pivoted column declared non-nullable that can in fact be NULL. See inline on PivotTransformer.scala.
  • No test covers a composite aggregate that is NULL on a non-empty bucket (the case above); the composite tests use only never-NULL shapes. See inline on DataFramePivotSuite.scala.

Suggestions (1)

  • The case Literal(null, _) => None guard misses null-evaluating non-literal defaults (CEIL(null), null + 1), so redundant Coalesce(x, <expr→null>) wrappers are emitted (visible in the goldens). Harmless to results but dead plan nodes — fold the default first. See inline on PivotTransformer.scala.

Verification

The bare-aggregate path is correct: count / approx_count_distinct (the non-null-defaultResult aggregates that reach the fast path) are never NULL on a non-empty bucket, so the Coalesce fires only on genuinely-empty buckets; sum/avg/min/max correctly stay NULL (defaultResult None → no Coalesce); the ANSI 0/0 deferral matches the slow path. The gap is specifically NULL-capable composites with a non-null empty-default.

// default that only folds to NULL (e.g. sum(x) + 1) is still returned; its Coalesce evaluates
// to NULL on an empty bucket, which is the correct result.
case other =>
val default = other.transform {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The composite branch can return a non-null default for an aggregate that is NULL-capable on a non-empty bucket, and the caller then wraps it in Coalesce(extractedValue, default). The problem is that PivotFirst.update stores a value only if (value != null) and keeps no presence bit, so a buffer slot is NULL iff the pivot category is absent or present-but-the-value-is-NULLExtractValue can't distinguish them, and the Coalesce defaults both. That's correct for bare count-family aggregates (never NULL on a non-empty bucket), but wrong for composites over a NULL-capable sub-aggregate:

  • PIVOT(nullif(count(v), 5) ...) — a bucket with exactly 5 matching non-null rows: slow path = nullif(5,5) = NULL, fast path = Coalesce(NULL, nullif(0,5)=0) = 0.
  • PIVOT(if(count(v1) > 0, sum(v2), 0) ...) — a non-empty bucket with all-NULL v2: slow path = NULL, fast path = 0.
    Since the flag defaults to true, these are live wrong results, and the pivoted column is also declared non-nullable while it can be NULL. The simplest safe fix is to apply the empty-default only for bare aggregates whose defaultResult is non-null (count-family, where a NULL slot provably means empty) and leave NULL-capable composites to NULL / the slow path; a fuller fix would add a per-slot presence signal to PivotFirst.

assert(!df.schema("b").nullable, s"expected 'b' column to be non-nullable: ${df.schema}")
}

test("pivot integral division of counts throws on an empty bucket under ANSI") {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The composite-aggregate coverage here is all never-NULL-on-non-empty shapes (count(v1) div count(v2) throws under ANSI; the bare-count default), so the suite passes even with the composite Coalesce issue above present. Worth adding a case that exercises a composite which is NULL on a non-empty bucket — e.g. nullif(count(v), N) over a bucket with exactly N matching rows asserting NULL, and if(count(v1) > 0, sum(v2), 0) over a non-empty all-NULL-v2 bucket asserting NULL — comparing against the slow path / flag-off.

}
default match {
case _ if !trimAliases(default).foldable => None
case Literal(null, _) => None

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this guard only catches a syntactic Literal(null), but a default that evaluates to null without being a literal — e.g. CEIL(null) (from CEIL(sum(x))) or (null + cast(1 as bigint)) (from sum(x) + 1) — slips through and yields Some(default), so a Coalesce(extractedValue, <expr→null>) wrapper is emitted (you can see it in the analyzer-results golden, e.g. coalesce(__pivot_CEIL(sum…), CEIL(null))). It's harmless to results (Coalesce(x, null) ≡ x, and nullability is unchanged) but adds dead plan nodes and doesn't match the doc's "Return None for … a literal NULL". Folding the default first would catch these: default.foldable && default.eval() == null => None.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants