[SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL by miland-db · Pull Request #56808 · apache/spark

miland-db · 2026-06-26T11:53:42Z

What changes were proposed in this pull request?

PIVOT (count(val) FOR cat IN (...)) returned NULL instead of 0 for pivot categories with no matching rows. This differs from standard SQL behavior and from systems such as DuckDB and SQL Server, which return 0.

Root cause: the fast path (PivotFirst + ExtractValue) leaves the slot of an empty pivot bucket unset, so ExtractValue returns NULL. The slow path rewrites each aggregate as aggFunc(if(pivotCol <=> value, child, NULL)) and is already correct: count(NULL) is 0.

Why are the changes needed?

This is a correctness bug, present since pivot was introduced.

Does this PR introduce any user-facing change?

Yes. PIVOT with count now returns the aggregate's empty-input value (e.g. 0) instead of NULL for pivot categories with no matching rows.

How was this patch tested?

New tests and golden files.

Was this patch authored or co-authored using generative AI tooling?

No.

miland-db · 2026-06-26T13:04:07Z

Adding @cloud-fan

MaxGekk

1 blocking, 2 non-blocking, 0 nits. Sound fix for the bare-count case; one correctness gap in the composite branch plus a dead-Coalesce nit.

Correctness (2)

The fast-path Coalesce(extractedValue, default) assumes "fast-path NULL ⟺ empty bucket", but PivotFirst.update stores nothing for null values and keeps no presence bit, so a present-but-NULL composite bucket is indistinguishable from an empty one. Composites like nullif(count(v), 5) or if(count(v1) > 0, sum(v2), 0) then return the empty-default instead of NULL — a silent wrong result (flag on by default), plus a pivoted column declared non-nullable that can in fact be NULL. See inline on PivotTransformer.scala.
No test covers a composite aggregate that is NULL on a non-empty bucket (the case above); the composite tests use only never-NULL shapes. See inline on DataFramePivotSuite.scala.

Suggestions (1)

The case Literal(null, _) => None guard misses null-evaluating non-literal defaults (CEIL(null), null + 1), so redundant Coalesce(x, <expr→null>) wrappers are emitted (visible in the goldens). Harmless to results but dead plan nodes — fold the default first. See inline on PivotTransformer.scala.

Verification

The bare-aggregate path is correct: count / approx_count_distinct (the non-null-defaultResult aggregates that reach the fast path) are never NULL on a non-empty bucket, so the Coalesce fires only on genuinely-empty buckets; sum/avg/min/max correctly stay NULL (defaultResult None → no Coalesce); the ANSI 0/0 deferral matches the slow path. The gap is specifically NULL-capable composites with a non-null empty-default.

MaxGekk · 2026-06-26T16:53:54Z

+      // default that only folds to NULL (e.g. sum(x) + 1) is still returned; its Coalesce evaluates
+      // to NULL on an empty bucket, which is the correct result.
+      case other =>
+        val default = other.transform {


The composite branch can return a non-null default for an aggregate that is NULL-capable on a non-empty bucket, and the caller then wraps it in Coalesce(extractedValue, default). The problem is that PivotFirst.update stores a value only if (value != null) and keeps no presence bit, so a buffer slot is NULL iff the pivot category is absent or present-but-the-value-is-NULL — ExtractValue can't distinguish them, and the Coalesce defaults both. That's correct for bare count-family aggregates (never NULL on a non-empty bucket), but wrong for composites over a NULL-capable sub-aggregate:

PIVOT(nullif(count(v), 5) ...) — a bucket with exactly 5 matching non-null rows: slow path = nullif(5,5) = NULL, fast path = Coalesce(NULL, nullif(0,5)=0) = 0.

PIVOT(if(count(v1) > 0, sum(v2), 0) ...) — a non-empty bucket with all-NULL v2: slow path = NULL, fast path = 0.
Since the flag defaults to true, these are live wrong results, and the pivoted column is also declared non-nullable while it can be NULL. The simplest safe fix is to apply the empty-default only for bare aggregates whose defaultResult is non-null (count-family, where a NULL slot provably means empty) and leave NULL-capable composites to NULL / the slow path; a fuller fix would add a per-slot presence signal to PivotFirst.

MaxGekk · 2026-06-26T16:53:54Z

+    assert(!df.schema("b").nullable, s"expected 'b' column to be non-nullable: ${df.schema}")
+  }
+
+  test("pivot integral division of counts throws on an empty bucket under ANSI") {


The composite-aggregate coverage here is all never-NULL-on-non-empty shapes (count(v1) div count(v2) throws under ANSI; the bare-count default), so the suite passes even with the composite Coalesce issue above present. Worth adding a case that exercises a composite which is NULL on a non-empty bucket — e.g. nullif(count(v), N) over a bucket with exactly N matching rows asserting NULL, and if(count(v1) > 0, sum(v2), 0) over a non-empty all-NULL-v2 bucket asserting NULL — comparing against the slow path / flag-off.

MaxGekk · 2026-06-26T16:53:54Z

+        }
+        default match {
+          case _ if !trimAliases(default).foldable => None
+          case Literal(null, _) => None


Nit: this guard only catches a syntactic Literal(null), but a default that evaluates to null without being a literal — e.g. CEIL(null) (from CEIL(sum(x))) or (null + cast(1 as bigint)) (from sum(x) + 1) — slips through and yields Some(default), so a Coalesce(extractedValue, <expr→null>) wrapper is emitted (you can see it in the analyzer-results golden, e.g. coalesce(__pivot_CEIL(sum…), CEIL(null))). It's harmless to results (Coalesce(x, null) ≡ x, and nullability is unchanged) but adds dead plan nodes and doesn't match the doc's "Return None for … a literal NULL". Folding the default first would catch these: default.foldable && default.eval() == null => None.

miland-db added 3 commits June 26, 2026 11:53

Init commit

258b950

Regenerate golden files

cdb3e1b

Add golden file

b6b6e1c

miland-db changed the title ~~Fix PIVOT~~ [SQL] PIVOT count() empty buckets should return 0 instead of NULL Jun 26, 2026

miland-db changed the title ~~[SQL] PIVOT count() empty buckets should return 0 instead of NULL~~ [SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL Jun 26, 2026

MaxGekk reviewed Jun 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL#56808

[SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL#56808
miland-db wants to merge 3 commits into
apache:masterfrom
miland-db:milan-dankovic_data/pivot-fix

miland-db commented Jun 26, 2026 •

edited

Loading

Uh oh!

miland-db commented Jun 26, 2026

Uh oh!

MaxGekk left a comment

Uh oh!

MaxGekk Jun 26, 2026

Uh oh!

MaxGekk Jun 26, 2026

Uh oh!

MaxGekk Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

miland-db commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

miland-db commented Jun 26, 2026

Uh oh!

MaxGekk left a comment

Choose a reason for hiding this comment

Correctness (2)

Suggestions (1)

Verification

Uh oh!

MaxGekk Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

MaxGekk Jun 26, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

miland-db commented Jun 26, 2026 •

edited

Loading