[SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL#56808
[SPARK-57715][SQL] PIVOT count() empty buckets should return 0 instead of NULL#56808miland-db wants to merge 3 commits into
Conversation
|
Adding @cloud-fan |
MaxGekk
left a comment
There was a problem hiding this comment.
1 blocking, 2 non-blocking, 0 nits. Sound fix for the bare-count case; one correctness gap in the composite branch plus a dead-Coalesce nit.
Correctness (2)
- The fast-path
Coalesce(extractedValue, default)assumes "fast-path NULL ⟺ empty bucket", butPivotFirst.updatestores nothing for null values and keeps no presence bit, so a present-but-NULL composite bucket is indistinguishable from an empty one. Composites likenullif(count(v), 5)orif(count(v1) > 0, sum(v2), 0)then return the empty-default instead of NULL — a silent wrong result (flag on by default), plus a pivoted column declared non-nullable that can in fact be NULL. See inline onPivotTransformer.scala. - No test covers a composite aggregate that is NULL on a non-empty bucket (the case above); the composite tests use only never-NULL shapes. See inline on
DataFramePivotSuite.scala.
Suggestions (1)
- The
case Literal(null, _) => Noneguard misses null-evaluating non-literal defaults (CEIL(null),null + 1), so redundantCoalesce(x, <expr→null>)wrappers are emitted (visible in the goldens). Harmless to results but dead plan nodes — fold the default first. See inline onPivotTransformer.scala.
Verification
The bare-aggregate path is correct: count / approx_count_distinct (the non-null-defaultResult aggregates that reach the fast path) are never NULL on a non-empty bucket, so the Coalesce fires only on genuinely-empty buckets; sum/avg/min/max correctly stay NULL (defaultResult None → no Coalesce); the ANSI 0/0 deferral matches the slow path. The gap is specifically NULL-capable composites with a non-null empty-default.
| // default that only folds to NULL (e.g. sum(x) + 1) is still returned; its Coalesce evaluates | ||
| // to NULL on an empty bucket, which is the correct result. | ||
| case other => | ||
| val default = other.transform { |
There was a problem hiding this comment.
The composite branch can return a non-null default for an aggregate that is NULL-capable on a non-empty bucket, and the caller then wraps it in Coalesce(extractedValue, default). The problem is that PivotFirst.update stores a value only if (value != null) and keeps no presence bit, so a buffer slot is NULL iff the pivot category is absent or present-but-the-value-is-NULL — ExtractValue can't distinguish them, and the Coalesce defaults both. That's correct for bare count-family aggregates (never NULL on a non-empty bucket), but wrong for composites over a NULL-capable sub-aggregate:
PIVOT(nullif(count(v), 5) ...)— a bucket with exactly 5 matching non-null rows: slow path =nullif(5,5)= NULL, fast path =Coalesce(NULL, nullif(0,5)=0)=0.PIVOT(if(count(v1) > 0, sum(v2), 0) ...)— a non-empty bucket with all-NULLv2: slow path = NULL, fast path =0.
Since the flag defaults totrue, these are live wrong results, and the pivoted column is also declared non-nullable while it can be NULL. The simplest safe fix is to apply the empty-default only for bare aggregates whosedefaultResultis non-null (count-family, where a NULL slot provably means empty) and leave NULL-capable composites to NULL / the slow path; a fuller fix would add a per-slot presence signal toPivotFirst.
| assert(!df.schema("b").nullable, s"expected 'b' column to be non-nullable: ${df.schema}") | ||
| } | ||
|
|
||
| test("pivot integral division of counts throws on an empty bucket under ANSI") { |
There was a problem hiding this comment.
The composite-aggregate coverage here is all never-NULL-on-non-empty shapes (count(v1) div count(v2) throws under ANSI; the bare-count default), so the suite passes even with the composite Coalesce issue above present. Worth adding a case that exercises a composite which is NULL on a non-empty bucket — e.g. nullif(count(v), N) over a bucket with exactly N matching rows asserting NULL, and if(count(v1) > 0, sum(v2), 0) over a non-empty all-NULL-v2 bucket asserting NULL — comparing against the slow path / flag-off.
| } | ||
| default match { | ||
| case _ if !trimAliases(default).foldable => None | ||
| case Literal(null, _) => None |
There was a problem hiding this comment.
Nit: this guard only catches a syntactic Literal(null), but a default that evaluates to null without being a literal — e.g. CEIL(null) (from CEIL(sum(x))) or (null + cast(1 as bigint)) (from sum(x) + 1) — slips through and yields Some(default), so a Coalesce(extractedValue, <expr→null>) wrapper is emitted (you can see it in the analyzer-results golden, e.g. coalesce(__pivot_CEIL(sum…), CEIL(null))). It's harmless to results (Coalesce(x, null) ≡ x, and nullability is unchanged) but adds dead plan nodes and doesn't match the doc's "Return None for … a literal NULL". Folding the default first would catch these: default.foldable && default.eval() == null => None.
What changes were proposed in this pull request?
PIVOT (count(val) FOR cat IN (...))returnedNULLinstead of0for pivot categories with no matching rows. This differs from standard SQL behavior and from systems such as DuckDB and SQL Server, which return0.Root cause: the fast path (PivotFirst + ExtractValue) leaves the slot of an empty pivot bucket unset, so
ExtractValuereturnsNULL. The slow path rewrites each aggregate asaggFunc(if(pivotCol <=> value, child, NULL))and is already correct:count(NULL)is 0.Why are the changes needed?
This is a correctness bug, present since pivot was introduced.
Does this PR introduce any user-facing change?
Yes.
PIVOTwith count now returns the aggregate's empty-input value (e.g. 0) instead ofNULLfor pivot categories with no matching rows.How was this patch tested?
New tests and golden files.
Was this patch authored or co-authored using generative AI tooling?
No.