Skip to content

fix(spark-expr): preserve scalar tag in WideDecimalBinaryExpr when both inputs are scalars#4187

Open
tomz wants to merge 1 commit intoapache:mainfrom
tomz:fix/wide-decimal-scalar-tag-1615
Open

fix(spark-expr): preserve scalar tag in WideDecimalBinaryExpr when both inputs are scalars#4187
tomz wants to merge 1 commit intoapache:mainfrom
tomz:fix/wide-decimal-scalar-tag-1615

Conversation

@tomz
Copy link
Copy Markdown

@tomz tomz commented May 2, 2026

Draft Pull Request — apache/datafusion-comet

Title

fix(spark-expr): preserve scalar tag in WideDecimalBinaryExpr when both inputs are scalars

Closes

N/A

edit: removed original text linking to #1615 (TPC-DS q23 BHJ scalar-subquery crash)

Summary

WideDecimalBinaryExpr::evaluate currently always returns ColumnarValue::Array, even when both inputs are ColumnarValue::Scalar. Returning a length-1 Array instead of a Scalar drops the scalar tag, so downstream expressions (binary ops, comparisons) take the array path through arrow-rs kernels.

When the resulting length-1 array later meets a full-batch column in a >/</= comparison, arrow-ord's compare_op rejects it with:

Cannot compare arrays of different lengths, got 8192 vs 1

This crashes any plan in which a wide-decimal arithmetic result of two scalars feeds into a row-level comparison. The clearest reproducer is TPC-DS q23, whose plan contains the filter

cast((0.95 * Subquery#268) as decimal(38,2)) > ssales

0.95 * scalar_subquery is a Scalar × Scalar wide-decimal multiply; the broken-out length-1 array is then compared against the 8192-row ssales column from the build side of the BroadcastHashJoin and the executor crashes.

Fix

Track scalar-ness of both inputs at the top of evaluate, and at the end, if both inputs were scalars, unwrap the length-1 result back into a ColumnarValue::Scalar via ScalarValue::try_from_array(&result, 0).

+        // Track scalar-ness so we can return a Scalar when both inputs are scalars.
+        // Without this, a (Scalar op Scalar) result is returned as a length-1 Array,
+        // and downstream comparisons against full batches incorrectly see two Array
+        // operands with mismatched lengths instead of (Array, Scalar).
+        let both_scalar = matches!(
+            (&left_val, &right_val),
+            (ColumnarValue::Scalar(_), ColumnarValue::Scalar(_))
+        );
         let (left_arr, right_arr): (ArrayRef, ArrayRef) = match (&left_val, &right_val) {
             ...
         };
         ...
         let result = result.with_data_type(DataType::Decimal128(p_out, s_out));
-        Ok(ColumnarValue::Array(Arc::new(result)))
+        if both_scalar {
+            let scalar = datafusion::common::ScalarValue::try_from_array(&result, 0)?;
+            Ok(ColumnarValue::Scalar(scalar))
+        } else {
+            Ok(ColumnarValue::Array(Arc::new(result)))
+        }

Three lines of real logic; the rest is a comment block explaining the bug.

Tests

Added a unit test test_scalar_scalar_returns_scalar to wide_decimal_binary_expr.rs covering:

  1. Scalar × Scalar decimal multiply returns ColumnarValue::Scalar, not Array.
  2. Existing Array op X and X op Array paths still return Array (regression guard).

End-to-end: TPC-DS q23 now passes at SF1 and SF10 (previously crashed in both).

Benchmark Results (TPC-DS, 10 cores, Spark 4.2.0-SNAPSHOT)

Before fix After fix
SF1 pass rate 98 / 99 (q23 crash) 99 / 99
SF1 total time 279.9s 259.9s (1.08× faster)
SF1 q14 time 32.8s 10.1s (3.2× faster)
SF10 pass rate 98 / 99 (q23 crash) 99 / 99 (3 consecutive runs)
SF10 q23 time crash 16.3 / 17.0 / 17.7s (consistent)

The q14 speedup is a bonus — q14 has the same Scalar × Scalar decimal pattern in its filter, and previously paid a heavy materialization cost from the bogus length-1 array path.

Why this didn't trip earlier tests

WideDecimalBinaryExpr is only chosen when output precision ≥ 38. Most existing tests evaluate it against RecordBatch inputs (i.e. Array operands via Column), so the Scalar × Scalar branch was never exercised. The bug only surfaces when the planner produces both-scalar inputs, which happens with literal × scalar_subquery — exactly q23's pattern.

Risk

Minimal:

  • No change to numeric semantics; the result bytes are unchanged.
  • Behavioral change is strictly in the wrapping of the result (Scalar vs Array).
  • All callers downstream already handle ColumnarValue::Scalar correctly — that's the normal contract.

Files changed

  • native/spark-expr/src/math_funcs/wide_decimal_binary_expr.rs (+~16 lines incl. comments + 1 unit test)

Suggested commit message

fix(spark-expr): preserve scalar tag in WideDecimalBinaryExpr when both inputs are scalars

WideDecimalBinaryExpr::evaluate always returned ColumnarValue::Array,
even when both inputs were Scalar. The resulting length-1 array lost
its scalar tag, so a downstream comparison against a full batch hit
arrow-ord's "Cannot compare arrays of different lengths, got 8192 vs 1".

This is the root cause of the TPC-DS q23 BroadcastHashJoin crash
(issue #1615): the filter `0.95 * scalar_subquery > ssales` produces
a Scalar × Scalar decimal multiply whose length-1 result was then
compared against the 8192-row ssales column.

Detect the both-scalar case and unwrap the length-1 result back into
a ColumnarValue::Scalar so downstream kernels take the scalar fast-path.

Adds a unit test for the both-scalar path.

Closes #1615

…th inputs are scalars

WideDecimalBinaryExpr::evaluate always returned ColumnarValue::Array,
even when both inputs were Scalar. The resulting length-1 array lost
its scalar tag, so a downstream comparison against a full batch hit
arrow-ord's "Cannot compare arrays of different lengths, got 8192 vs 1".

This is the root cause of the TPC-DS q23 BroadcastHashJoin crash
(issue apache#1615): the filter '0.95 * scalar_subquery > ssales' produces
a Scalar x Scalar decimal multiply whose length-1 result was then
compared against the 8192-row ssales column.

Detect the both-scalar case and unwrap the length-1 result back into
a ColumnarValue::Scalar so downstream kernels take the scalar fast-path.

Adds two unit tests:
 - test_scalar_scalar_returns_scalar: regression for the crash
 - test_array_input_returns_array:    guards the array path

Closes apache#1615
@coderfender
Copy link
Copy Markdown
Contributor

I am not sure if the issue still exists after arrow version bump . @tomz was the benchmark query failing before this fix was implemented?

@tomz
Copy link
Copy Markdown
Author

tomz commented May 4, 2026

@coderfender yes, still reproducing on current main (425f9c98, post arrow bump).

To verify, I checked out unmodified main and added only the test_scalar_scalar_returns_scalar test from this PR (no fix). It fails:

test test_scalar_scalar_returns_scalar_repro_1615 ... FAILED
panicked at wide_decimal_binary_expr.rs:
BUG REPRO #1615: Scalar x Scalar returned Array, not Scalar

WideDecimalBinaryExpr::evaluate on main still returns ColumnarValue::Array (length-1) when both inputs are Scalar — exactly the condition that triggers the q23 BHJ filter crash, since downstream comparisons against full batches then see two Array operands with mismatched lengths and arrow-ord rejects them with "Cannot compare arrays of different lengths, got N vs 1". The arrow version bump didn't change this code path.

With the fix in this PR applied, both new regression tests pass and q23 SF=1 BHJ no longer crashes.

@coderfender
Copy link
Copy Markdown
Contributor

Thank you. I will try and replicate this on my machine later today / this week and update you

@andygrove
Copy link
Copy Markdown
Member

Thanks @tomz. I ran the tests locally without the fix and reproduced the issue.

Comment on lines +619 to +621
panic!(
"Scalar x Scalar must return ColumnarValue::Scalar, not Array. This is the q23 BHJ crash regression (issue #1615)."
);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @coderfender mentioned, this is unrelated to #1615, which was for an Arrow C-Data offset-buffer bug. Also there is some weird spacing in this message.

Suggested change
panic!(
"Scalar x Scalar must return ColumnarValue::Scalar, not Array. This is the q23 BHJ crash regression (issue #1615)."
);
panic!(
"Scalar x Scalar must return ColumnarValue::Scalar, not Array"
);

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants