fix(spark-expr): preserve scalar tag in WideDecimalBinaryExpr when both inputs are scalars #4187

tomz wants to merge 1 commit into apache:main
Conversation
fix(spark-expr): preserve scalar tag in WideDecimalBinaryExpr when both inputs are scalars

WideDecimalBinaryExpr::evaluate always returned ColumnarValue::Array, even when both inputs were Scalar. The resulting length-1 array lost its scalar tag, so a downstream comparison against a full batch hit arrow-ord's "Cannot compare arrays of different lengths, got 8192 vs 1".

This is the root cause of the TPC-DS q23 BroadcastHashJoin crash (issue apache#1615): the filter '0.95 * scalar_subquery > ssales' produces a Scalar x Scalar decimal multiply whose length-1 result was then compared against the 8192-row ssales column.

Detect the both-scalar case and unwrap the length-1 result back into a ColumnarValue::Scalar so downstream kernels take the scalar fast-path.

Adds two unit tests:
- test_scalar_scalar_returns_scalar: regression for the crash
- test_array_input_returns_array: guards the array path

Closes apache#1615
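For illustration only (not part of this commit): the scalar tag matters because DataFusion broadcasts a `ColumnarValue::Scalar` to the batch length on demand, while a de-tagged length-1 `Array` stays length 1 and later collides with the 8192-row column. A minimal sketch, assuming current DataFusion's `ColumnarValue`/`ScalarValue` APIs (the `into_array`/`to_array` signatures vary slightly across versions):

```rust
use datafusion::common::ScalarValue;
use datafusion::physical_plan::ColumnarValue;

fn main() -> datafusion::error::Result<()> {
    // Tagged scalar: broadcasts to the requested batch length.
    let tagged = ColumnarValue::Scalar(ScalarValue::Int32(Some(7)));
    assert_eq!(tagged.into_array(8192)?.len(), 8192);

    // De-tagged scalar (a length-1 array): does not broadcast.
    let untagged = ColumnarValue::Array(ScalarValue::Int32(Some(7)).to_array()?);
    assert_eq!(untagged.into_array(8192)?.len(), 1);
    Ok(())
}
```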
I am not sure if the issue still exists after the arrow version bump. @tomz, was the benchmark query failing before this fix was implemented?
@coderfender yes, still reproducing on current `main`. To verify, I checked out unmodified `main` and reproduced the crash. With the fix in this PR applied, both new regression tests pass and q23 SF=1 BHJ no longer crashes.
Thank you. I will try and replicate this on my machine later today / this week and update you |
Thanks @tomz. I ran the tests locally without the fix and reproduced the issue.
Review thread on:

```rust
panic!(
    "Scalar x Scalar must return ColumnarValue::Scalar, not Array. This is the q23 BHJ crash regression (issue #1615)."
);
```
As @coderfender mentioned, this is unrelated to #1615, which was for an Arrow C-Data offset-buffer bug. Also there is some weird spacing in this message.
Suggested change:

```rust
panic!(
    "Scalar x Scalar must return ColumnarValue::Scalar, not Array"
);
```
Draft Pull Request — apache/datafusion-comet
Title
fix(spark-expr): preserve scalar tag in `WideDecimalBinaryExpr` when both inputs are scalars

Closes
N/A
edit: removed original text linking to #1615 (TPC-DS q23 BHJ scalar-subquery crash)
Summary
`WideDecimalBinaryExpr::evaluate` currently always returns `ColumnarValue::Array`, even when both inputs are `ColumnarValue::Scalar`. Returning a length-1 `Array` instead of a `Scalar` drops the scalar tag, so downstream expressions (binary ops, comparisons) take the array path through arrow-rs kernels.

When the resulting length-1 array later meets a full-batch column in a `>`/`<`/`=` comparison, arrow-ord's `compare_op` rejects it with `Cannot compare arrays of different lengths, got 8192 vs 1`.

This crashes any plan in which a wide-decimal arithmetic result of two scalars feeds into a row-level comparison. The clearest reproducer is TPC-DS q23, whose plan contains the filter `0.95 * scalar_subquery > ssales`. `0.95 * scalar_subquery` is a `Scalar × Scalar` wide-decimal multiply; the resulting length-1 array is then compared against the 8192-row `ssales` column from the build side of the BroadcastHashJoin and the executor crashes.
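To make the failure mode concrete, here is an illustrative repro, not code from this PR: plain `Int32` arrays fed to arrow's comparison kernel (`arrow::compute::kernels::cmp::gt`) hit the same length check that the decimal path trips.

```rust
use arrow::array::Int32Array;
use arrow::compute::kernels::cmp::gt;

fn main() {
    // The "scalar" that lost its tag: a plain length-1 array.
    let untagged_scalar = Int32Array::from(vec![1]);
    // A full 8192-row batch column.
    let batch_column = Int32Array::from(vec![0; 8192]);
    // Two non-scalar operands of different lengths: the kernel rejects
    // this with a length-mismatch error like the one quoted above.
    assert!(gt(&batch_column, &untagged_scalar).is_err());
}
```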
Fix

Track scalar-ness of both inputs at the top of `evaluate` and, at the end, if both inputs were scalars, unwrap the length-1 result back into a `ColumnarValue::Scalar` via `ScalarValue::try_from_array(&result, 0)`. Three lines of real logic; the rest is a comment block explaining the bug.
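A minimal sketch of the shape of the change, assuming DataFusion's `ColumnarValue`/`ScalarValue` APIs; `retag_if_both_scalar` and `both_scalar` are illustrative names, not the PR's actual code:

```rust
use datafusion::common::{Result, ScalarValue};
use datafusion::physical_plan::ColumnarValue;

fn retag_if_both_scalar(result: ColumnarValue, both_scalar: bool) -> Result<ColumnarValue> {
    match result {
        // The decimal kernel produced a length-1 array from two scalar
        // inputs: unwrap element 0 back into a Scalar so downstream
        // kernels take the scalar fast-path.
        ColumnarValue::Array(arr) if both_scalar => Ok(ColumnarValue::Scalar(
            ScalarValue::try_from_array(&arr, 0)?,
        )),
        other => Ok(other),
    }
}
```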
Tests
Added a unit test `test_scalar_scalar_returns_scalar` to `wide_decimal_binary_expr.rs` covering (a sketch of its shape follows the list):

- `Scalar × Scalar` decimal multiply returns `ColumnarValue::Scalar`, not `Array`.
- `Array op X` and `X op Array` paths still return `Array` (regression guard).
- End-to-end: TPC-DS q23 now passes at SF1 and SF10 (previously crashed in both).
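A hedged sketch of the regression test's shape; `eval_wide_decimal_mul` below is a simplified stand-in for the real expression construction in `wide_decimal_binary_expr.rs`, with toy precision/scale handling:

```rust
use datafusion::common::{Result, ScalarValue};
use datafusion::physical_plan::ColumnarValue;

// Simplified stand-in for the expression under test (NOT the PR's kernel):
// multiplies two decimal scalars and, per the fix, keeps the Scalar tag.
fn eval_wide_decimal_mul(l: ColumnarValue, r: ColumnarValue) -> Result<ColumnarValue> {
    match (l, r) {
        (
            ColumnarValue::Scalar(ScalarValue::Decimal128(Some(a), p, s)),
            ColumnarValue::Scalar(ScalarValue::Decimal128(Some(b), _, s2)),
        ) => Ok(ColumnarValue::Scalar(ScalarValue::Decimal128(
            Some(a * b),
            p,
            s + s2, // toy scale handling, for illustration only
        ))),
        _ => unimplemented!("sketch only covers the Scalar x Scalar case"),
    }
}

#[test]
fn test_scalar_scalar_returns_scalar() {
    let lhs = ColumnarValue::Scalar(ScalarValue::Decimal128(Some(95), 38, 2));
    let rhs = ColumnarValue::Scalar(ScalarValue::Decimal128(Some(1_234), 38, 2));
    let result = eval_wide_decimal_mul(lhs, rhs).unwrap();
    assert!(
        matches!(result, ColumnarValue::Scalar(_)),
        "Scalar x Scalar must return ColumnarValue::Scalar, not Array"
    );
}
```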
Benchmark Results (TPC-DS, 10 cores, Spark 4.2.0-SNAPSHOT)
The q14 speedup is a bonus: q14 has the same `Scalar × Scalar` decimal pattern in its filter, and previously paid a heavy materialization cost from the bogus length-1 array path.

Why this didn't trip earlier tests
`WideDecimalBinaryExpr` is only chosen when output precision ≥ 38. Most existing tests evaluate it against `RecordBatch` inputs (i.e. `Array` operands via `Column`), so the `Scalar × Scalar` branch was never exercised. The bug only surfaces when the planner produces both-scalar inputs, which happens with `literal × scalar_subquery`, exactly q23's pattern.
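A minimal sketch of why batch-driven tests only see the array path, assuming DataFusion's `Column` physical expression; the schema and values are made up:

```rust
use std::sync::Arc;

use datafusion::arrow::array::Int32Array;
use datafusion::arrow::datatypes::{DataType, Field, Schema};
use datafusion::arrow::record_batch::RecordBatch;
use datafusion::physical_expr::expressions::Column;
use datafusion::physical_plan::{ColumnarValue, PhysicalExpr};

fn main() -> datafusion::error::Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new("a", DataType::Int32, false)]));
    let batch = RecordBatch::try_new(
        schema,
        vec![Arc::new(Int32Array::from(vec![1, 2, 3]))],
    )?;
    let col = Column::new("a", 0);
    // Evaluating against a RecordBatch always yields an Array operand,
    // never a Scalar, so the both-scalar branch is unreachable here.
    assert!(matches!(col.evaluate(&batch)?, ColumnarValue::Array(_)));
    Ok(())
}
```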
Risk

Minimal:

- Only the returned tag changes (`Scalar` vs `Array`); the computed value is the same.
- Downstream consumers already handle `ColumnarValue::Scalar` correctly; that's the normal contract.

Files changed
`native/spark-expr/src/math_funcs/wide_decimal_binary_expr.rs` (+~16 lines incl. comments + 1 unit test)

Suggested commit message