feat: improve shuffle size estimation [experimental!] by andygrove · Pull Request #4164 · apache/datafusion-comet

andygrove · 2026-04-30T19:14:50Z

Which issue does this PR close?

Closes #.

Rationale for this change

Change the default sizeInBytesMultiplier from 1.0 to 2.0 so that out-of-the-box Comet better preserves Spark's join strategy decisions.
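
To illustrate why the multiplier can change a join strategy, here is a minimal sketch, not Comet's or Spark's actual code. It assumes that the size reported for a shuffle stage is the raw shuffle byte count scaled by the multiplier, and that Spark picks a broadcast join when that size is at or below `spark.sql.autoBroadcastJoinThreshold` (10 MiB by default). The function names are hypothetical, and the real planner logic (including adaptive execution) is more involved.

```python
# Default value of spark.sql.autoBroadcastJoinThreshold: 10 MiB.
AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024


def reported_size_in_bytes(shuffle_bytes: int, multiplier: float) -> int:
    """Hypothetical helper: scale the raw shuffle size by the multiplier."""
    return int(shuffle_bytes * multiplier)


def would_broadcast(
    shuffle_bytes: int,
    multiplier: float,
    threshold: int = AUTO_BROADCAST_JOIN_THRESHOLD,
) -> bool:
    """Simplified stand-in for the planner's size-based broadcast check."""
    return reported_size_in_bytes(shuffle_bytes, multiplier) <= threshold


six_mib = 6 * 1024 * 1024
# With multiplier 1.0 the 6 MiB stage is reported as 6 MiB (under the threshold).
# With multiplier 2.0 it is reported as 12 MiB (over the threshold).
print(would_broadcast(six_mib, 1.0), would_broadcast(six_mib, 2.0))
```

Under these assumptions, a stage that writes 6 MiB of shuffle data falls under the threshold with a multiplier of 1.0 but not with 2.0, so the same query could end up with a different join strategy.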

Add a new config spark.comet.shuffle.sizeInBytesMultiplier.dynamic that, when enabled, computes the multiplier from the shuffle output schema by comparing per-field Arrow columnar widths against UnsafeRow fixed-width costs. This gives a more accurate estimate for schemas where the ratio varies significantly (e.g. narrow bool/byte columns vs wide long columns).
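
The schema-based comparison described above could look roughly like the following sketch. This is not the code in this PR: the formula, the type names, and the helper functions are assumptions made for illustration. It assumes each UnsafeRow field takes an 8-byte fixed-width slot plus a null bitset of 8 bytes per 64 fields, and that each Arrow value takes its natural width plus one validity bit, with booleans bit-packed. Variable-length types and shuffle compression are ignored.

```python
import math

# Bytes per value in Arrow fixed-width buffers (booleans are bit-packed).
ARROW_VALUE_BYTES = {
    "boolean": 1 / 8,
    "byte": 1,
    "short": 2,
    "int": 4,
    "long": 8,
    "float": 4,
    "double": 8,
}
ARROW_VALIDITY_BYTES = 1 / 8  # one validity bit per value
UNSAFE_ROW_SLOT_BYTES = 8     # every field gets an 8-byte fixed-width slot


def unsafe_row_bytes(num_fields: int) -> int:
    """Fixed-width cost of one UnsafeRow: null bitset plus one slot per field."""
    null_bitset = 8 * math.ceil(num_fields / 64)
    return null_bitset + UNSAFE_ROW_SLOT_BYTES * num_fields


def arrow_row_bytes(field_types: list[str]) -> float:
    """Approximate per-row Arrow cost: value width plus validity bit per field."""
    return sum(ARROW_VALUE_BYTES[t] + ARROW_VALIDITY_BYTES for t in field_types)


def estimate_multiplier(field_types: list[str]) -> float:
    """Hypothetical dynamic multiplier: UnsafeRow bytes over Arrow bytes."""
    if not field_types:
        return 1.0
    return unsafe_row_bytes(len(field_types)) / arrow_row_bytes(field_types)


print(estimate_multiplier(["long", "long"]))     # close to 1.5
print(estimate_multiplier(["boolean", "byte"]))  # close to 17.5
```

In this sketch a schema of two long columns gives a ratio near 1.5, while a boolean plus a byte column gives a ratio near 17.5, which is the kind of spread a single fixed multiplier cannot capture.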

What changes are included in this PR?

How are these changes tested?

andygrove wants to merge 1 commit into apache:main from andygrove:worktree-shuffle-size-multiplier
