feat: improve shuffle size estimation [experimental!] #4164

Draft
andygrove wants to merge 1 commit into apache:main from andygrove:worktree-shuffle-size-multiplier

Which issue does this PR close?

Closes #.

Rationale for this change

Change the default sizeInBytesMultiplier from 1.0 to 2.0 so that out-of-the-box Comet better preserves Spark's join strategy decisions.
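
To illustrate why the multiplier can change a join decision, here is a minimal sketch. It assumes the reported shuffle size is simply the columnar byte count scaled by the multiplier, and that the planner broadcasts when the reported size is at or below `spark.sql.autoBroadcastJoinThreshold` (10 MiB by default in Spark). The function names are hypothetical and the real planner logic is more involved.

```python
# Spark's default spark.sql.autoBroadcastJoinThreshold is 10 MiB.
BROADCAST_THRESHOLD = 10 * 1024 * 1024


def reported_size(columnar_bytes, multiplier):
    """Hypothetical helper: scale the columnar shuffle size by the multiplier."""
    return int(columnar_bytes * multiplier)


def would_broadcast(columnar_bytes, multiplier, threshold=BROADCAST_THRESHOLD):
    """Simplified stand-in for a size-based broadcast check."""
    return reported_size(columnar_bytes, multiplier) <= threshold


six_mib = 6 * 1024 * 1024
print(would_broadcast(six_mib, 1.0))  # reported as 6 MiB, under the threshold
print(would_broadcast(six_mib, 2.0))  # reported as 12 MiB, over the threshold
```

In this sketch, a 6 MiB columnar shuffle is a broadcast candidate with a multiplier of 1.0 but not with 2.0.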

Add a new config spark.comet.shuffle.sizeInBytesMultiplier.dynamic that, when enabled, computes the multiplier from the shuffle output schema by comparing per-field Arrow columnar widths against UnsafeRow fixed-width costs. This gives a more accurate estimate for schemas where the ratio varies significantly (e.g. narrow bool/byte columns vs wide long columns).
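
As a rough sketch of the idea (not the code in this PR), the ratio can be estimated for fixed-width schemas as follows. It assumes UnsafeRow uses 8 bytes per field plus an 8-byte null bitset word per 64 fields, and that Arrow stores each value at its natural width plus one validity bit. The type names, the helper functions, and the exact formula are assumptions for illustration; variable-width types are left out.

```python
# Arrow value widths in bits for a few fixed-width types (booleans are bit-packed).
ARROW_VALUE_BITS = {
    "boolean": 1,
    "byte": 8,
    "short": 16,
    "int": 32,
    "long": 64,
    "float": 32,
    "double": 64,
}


def unsafe_row_bytes(num_fields):
    """Per-row UnsafeRow cost: null bitset words plus one 8-byte slot per field."""
    bitset_words = (num_fields + 63) // 64
    return 8 * bitset_words + 8 * num_fields


def arrow_bytes_per_row(types):
    """Per-row Arrow cost: value bits plus one validity bit per field."""
    bits = sum(ARROW_VALUE_BITS[t] + 1 for t in types)
    return bits / 8


def estimate_multiplier(types):
    """Hypothetical multiplier: UnsafeRow bytes divided by Arrow bytes per row."""
    if not types:
        raise ValueError("schema must have at least one field")
    return unsafe_row_bytes(len(types)) / arrow_bytes_per_row(types)


print(estimate_multiplier(["long", "long"]))     # 24 / 16.25, about 1.48
print(estimate_multiplier(["boolean", "byte"]))  # 24 / 1.375, about 17.45
```

Under these assumptions a schema of two long columns yields a ratio near 1.5, while a boolean plus a byte column yields a ratio above 17, which is the kind of spread a single static multiplier cannot capture.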

What changes are included in this PR?

How are these changes tested?
