fix(pool): monotonic claim cursor — close cross-dispatch task-ring race (intermittent join row drop)#232
Merged
Merged
Conversation
ray_pool_dispatch/_n reset task_tail to 0 and refilled the task ring on every dispatch. The spin-wait barrier waits only for pending==0 (all tasks done), not for worker quiescence — the code comments already note "workers may still be between pending-- and sem_wait". A worker that woke late on a surplus work_ready signal could therefore execute one more fetch_add(task_tail) while the NEXT dispatch was mid-republish (task_tail=0 + ring refill), claiming a slot from the new window before it was fully set up. For the join radix partition (two back-to-back parallel dispatches: histogram then scatter) this double-ran/skipped a scatter task, and since the scatter advances c->offsets[tid] in place, a partition's entries got overwritten — dropping a row from the join (intermittent count 7 vs 8, ASan stack-use-after-return on the dispatcher's stack ctx; only reproduced under specific core counts). Replace the reset-each-dispatch cursor with two MONOTONIC 64-bit cursors that are never reset: task_claim (next ticket) and task_limit (published window end). Each dispatch carves a fresh window [base, base+n) off the high-water mark and publishes it with a single store-release; a ticket maps to ring slot (ticket & (task_cap-1)). Workers claim with a bounded CAS (pool_claim) that never advances task_claim past task_limit, so the cursor stays exact across dispatches and a late worker either sees no work (cur>=limit -> break, touches nothing) or claims a valid ticket of the current window — the half-republished slot is structurally impossible. No reset, no quiescence wait (which would serialise on the slowest worker and regress throughput). Hot path is unchanged in shape — one bounded CAS per task instead of one fetch_add; tasks are morsel-coarse so claim contention is negligible. Verified: 35k+ large-radix joins under gcc+ASan with stack-use-after-return detection (was ~1 bad per 2k before) — zero failures; full suite green under gcc+ASan (3232/3234, 0 failed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(M-pool) Adds a >65536-row inner join repeated 200× to integration/joins.rfl. The big right side drives the parallel radix scatter (two back-to-back pool dispatches), the path whose cross-dispatch task-ring race intermittently dropped a join row. The fixed monotonic-claim pool returns the exact count every iteration; on the buggy pool the loop sum falls below 1600 whenever a row is dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The radix join intermittently dropped a row —
inner-joinreturning a count of 7 where 8 was correct — and ASan (detect_stack_use_after_return) flagged astack-use-after-returninjoin_radix_scatter_fnreading its dispatcher's stack context. It reproduced only on certain core counts / scheduling (green on macOS and in release; red onubuntu-latest, debug), which is why it surfaced as a flaky CI failure.Root cause — a pool race, not a join bug
ray_pool_dispatch/ray_pool_dispatch_nresettask_tail = 0and refilled the task ring on every dispatch. The spin-wait barrier waits only forpending == 0(all tasks done), not for worker quiescence — the existing comments already note "workers may still be between pending-- and sem_wait" and "no semaphore — avoids surplus-signal bug between consecutive dispatches".So a worker that woke late on a surplus
work_readysignal could execute one morefetch_add(task_tail)while the next dispatch was mid-republish (reset + refill), claiming a slot from the new window before it was fully set up. The join radix partition runs two back-to-back parallel dispatches (histogram → scatter) with almost no work between them, which is exactly the timing that exposes it: a scatter task got double-run / skipped, and because the scatter advancesc->offsets[tid]in place, one partition's entries were overwritten → a build-side row vanished → the probe missed it → count 7.(Histogram and scatter use byte-identical row-range math, so it is not a morsel-boundary mismatch; and a bigger ring only changes timing — it does not remove the reset.)
Fix — monotonic claim cursors (lock-free, no quiescence wait)
Replace the reset-each-dispatch
task_tail/task_count(u32) with two monotonic 64-bit cursors that are never reset:task_claim(next ticket) andtask_limit(published window end). Each dispatch carves a fresh window[base, base+n)off the high-water mark and publishes it with a single store-release; a ticket maps to ring slotticket & (task_cap-1).Workers claim with a bounded CAS (
pool_claim) that never advancestask_claimpasttask_limit, so the cursor stays exact across dispatches. A late worker therefore either:cur >= limit→ breaks, touching nothing), orThe half-republished slot is structurally impossible — there is no reset. No quiescence wait is added (that would serialise on the slowest worker and regress throughput). The hot path keeps the same shape: one bounded CAS per task instead of one
fetch_add; tasks are morsel-coarse so claim contention is negligible.Verification
inner-joinrepeated in a tight loop. Before: ~1 dropped row per ~2 000 joins. After: 35 000+ joins under gcc+ASan (stack-use-after-return on) and 10 000 under clang+ASan — zero failures.-O3 -march=native) — by-group radix dispatch, whole-table parallel reduce, and a morsel-heavyTASK_GRAINfilter are all within run-to-run noise vsmaster(by-group if anything marginally faster).integration/joins.rflgains a >65 536-row join repeated 200× — the back-to-back radix-dispatch path that exposed the race.Containment:
task_claim/task_limitare pool-internal; no callers outsidepool.ctouch these fields.🤖 Generated with Claude Code