Skip to content

fix(pool): monotonic claim cursor — close cross-dispatch task-ring race (intermittent join row drop)#232

Merged
singaraiona merged 2 commits into
masterfrom
fix/pool-dispatch-cross-race
Jun 8, 2026
Merged

fix(pool): monotonic claim cursor — close cross-dispatch task-ring race (intermittent join row drop)#232
singaraiona merged 2 commits into
masterfrom
fix/pool-dispatch-cross-race

Conversation

@ser-vasilich

Copy link
Copy Markdown
Collaborator

Problem

The radix join intermittently dropped a row — inner-join returning a count of 7 where 8 was correct — and ASan (detect_stack_use_after_return) flagged a stack-use-after-return in join_radix_scatter_fn reading its dispatcher's stack context. It reproduced only on certain core counts / scheduling (green on macOS and in release; red on ubuntu-latest, debug), which is why it surfaced as a flaky CI failure.

Root cause — a pool race, not a join bug

ray_pool_dispatch / ray_pool_dispatch_n reset task_tail = 0 and refilled the task ring on every dispatch. The spin-wait barrier waits only for pending == 0 (all tasks done), not for worker quiescence — the existing comments already note "workers may still be between pending-- and sem_wait" and "no semaphore — avoids surplus-signal bug between consecutive dispatches".

So a worker that woke late on a surplus work_ready signal could execute one more fetch_add(task_tail) while the next dispatch was mid-republish (reset + refill), claiming a slot from the new window before it was fully set up. The join radix partition runs two back-to-back parallel dispatches (histogram → scatter) with almost no work between them, which is exactly the timing that exposes it: a scatter task got double-run / skipped, and because the scatter advances c->offsets[tid] in place, one partition's entries were overwritten → a build-side row vanished → the probe missed it → count 7.

(Histogram and scatter use byte-identical row-range math, so it is not a morsel-boundary mismatch; and a bigger ring only changes timing — it does not remove the reset.)

Fix — monotonic claim cursors (lock-free, no quiescence wait)

Replace the reset-each-dispatch task_tail/task_count (u32) with two monotonic 64-bit cursors that are never reset: task_claim (next ticket) and task_limit (published window end). Each dispatch carves a fresh window [base, base+n) off the high-water mark and publishes it with a single store-release; a ticket maps to ring slot ticket & (task_cap-1).

Workers claim with a bounded CAS (pool_claim) that never advances task_claim past task_limit, so the cursor stays exact across dispatches. A late worker therefore either:

  • sees no work (cur >= limit → breaks, touching nothing), or
  • claims a valid ticket of the current window and runs the right task.

The half-republished slot is structurally impossible — there is no reset. No quiescence wait is added (that would serialise on the slowest worker and regress throughput). The hot path keeps the same shape: one bounded CAS per task instead of one fetch_add; tasks are morsel-coarse so claim contention is negligible.

Verification

  • Reproducer: a 70k-row radix inner-join repeated in a tight loop. Before: ~1 dropped row per ~2 000 joins. After: 35 000+ joins under gcc+ASan (stack-use-after-return on) and 10 000 under clang+ASan — zero failures.
  • Full suite green under gcc+ASan/UBSan: 3232 / 3234 (2 pre-existing skips), 0 failed.
  • No perf regression: release benchmark (20M rows, -O3 -march=native) — by-group radix dispatch, whole-table parallel reduce, and a morsel-heavy TASK_GRAIN filter are all within run-to-run noise vs master (by-group if anything marginally faster).
  • Regression test: integration/joins.rfl gains a >65 536-row join repeated 200× — the back-to-back radix-dispatch path that exposed the race.

Containment: task_claim/task_limit are pool-internal; no callers outside pool.c touch these fields.

🤖 Generated with Claude Code

ser-vasilich and others added 2 commits June 8, 2026 13:04
ray_pool_dispatch/_n reset task_tail to 0 and refilled the task ring on every
dispatch.  The spin-wait barrier waits only for pending==0 (all tasks done),
not for worker quiescence — the code comments already note "workers may still
be between pending-- and sem_wait".  A worker that woke late on a surplus
work_ready signal could therefore execute one more fetch_add(task_tail) while
the NEXT dispatch was mid-republish (task_tail=0 + ring refill), claiming a
slot from the new window before it was fully set up.  For the join radix
partition (two back-to-back parallel dispatches: histogram then scatter) this
double-ran/skipped a scatter task, and since the scatter advances
c->offsets[tid] in place, a partition's entries got overwritten — dropping a
row from the join (intermittent count 7 vs 8, ASan stack-use-after-return on
the dispatcher's stack ctx; only reproduced under specific core counts).

Replace the reset-each-dispatch cursor with two MONOTONIC 64-bit cursors that
are never reset: task_claim (next ticket) and task_limit (published window
end).  Each dispatch carves a fresh window [base, base+n) off the high-water
mark and publishes it with a single store-release; a ticket maps to ring slot
(ticket & (task_cap-1)).  Workers claim with a bounded CAS (pool_claim) that
never advances task_claim past task_limit, so the cursor stays exact across
dispatches and a late worker either sees no work (cur>=limit -> break, touches
nothing) or claims a valid ticket of the current window — the half-republished
slot is structurally impossible.  No reset, no quiescence wait (which would
serialise on the slowest worker and regress throughput).

Hot path is unchanged in shape — one bounded CAS per task instead of one
fetch_add; tasks are morsel-coarse so claim contention is negligible.

Verified: 35k+ large-radix joins under gcc+ASan with stack-use-after-return
detection (was ~1 bad per 2k before) — zero failures; full suite green under
gcc+ASan (3232/3234, 0 failed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…(M-pool)

Adds a >65536-row inner join repeated 200× to integration/joins.rfl.  The big
right side drives the parallel radix scatter (two back-to-back pool dispatches),
the path whose cross-dispatch task-ring race intermittently dropped a join row.
The fixed monotonic-claim pool returns the exact count every iteration; on the
buggy pool the loop sum falls below 1600 whenever a row is dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@singaraiona singaraiona merged commit 17b7743 into master Jun 8, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants