fix(pool): monotonic claim cursor — close cross-dispatch task-ring race (intermittent join row drop) by ser-vasilich · Pull Request #232 · RayforceDB/rayforce

ser-vasilich · 2026-06-08T10:15:42Z

Problem

The radix join intermittently dropped a row — inner-join returning a count of 7 where 8 was correct — and ASan (detect_stack_use_after_return) flagged a stack-use-after-return in join_radix_scatter_fn reading its dispatcher's stack context. It reproduced only on certain core counts / scheduling (green on macOS and in release; red on ubuntu-latest, debug), which is why it surfaced as a flaky CI failure.

Root cause — a pool race, not a join bug

ray_pool_dispatch / ray_pool_dispatch_n reset task_tail = 0 and refilled the task ring on every dispatch. The spin-wait barrier waits only for pending == 0 (all tasks done), not for worker quiescence — the existing comments already note "workers may still be between pending-- and sem_wait" and "no semaphore — avoids surplus-signal bug between consecutive dispatches".

So a worker that woke late on a surplus work_ready signal could execute one more fetch_add(task_tail) while the next dispatch was mid-republish (reset + refill), claiming a slot from the new window before it was fully set up. The join radix partition runs two back-to-back parallel dispatches (histogram → scatter) with almost no work between them, which is exactly the timing that exposes it: a scatter task got double-run / skipped, and because the scatter advances c->offsets[tid] in place, one partition's entries were overwritten → a build-side row vanished → the probe missed it → count 7.

(Histogram and scatter use byte-identical row-range math, so it is not a morsel-boundary mismatch; and a bigger ring only changes timing — it does not remove the reset.)

Fix — monotonic claim cursors (lock-free, no quiescence wait)

Replace the reset-each-dispatch task_tail/task_count (u32) with two monotonic 64-bit cursors that are never reset: task_claim (next ticket) and task_limit (published window end). Each dispatch carves a fresh window [base, base+n) off the high-water mark and publishes it with a single store-release; a ticket maps to ring slot ticket & (task_cap-1).

Workers claim with a bounded CAS (pool_claim) that never advances task_claim past task_limit, so the cursor stays exact across dispatches. A late worker therefore either:

sees no work (cur >= limit → breaks, touching nothing), or
claims a valid ticket of the current window and runs the right task.

The half-republished slot is structurally impossible — there is no reset. No quiescence wait is added (that would serialise on the slowest worker and regress throughput). The hot path keeps the same shape: one bounded CAS per task instead of one fetch_add; tasks are morsel-coarse so claim contention is negligible.

Verification

Reproducer: a 70k-row radix inner-join repeated in a tight loop. Before: ~1 dropped row per ~2 000 joins. After: 35 000+ joins under gcc+ASan (stack-use-after-return on) and 10 000 under clang+ASan — zero failures.
Full suite green under gcc+ASan/UBSan: 3232 / 3234 (2 pre-existing skips), 0 failed.
No perf regression: release benchmark (20M rows, -O3 -march=native) — by-group radix dispatch, whole-table parallel reduce, and a morsel-heavy TASK_GRAIN filter are all within run-to-run noise vs master (by-group if anything marginally faster).
Regression test: integration/joins.rfl gains a >65 536-row join repeated 200× — the back-to-back radix-dispatch path that exposed the race.

Containment: task_claim/task_limit are pool-internal; no callers outside pool.c touch these fields.

🤖 Generated with Claude Code

ray_pool_dispatch/_n reset task_tail to 0 and refilled the task ring on every dispatch. The spin-wait barrier waits only for pending==0 (all tasks done), not for worker quiescence — the code comments already note "workers may still be between pending-- and sem_wait". A worker that woke late on a surplus work_ready signal could therefore execute one more fetch_add(task_tail) while the NEXT dispatch was mid-republish (task_tail=0 + ring refill), claiming a slot from the new window before it was fully set up. For the join radix partition (two back-to-back parallel dispatches: histogram then scatter) this double-ran/skipped a scatter task, and since the scatter advances c->offsets[tid] in place, a partition's entries got overwritten — dropping a row from the join (intermittent count 7 vs 8, ASan stack-use-after-return on the dispatcher's stack ctx; only reproduced under specific core counts). Replace the reset-each-dispatch cursor with two MONOTONIC 64-bit cursors that are never reset: task_claim (next ticket) and task_limit (published window end). Each dispatch carves a fresh window [base, base+n) off the high-water mark and publishes it with a single store-release; a ticket maps to ring slot (ticket & (task_cap-1)). Workers claim with a bounded CAS (pool_claim) that never advances task_claim past task_limit, so the cursor stays exact across dispatches and a late worker either sees no work (cur>=limit -> break, touches nothing) or claims a valid ticket of the current window — the half-republished slot is structurally impossible. No reset, no quiescence wait (which would serialise on the slowest worker and regress throughput). Hot path is unchanged in shape — one bounded CAS per task instead of one fetch_add; tasks are morsel-coarse so claim contention is negligible. Verified: 35k+ large-radix joins under gcc+ASan with stack-use-after-return detection (was ~1 bad per 2k before) — zero failures; full suite green under gcc+ASan (3232/3234, 0 failed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…(M-pool) Adds a >65536-row inner join repeated 200× to integration/joins.rfl. The big right side drives the parallel radix scatter (two back-to-back pool dispatches), the path whose cross-dispatch task-ring race intermittently dropped a join row. The fixed monotonic-claim pool returns the exact count every iteration; on the buggy pool the loop sum falls below 1600 whenever a row is dropped. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ser-vasilich and others added 2 commits June 8, 2026 13:04

singaraiona merged commit 17b7743 into master Jun 8, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(pool): monotonic claim cursor — close cross-dispatch task-ring race (intermittent join row drop)#232

fix(pool): monotonic claim cursor — close cross-dispatch task-ring race (intermittent join row drop)#232
singaraiona merged 2 commits into
masterfrom
fix/pool-dispatch-cross-race

ser-vasilich commented Jun 8, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ser-vasilich commented Jun 8, 2026

Problem

Root cause — a pool race, not a join bug

Fix — monotonic claim cursors (lock-free, no quiescence wait)

Verification

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants