docs(design): propose leader balance scheduler by bootjp · Pull Request #953 · bootjp/elastickv

bootjp · 2026-06-11T04:52:39Z

Summary

Doc-only PR. Proposes a TiKV-style leader balance scheduler that automatically spreads Raft-group leaderships across nodes so one node does not end up leading every group (docs/design/2026_06_11_proposed_leader_balance_scheduler.md). Per the repo's design-doc-first workflow, no implementation code is included.

Goal (user's words): "TiKVのようにリーダー以外のノードに別のRaftグループのリーダーをおいて負荷を均等化したい" — put other groups' leaders on the non-leader nodes to even out the load. elastickv has no PD, so the equivalent of PD's balance-leader scheduler is embedded in the default-group leader.

Key design points

Who balances: the default-group leader (already hosts HLC renewal + the M3 auto-split scheduler). Balancer state is leader-local and resets on election via RegisterLeaderLossCallback.
Observation: local per-group State()/Leader() reads (every node runs every group), sibling to the existing publishLeaderTerms loop — no peer-polling RPC. No-leader groups are skipped.
Policy: pure count-based function. Act only when the max-minus-min leader count is >= 2 (the spread-of-1 remainder must not trigger); deterministic source (most) / target (fewest) with node/group-ID tie-breaks; strict-spread-decrease guard; one transfer per cycle; per-group (30s) + global (10s) cooldowns on a monotonic clock; load-weighted balancing is explicit future work.
Mechanism: always the targeted TransferLeadershipToServer. A follower cannot initiate a transfer (handleTransferLeadership rejects with errLeadershipTransferNotLeader), so if the over-loaded source node is not the balancer host, the request is forwarded over the existing per-group RaftAdmin gRPC service to that group's leader.
Safety/exclusions: skip groups mid-conf-change; never target a lagging follower (etcd/raft stalls the transfer until catch-up); never fight SQS leadership refusal (v1 excludes partitioned-FIFO groups from balancing); reserve --leaderBalancePinGroups; default group is balanceable but the doc calls out the HLC-renewal restart + one lease-read blip; rely on cooldowns rather than a hard SplitJob interlock (M3 is not a dependency).
Ops: default OFF behind --leaderBalance + runtime kill switch (atomic), bounded-cardinality metrics (leaderbalance_*, fixed-enum/per-node labels only), structured slog (group_id, from, to, ...), and the existing raftadmin CLI as the manual escape hatch.
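
A minimal sketch of the count-based policy summarized above, under stated assumptions: the `Transfer` type, the `pickTransfer` name, and the "lowest ID wins" direction of both tie-breaks are hypothetical and not taken from the design doc. Cooldowns and the eligibility filters are deliberately left out so the function stays pure.

```go
package main

import (
	"fmt"
	"sort"
)

// Transfer is a hypothetical result type: move GroupID's leadership From -> To.
type Transfer struct {
	GroupID uint64
	From    string
	To      string
}

// pickTransfer is a pure, count-based policy sketch. leaders maps each group
// ID to the node that currently leads it (no-leader groups are simply absent);
// nodes lists every node so that nodes leading nothing still count as zero.
func pickTransfer(leaders map[uint64]string, nodes []string, threshold int) (Transfer, bool) {
	counts := make(map[string]int, len(nodes))
	for _, n := range nodes {
		counts[n] = 0
	}
	for _, n := range leaders {
		counts[n]++
	}

	// Sort node IDs so ties resolve deterministically (assumed: lowest ID wins).
	ids := make([]string, 0, len(counts))
	for n := range counts {
		ids = append(ids, n)
	}
	sort.Strings(ids)
	if len(ids) < 2 {
		return Transfer{}, false
	}

	source, target := ids[0], ids[0]
	for _, n := range ids[1:] {
		if counts[n] > counts[source] {
			source = n
		}
		if counts[n] < counts[target] {
			target = n
		}
	}

	// Act only when max-minus-min reaches the threshold; a spread of 1 is the
	// unavoidable remainder and must not trigger.
	if source == target || counts[source]-counts[target] < threshold {
		return Transfer{}, false
	}

	// Group-ID tie-break (assumed: lowest group ID led by the source).
	var group uint64
	found := false
	for g, n := range leaders {
		if n == source && (!found || g < group) {
			group, found = g, true
		}
	}
	if !found {
		return Transfer{}, false
	}
	return Transfer{GroupID: group, From: source, To: target}, true
}

func main() {
	leaders := map[uint64]string{1: "n1", 2: "n1", 3: "n1"}
	t, ok := pickTransfer(leaders, []string{"n1", "n2", "n3"}, 2)
	fmt.Printf("%+v %v\n", t, ok)
}
```

Returning at most one transfer per call mirrors the "one transfer per cycle" rule; a real scheduler would check cooldowns before acting on the result.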

Open questions

OQ-1 — count vs. load for v1.
OQ-2 — SQS-refused groups: exclude entirely vs. publish per-node htfifo capability and balance among capable nodes.
OQ-3 — default group: balance first, last, or pinned by default?
OQ-4 — forward path: reuse operator-facing RaftAdmin gRPC vs. a purpose-built internal RPC.
OQ-5 — cleanest in-tree "conf-change pending" signal per group.
OQ-6 — split-job interlock vs. cooldown-only.
OQ-7 — per-node gauge cardinality vs. a single spread scalar.
OQ-8 — default tuning values (threshold 2, cooldowns 30s/10s, interval 30s) vs. derive from election timeout.

Test plan

Doc-only — no code, no tests. The doc itself specifies the future test strategy (pure-policy table tests, rapid property invariants, 3-node convergence / anti-ping-pong integration, kill-switch + leader-change reset, and a Jepsen "no new anomalies with balancer on" run).

Summary by CodeRabbit

Documentation
- Added design proposal for a Leader Balance Scheduler that improves cluster stability through even distribution of group leaderships across nodes. Specification covers observation mechanisms, deterministic balancing policies, safety guardrails, eligibility criteria, operational controls, metrics instrumentation, and a three-phase implementation plan with comprehensive testing strategy.

bootjp · 2026-06-11T04:52:46Z

@claude review

coderabbitai · 2026-06-11T04:52:50Z

📥 Commits

Reviewing files that changed from the base of the PR and between d2e84be and c7ab367.

📒 Files selected for processing (1)

docs/design/2026_06_11_proposed_leader_balance_scheduler.md

📝 Walkthrough

This PR adds a design proposal for an in-process leader-balancing scheduler that distributes Raft-group leaderships across cluster nodes. It specifies observation mechanisms, deterministic balancing policy with hysteresis, safety exclusions, transfer execution via gRPC, operational controls, and a phased implementation plan with testing strategy.

Changes

Leader Balance Scheduler Design Proposal

| Layer / File(s) | Summary |
|---|---|
| Leader Balance Scheduler Design Specification `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` | Complete design proposal specifying scheduler execution context (default-group leader only), local leadership observation, deterministic count-based balancing policy with imbalance thresholds and hysteresis, safety exclusions (conf-changes, lagging targets, SQS refusal/pinning, default-group handling), and transfer execution via per-group `RaftAdmin` gRPC. Includes operational controls (flags, kill switch, metrics, logging), three-PR implementation roadmap (observe → policy+transfer → pinning/hardening), comprehensive testing strategy (unit/property/integration/Jepsen), and future work directions. |

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'docs(design): propose leader balance scheduler' clearly and concisely describes the main change: adding a design proposal document for a leader balance scheduler. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

claude · 2026-06-11T04:52:59Z

Claude finished @bootjp's task in 4m 10s

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md`

Gather context and understand the request
Read the design document
Review against existing architecture and conventions
Post comprehensive review

Overall assessment: Accept with revisions. The design is thorough, well-grounded in the existing codebase, and follows the established conventions from the SQS leadership-refusal and M3 scheduler patterns. Code citations are accurate (verified shard_config.go:61-99, :386-397, main.go:1606-1620, internal/raftengine/etcd/engine.go:1754-1757). The milestone breakdown is clean and independently shippable. A few substantive gaps and clarity issues below warrant attention before moving to implementation.

Substantive issues (should be resolved before PR2)

1. Progress-visibility gap in the forward path (§3.5, §3.4 case 2)

The doc proposes filtering candidate targets by "followers whose Progress.Match is close enough to the leader's commit index." However, Progress is a leader-only view in etcd/raft — when the balancer host is a follower of the group being rebalanced (the case 2 forward path), its local engine has no Progress for that group's voters. The pre-filter described in §3.5 is only achievable in case 1 (the balancer leads the group directly).

For case 2, the implementation must either (a) skip the progress pre-filter and rely on waitForLeadershipTransfer / errLeadershipTransferRejected on the remote leader, or (b) send the target preference in the forwarded RPC and let the remote leader apply the filter. The design should explicitly acknowledge this and state which path is taken in PR2, since it affects how lagging-follower risk is handled for the majority of inter-node transfers.

2. No candidate solution for OQ-5 (conf-change pending signal)

OQ-5 is left fully open, but the implementation path matters: pendingConfigs is an unexported field of internal/raftengine/etcd/engine.go:390 and is not surfaced in the Status struct (engine.go:67-83). The currently exported Status has no PendingConfChange field. Three approaches to surface it:

Option 1: Add PendingConfChange bool to Status (populated from len(e.pendingConfigs) > 0 at the etcd backend level, :3141);
Option 2: Add a dedicated HasPendingConfChange(ctx) bool method to the Admin interface;
Option 3: Use etcd/raft's rawNode.BasicStatus().Config to check PendingConfIndex, although this leaks etcd internals past the engine boundary.

The design should name a preferred option so implementers don't rediscover this gap during PR2. Option 1 (extend Status) is the cleanest and mirrors how LeadTransferee was added.

3. Missing eligibility filter: LeadTransferee != 0

Status.LeadTransferee is already in the exported Status struct (engine.go:82, populated at etcd backend :2908). A non-zero LeadTransferee means a transfer is already in flight on a group the local node leads. The scheduler should skip groups where Status().LeadTransferee != 0 — otherwise it may issue a second TransferLeadershipToServer call while the first is still in waitForLeadershipTransfer, which would hit errLeadershipTransferRejected (:1765-1768) and log a spurious failure. Add this as an explicit skip condition in §3.5 alongside the conf-change check.
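
The two skip conditions from issues 2 and 3 could be combined into one eligibility check. The sketch below assumes Option 1 is adopted: `groupStatus` is a hypothetical, trimmed-down mirror of the engine's Status, `PendingConfChange` is the proposed field that does not exist yet, and the reason labels are illustrative names only.

```go
package main

import "fmt"

// groupStatus is a hypothetical, trimmed-down mirror of the engine's Status.
// LeadTransferee is described in this review as already exported;
// PendingConfChange is the proposed (not yet existing) field from Option 1.
type groupStatus struct {
	LeaderID          uint64
	LeadTransferee    uint64
	PendingConfChange bool
}

// skipReason returns a fixed-enum label explaining why a group is ineligible
// this cycle, or "" when the group may be considered for a transfer.
func skipReason(s groupStatus) string {
	switch {
	case s.LeaderID == 0:
		return "no_leader"
	case s.PendingConfChange:
		return "conf_change_pending"
	case s.LeadTransferee != 0:
		return "transfer_in_flight"
	default:
		return ""
	}
}

func main() {
	fmt.Printf("%q\n", skipReason(groupStatus{LeaderID: 1}))
	fmt.Printf("%q\n", skipReason(groupStatus{LeaderID: 1, LeadTransferee: 2}))
	fmt.Printf("%q\n", skipReason(groupStatus{LeaderID: 1, PendingConfChange: true}))
}
```

Keeping the reasons to a small fixed set would also fit the bounded-cardinality metric labels the PR description calls for.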

Design clarity issues

4. Forward path address — implicit assumption (§3.4)

The forward path dials the source node's group listener address from engine.Configuration(ctx). The Configuration call returns raft voter addresses — the same addresses registered via AddVoter(ctx, id, address, ...) — which are rt.spec.address values. The RaftAdmin gRPC service is also registered at rt.spec.address (:1610). This is an implicit assumption that raft peer address == RaftAdmin listener address. It should be explicitly verified and stated in the design (one sentence), since a future multi-port setup could break it silently.

5. "Strict-spread-decrease" description is slightly imprecise (§3.3)

The guard targetCount + 1 > sourceCount - 1 skips transfers that would invert relative rankings, but passes transfers where both sides end up equal (e.g., moving from spread-2 to spread-0). The description "ensures a transfer never makes things worse or merely shuffles a tie" is close, but "strictly reduces spread" isn't quite right: what the guard actually prevents is post-transfer inversion. Suggest: "skip if the transfer would not reduce the spread, i.e., skip when targetCount + 1 > sourceCount - 1, which would invert the ranking and leave the spread unchanged or larger."

(Aside: confirm the guard direction — > vs >=. With source=2, target=0: 0+1 > 2-1 → 1 > 1 → false → transfer allowed. Post-transfer spread = 0. Correct. With source=2, target=1: 1+1 > 2-1 → 2 > 1 → true → skipped. Correct. The > is right.)
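
The arithmetic in the aside can be checked mechanically. This is a small sketch with hypothetical helper names (`wouldNotImprove`, `pairSpreadAfter`); it only looks at the source/target pair, not the whole cluster.

```go
package main

import "fmt"

// wouldNotImprove is the guard under discussion: it reports whether moving one
// leadership from a node with sourceCount leaders to a node with targetCount
// leaders should be skipped.
func wouldNotImprove(sourceCount, targetCount int) bool {
	return targetCount+1 > sourceCount-1
}

// pairSpreadAfter returns the absolute difference between the two nodes'
// leader counts after the hypothetical transfer.
func pairSpreadAfter(sourceCount, targetCount int) int {
	d := (sourceCount - 1) - (targetCount + 1)
	if d < 0 {
		d = -d
	}
	return d
}

func main() {
	cases := [][2]int{{2, 0}, {2, 1}, {3, 1}, {1, 1}}
	for _, c := range cases {
		fmt.Printf("source=%d target=%d skip=%v pairSpreadAfter=%d\n",
			c[0], c[1], wouldNotImprove(c[0], c[1]), pairSpreadAfter(c[0], c[1]))
	}
}
```

Whenever the guard lets a transfer through, the pair started at least 2 apart, so its spread shrinks by exactly 2.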

6. OQ-3 lacks a recommendation

The doc presents three options (balance first / last / pinned-by-default) but makes no recommendation. Per the design-doc-first workflow, the doc's job is to converge on a decision, not just enumerate options. A lightweight recommendation (e.g., "balance last by default, since it is the only group the balancer itself runs on and a stale cooldown after a transfer-induced leadership change resets all state") gives reviewers something concrete to react to.

Smaller issues

7. leaderbalance_leaders_per_node excludes no-leader groups silently

§3.2 says unknown/no-leader groups are excluded from the count map. If several groups are mid-election simultaneously, the gauge becomes misleading (sums to fewer than the actual group count). A companion leaderbalance_unobservable_groups gauge (count of groups skipped this cycle due to unknown leader) would help operators distinguish "balanced" from "partially observed."
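
A rough sketch of how the observation pass could produce both numbers in one sweep. The `observe` function and the convention that an empty leader ID means "no known leader" are assumptions made for illustration, not the doc's API.

```go
package main

import "fmt"

// observe tallies leaders per node from a per-group leader snapshot. An empty
// leader ID (assumed convention) means the group has no known leader this
// cycle; such groups are excluded from perNode and counted separately.
func observe(groupLeaders map[uint64]string, nodes []string) (perNode map[string]int, unobservable int) {
	perNode = make(map[string]int, len(nodes))
	for _, n := range nodes {
		perNode[n] = 0 // nodes that lead nothing still appear with a zero count
	}
	for _, leader := range groupLeaders {
		if leader == "" {
			unobservable++
			continue
		}
		perNode[leader]++
	}
	return perNode, unobservable
}

func main() {
	snapshot := map[uint64]string{1: "n1", 2: "", 3: "n2", 4: ""}
	perNode, unobservable := observe(snapshot, []string{"n1", "n2", "n3"})
	fmt.Println(perNode, unobservable)
}
```

With the companion count exposed, the per-node gauge values plus the unobservable count add back up to the total number of groups.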

8. PR-doc milestone naming (§4)

The last row of the milestone table calls the lifecycle rename PR "PR-doc", but this is the current PR. Rename to "This PR / PR0" or note that *_proposed_* → *_partial_* is done as part of the PR1 commit (so no separate PR is needed for the first rename).

9. Integration test setup (§5)

"Force all leaderships onto one node — drive transfers there, or restart the others" are two different starting states (clean forced imbalance vs. snapshot/recovery overhead). The convergence test should pick one and state the mechanism explicitly, so the test is deterministic and reproducible across CI runs.

10. Five-lens table row for concurrency — forward-path TOCTOU

The concurrency lens in §6 says "verify the forward path targets the current leader of the group and fails closed (errLeadershipTransferNotLeader) if the leader moved under it." This is good. Worth also explicitly noting the TOCTOU window: the balancer reads Leader().ID at observation time (step 1), then dials that node at transfer time (step 2). If the leader changes between steps 1 and 2, the dial target is stale. The remote handleTransferLeadership correctly rejects with errLeadershipTransferNotLeader (:1754-1757), which the balancer must map to a logged skip (not a fatal error). Make this explicit in the safety lens.
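
One way to express the "logged skip, not a fatal error" mapping is a small classifier over the transfer result. In this sketch `errNotLeader` is a local stand-in for errLeadershipTransferNotLeader, and `errDial`, `outcome`, `classify`, and `wrap` are hypothetical names. Note that a sentinel error does not survive a gRPC hop as a Go value, so on the forwarded path the real code would have to recognize the condition from the returned status instead.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotLeader stands in for errLeadershipTransferNotLeader (local sentinel,
// assumed for this sketch). errDial is an arbitrary unrelated failure.
var (
	errNotLeader = errors.New("leadership transfer: not leader")
	errDial      = errors.New("dial failed")
)

type outcome string

const (
	outcomeOK            outcome = "ok"
	outcomeSkipNotLeader outcome = "skip_not_leader" // stale observation: wait for next cycle
	outcomeFailed        outcome = "failed"
)

// classify maps a transfer result to an outcome. A not-leader rejection means
// the leader moved between observation and dial, so it is a skip, not a fault.
func classify(err error) outcome {
	switch {
	case err == nil:
		return outcomeOK
	case errors.Is(err, errNotLeader):
		return outcomeSkipNotLeader
	default:
		return outcomeFailed
	}
}

// wrap simulates an error that gained context on its way back to the balancer.
func wrap(err error) error {
	return fmt.Errorf("forward transfer: %w", err)
}

func main() {
	for _, err := range []error{nil, wrap(errNotLeader), wrap(errDial)} {
		fmt.Println(classify(err))
	}
}
```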

What's well done

The HLC ceiling-invariant analysis for default-group transfer (§3.5) is correct: the 3s physical window > 1s renewal interval guarantees no timestamp reissue inside the old leader's window — this is the same guarantee that holds across a natural election. Appreciated that this is spelled out explicitly.
The SQS leadership-refusal analysis (§3.5) correctly identifies the ping-pong risk and takes the conservative v1 stance (exclude entirely). OQ-2 is the right follow-on.
monoclock for cooldown deadlines (§3.3) is correct per CLAUDE.md conventions.
The TransferLeadershipToServer vs. TransferLeadership note (§3.4) is important — using targeted form is required for the balance objective.
Milestone PR1 (observe-only, no transfers) is the right first step; it de-risks the observation loop before any transfers are issued.

gemini-code-assist

Code Review

This pull request proposes a design document for a Leader Balance Scheduler to automatically distribute Raft-group leaderships across nodes in elastickv. The feedback highlights three key technical issues to address in the design: the risk of transfer storms or thrashing during default-group leadership changes (recommending a startup warm-up delay), an authentication gap for forwarded gRPC requests when --adminTokenFile is enabled, and a fundamental technical gap where follower nodes lack the replication progress and liveness information of peer followers needed to safely select a transfer target.

gemini-code-assist · 2026-06-11T04:54:19Z

+- The M3 auto-split scheduler is placed there too (sibling doc §4.1, §7.6) — co-locating the leader balancer keeps "one scheduler seat per cluster" rather than electing yet another coordinator.
+- There is exactly one default-group leader at a time, so the scheduler is singleton without extra leader election.
+
+**State reset on election.** All balancer state (per-group cooldown deadlines, the global cooldown deadline, the last observed leader map) is **leader-local, in-memory, not Raft-replicated**. On a default-group leadership change the deposed leader's scheduler goroutine stops (via `RegisterLeaderLossCallback`, `kv/coordinator.go:131`) and the new leader starts with empty state. The worst case of a lost cooldown is a *too-soon* extra transfer (mildly wasteful, self-correcting next cycle), never an unsafe action — so non-replication is the right cost/safety trade (same rationale as M3 §7.6). We state this explicitly so it is not mistaken for an oversight.


Risk of Transfer Storms on Default-Group Leadership Changes

If the default-group leader flaps rapidly (e.g., due to transient network instability or CPU starvation), each newly elected leader will start with an empty cooldown state and might immediately trigger a transfer. This can lead to a "transfer storm" or thrashing of group leaderships across the cluster.

This risk is particularly acute when transferring the leadership of the default group itself. Since the balancer runs on the default-group leader, transferring the default group forces the balancer to restart on a new node with a completely empty cooldown state. Without an initial startup delay, the new balancer might immediately trigger subsequent transfers, leading to thrashing.

Recommendation:
Introduce an initial startup delay or "warm-up" cooldown for the scheduler upon election (e.g., waiting at least one leaderBalanceInterval or a global cooldown before initiating the first transfer). This ensures the cluster has stabilized after a default-group election before the balancer begins making changes.

Already in §3.1 line 75 ("Startup grace period (anti-storm on balancer-host churn)."): on every acquisition of the balancer role the scheduler arms a startup-grace deadline of max(leaderBalanceGlobalCooldown, one leaderBalanceInterval) during which it observes-and-emits-metrics-but-issues-no-transfer, and the deadline is re-armed on every role acquisition so a flapping default leader cannot accumulate enough stable time to fire. The recursive default-group-self-transfer case the recommendation calls out is also covered there. Cycles skipped during the grace window are counted under leaderbalance_skipped_total{reason=startup_grace}.
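
The re-arming behaviour described in this reply might look roughly like the following. The `balancer` struct and its method names are hypothetical; the 30s interval and 10s global cooldown are the default tuning values listed under OQ-8. Timestamps are passed in explicitly so the sketch stays deterministic, whereas the design calls for a monotonic clock.

```go
package main

import (
	"fmt"
	"time"
)

// balancer holds only the fields needed to illustrate the startup grace.
type balancer struct {
	interval       time.Duration
	globalCooldown time.Duration
	graceUntil     time.Time
}

// onRoleAcquired re-arms the grace deadline every time this node becomes the
// balancer host: max(globalCooldown, one interval) from now.
func (b *balancer) onRoleAcquired(now time.Time) {
	grace := b.globalCooldown
	if b.interval > grace {
		grace = b.interval
	}
	b.graceUntil = now.Add(grace)
}

// mayTransfer reports whether the grace window has elapsed. Observation and
// metrics would still run while this returns false.
func (b *balancer) mayTransfer(now time.Time) bool {
	return !now.Before(b.graceUntil)
}

// demo replays a flapping default leader: acquire at 0s, re-acquire at 25s.
func demo() []bool {
	start := time.Unix(0, 0)
	b := &balancer{interval: 30 * time.Second, globalCooldown: 10 * time.Second}
	b.onRoleAcquired(start)
	early := b.mayTransfer(start.Add(20 * time.Second)) // inside the first 30s grace
	b.onRoleAcquired(start.Add(25 * time.Second))       // flap: deadline moves to 55s
	stillEarly := b.mayTransfer(start.Add(40 * time.Second))
	late := b.mayTransfer(start.Add(55 * time.Second))
	return []bool{early, stillEarly, late}
}

func main() {
	fmt.Println(demo())
}
```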

gemini-code-assist · 2026-06-11T04:54:20Z

+The balancer host (the default-group leader) is a member of every group but is, in general, a **follower** of the group it wants to rebalance (indeed, the whole point is to move leadership *off* the over-loaded node, which may or may not be the balancer host). Two sub-cases:
+
+1. **The over-loaded source node is the balancer host itself** (it leads the target group). Then the balancer calls `engine.TransferLeadershipToServer(ctx, target.ID, target.Address)` directly on the **local** runtime's engine for that group — the local engine is the leader, the call is accepted, and the goroutine-offload + idempotency patterns from `main_sqs_leadership_refusal.go:88-95` apply verbatim.
+2. **The over-loaded source node is a *different* node.** The balancer host is a follower of that group and cannot initiate the transfer locally. It **forwards** the request to the source node's leader of that group, using the in-tree per-group `RaftAdmin` gRPC service: dial the source node's group listener (`rt.spec.address` for that group is in the group's `Configuration`, available via `engine.Configuration(ctx)`, `internal/raftengine/engine.go:213-215`) and call `RaftAdmin.TransferLeadership` with `TargetId`/`TargetAddress` set to the chosen target (the same RPC `cmd/raftadmin` uses, `internal/raftadmin/server.go:155-172`, `cmd/raftadmin/main.go:359-378`). The receiving node's engine is the leader of that group, so `handleTransferLeadership` accepts it.


Authentication Gap for Forwarded gRPC Requests

If the cluster is configured with --adminTokenFile (enabling bearer-token authentication on the RaftAdmin gRPC service), any forwarded TransferLeadership requests sent by the balancer host to other nodes will be rejected with Unauthenticated or PermissionDenied unless the balancer's gRPC client presents the correct token.

Recommendation:
The design should explicitly specify how the balancer host obtains and attaches this token (e.g., by sharing the loaded token or the token file path with the balancer's gRPC client/connection cache) to ensure seamless operation in secure environments.

gemini-code-assist · 2026-06-11T04:54:20Z

+A group is **eligible** for a transfer this cycle only if **all** of the following hold; otherwise it is skipped (and counted in a skip metric):
+
+- **No conf-change in flight.** Skip a group whose membership is changing. (Surface via the engine status / configuration read; never transfer mid-membership-change — a transfer racing a conf-change can land on a member about to be removed.) **OQ-5** asks for the cleanest in-tree signal for "conf-change pending" per group.
+- **Target follower is caught up / healthy.** etcd/raft only *completes* a transfer once the transferee's log has caught up to the leader (`waitForLeadershipTransfer` will otherwise observe the transfer abort, `internal/raftengine/etcd/engine.go:1384-1405`, `:1431-1437`). Transferring to a lagging follower stalls writes on that group until the transferee catches up or raft aborts. The scheduler therefore **must not pick a lagging follower as target**: it filters targets to followers whose `Progress.Match` (or the available `Status` liveness/last-contact proxy) is close enough to the leader's commit index, and prefers a follower with recent contact. The targeted transfer also fails fast at submit time if raft drops it (`errLeadershipTransferRejected`, `:1765-1768`), so a bad target surfaces as a logged failure rather than a silent stall.


Technical Gap: Followers Lack Progress and Liveness Information for Target Selection

There is a fundamental technical gap in the proposed target selection mechanism: in etcd/raft, only the active leader of a Raft group maintains the Progress map (tracking Match and Next indices) and receives active heartbeat responses (MsgHeartbeatResp / MsgAppResp) from followers. Followers do not exchange heartbeats with each other and do not have access to the replication progress or liveness of peer followers.

Since the balancer host (the default-group leader) is a follower for most other Raft groups, its local engine for those groups will have an empty/unpopulated Progress map and no liveness/last-contact info for peer followers. Thus, the balancer host cannot locally filter or select a healthy/caught-up target follower for groups it does not lead.

If it attempts to make a targeted transfer (TransferLeadershipToServer) without this information, it risks selecting a lagging, partitioned, or dead follower, which will stall the group's write path (since etcd/raft drops proposals while LeadTransferee is set).

Recommendation:
Consider one of the following approaches to resolve this:

Ordered Preference List: The balancer host passes an ordered list of preferred target nodes (from least-loaded to most-loaded) when forwarding the request to the group's leader. The group's leader then selects the first target in the list that is healthy and caught up according to its local Progress map.

Leader-Delegated Selection: The balancer host queries the group's leader for the list of healthy/caught-up voters before making the decision, or delegates the final target selection to the group's leader within a set of acceptable least-loaded candidates.
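
A minimal sketch of the ordered-preference-list option as it might run on the group leader. Everything here is illustrative: `progress` is a simplified stand-in for the leader's per-follower view rather than etcd/raft's actual Progress type, and `chooseTarget` and the `maxLag` parameter are hypothetical.

```go
package main

import "fmt"

// progress is a hypothetical, simplified stand-in for what the group leader
// knows about one follower.
type progress struct {
	Match  uint64 // highest log index known to be replicated on the follower
	Active bool   // whether the follower has responded recently
}

// chooseTarget walks the balancer's preference order (least-loaded first) and
// returns the first follower that is active and within maxLag of the commit
// index. It runs on the group leader, which is the only node holding progress.
func chooseTarget(prefs []string, prog map[string]progress, commit, maxLag uint64) (string, bool) {
	for _, id := range prefs {
		p, ok := prog[id]
		if !ok || !p.Active {
			continue
		}
		if commit > p.Match && commit-p.Match > maxLag {
			continue // too far behind: transferring here would stall writes
		}
		return id, true
	}
	return "", false
}

func main() {
	prog := map[string]progress{
		"n2": {Match: 99, Active: true},
		"n3": {Match: 50, Active: true},
	}
	// n3 is preferred by load, but it lags; n2 is picked instead.
	fmt.Println(chooseTarget([]string{"n3", "n2"}, prog, 100, 5))
}
```

Returning "no candidate" rather than falling back to an arbitrary follower keeps the failure mode a skipped cycle instead of a stalled group.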

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2e84be826

chatgpt-codex-connector · 2026-06-11T04:54:36Z

+
+### 1.1 The problem
+
+elastickv runs multiple Raft groups in one process (`--raftGroups id=addr,id=addr,…`, `shard_config.go:61-99`; default group is the lowest ID, `shard_config.go:386-397`). Every node is a member of **every** group: `buildShardGroups` iterates the parsed `groups` once per process and constructs a `raftGroupRuntime` per group (`main.go:786-889`), each with its own engine and its own gRPC listener at `rt.spec.address` (`main.go:1606-1620`). Leadership of each group is elected independently by etcd/raft, so there is **no mechanism today that spreads leaderships across nodes**. After a rolling restart, a partition heal, or simply unlucky election timing, one node can end up leading every group while its peers lead none. That node then carries all the leader-only work — write proposals, HLC ceiling renewal, lease reads, OCC timestamp issuance, route-catalog proposes — while the rest of the cluster sits idle. The user's explicit goal: "TiKVのようにリーダー以外のノードに別のRaftグループのリーダーをおいて負荷を均等化したい" — put other groups' leaders on the non-leader nodes so the load is even.


Add the missing multi-node multi-group topology milestone

This premise does not hold for the current server: resolveBootstrapServers rejects --raftBootstrapMembers whenever len(groups) != 1 (main.go:742-748), and the existing M5 script explicitly treats multi-group as a single-process/single-member topology. As written, the later scheduler PRs have no deployable/testable topology where every node is a voter in every group, so TransferLeadershipToServer cannot actually spread Raft-group leaders across nodes until the design first adds per-group peer/bootstrap membership support.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (4)

docs/design/2026_06_11_proposed_leader_balance_scheduler.md (4)
82-87: 💤 Low value

Acknowledge TOCTOU race in forwarded transfer path.

When the balancer observes "node X leads group Y" and forwards a TransferLeadership RPC to X, leadership of Y could change between observation and RPC delivery. The design notes (line 81) that the call will fail with errLeadershipTransferNotLeader but doesn't discuss whether the balancer retries, logs the failure, or simply waits for the next cycle.

Consider adding a brief note to §3.4 that forwarded-path failures due to leadership changes are tolerated (counted in leaderbalance_transfers_failed_total{reason=not_leader}, no retry, next cycle observes fresh state). This clarifies the expected behavior and prevents implementation uncertainty.
102-103: 💤 Low value

Document HLC transfer latency assumption.

The safety argument for default-group balancing states: "the window (3 s) exceeds the renewal interval (1 s)" so a transfer preserves the ceiling invariant. This assumes the transfer completes quickly relative to the window. If a transfer takes multiple seconds (e.g., due to slow target catch-up or network delays), the new leader might start renewing closer to the old ceiling's expiration.

Add a sentence noting that TransferLeadership is designed to complete within one election timeout (typically < 1s when the target is healthy), which is well within the 3s window margin. If etcd/raft can delay transfers longer, document that the ceiling safety still holds because the new leader clamps Next() to max(wall, ceiling) regardless of transfer latency—this clarifies the invariant doesn't depend on transfer speed.
122-122: ⚖️ Poor tradeoff

Consider deferring forwarded transfer to PR3 to reduce PR2 scope.

PR2 includes both the local-leader transfer path (case 1, §3.4) and the forwarded-path gRPC client (case 2). The forwarded path adds gRPC dial/credentials handling, connection caching, remote error handling, and the TOCTOU failure mode—significant incremental complexity.

Splitting the implementation could make PR2 easier to review:

PR2: Policy + loop + local transfer only (balancer acts only when it is the over-loaded source node).

PR3: Forwarded path + SQS/pinning awareness.

This keeps PR2 focused on the core policy/loop correctness and defers the distributed-call complexity. The trade-off is that PR2 alone won't balance clusters where the default-group leader is under-loaded, but it's still independently useful and verifiable. Alternatively, keep the current plan if you prefer end-to-end balancing in PR2—this is a reviewability suggestion, not a correctness issue.
133-133: ⚡ Quick win

Expand integration test scenarios beyond 3-node.

The test strategy specifies a 3-node convergence test but doesn't cover other important cluster/group configurations that stress different policy code paths.

Suggested additional integration test scenarios

Add to the test plan (or list as future integration tests):

2-node cluster with odd number of groups (e.g., 3 groups): spread is at most 1, balancer should remain idle (threshold not met).

5-node cluster with 4 groups: one node will always have 0 leaders; verify balancer stabilizes at ⌈4/5⌉ = 1 leader per node max.

All groups initially on the default-group leader: verify the balancer balances away from itself (self-evacuation case).

Forwarded-path failure injection: simulate leadership change of target group between observation and RPC; verify the failure is logged and the next cycle recovers.

These cases exercise tie-break logic, self-transfer, and forward-path error handling that the 3-node test might not hit.
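To make the idle and convergence expectations above concrete, here is a minimal sketch of a count-based policy function, assuming the rule stated in the PR summary (act only when max minus min is at least 2, source is the node with the most leaders, target is the node with the fewest, ties break on node ID). The name `decideTransfer` and the string node IDs are hypothetical and not taken from the design doc.

```go
package main

import (
	"fmt"
	"sort"
)

// decideTransfer is a hypothetical pure policy function: given leader counts
// per node, it returns the source (most leaders) and target (fewest leaders)
// only when the spread is at least 2. Ties break on the smaller node ID.
func decideTransfer(counts map[string]int) (src, dst string, ok bool) {
	if len(counts) < 2 {
		return "", "", false
	}
	nodes := make([]string, 0, len(counts))
	for n := range counts {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes) // deterministic order gives the node-ID tie-break

	src, dst = nodes[0], nodes[0]
	for _, n := range nodes[1:] {
		if counts[n] > counts[src] {
			src = n
		}
		if counts[n] < counts[dst] {
			dst = n
		}
	}
	if counts[src]-counts[dst] < 2 {
		return "", "", false // a spread of 0 or 1 must not trigger
	}
	return src, dst, true
}

func main() {
	// 2 nodes, 3 groups: spread is 1, so the balancer stays idle.
	_, _, ok := decideTransfer(map[string]int{"n1": 2, "n2": 1})
	fmt.Println(ok)

	// 5 nodes, 4 groups, all led by n1: one move per cycle until no move is left.
	counts := map[string]int{"n1": 4, "n2": 0, "n3": 0, "n4": 0, "n5": 0}
	for {
		src, dst, ok := decideTransfer(counts)
		if !ok {
			break
		}
		counts[src]--
		counts[dst]++
	}
	fmt.Println(counts)
}
```

In the 5-node run the loop stops with four nodes leading one group each and one node leading none, which matches the ⌈4/5⌉ = 1 bound in the second scenario.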
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md` at line 133,
Update the integration test plan to add the suggested scenarios beyond the
3-node demo: add a 2-node cluster test with an odd number of groups (e.g., 3) to
assert the balancer remains idle, a 5-node cluster with 4 groups to assert max
leaders per node is ⌈4/5⌉, a test where all groups start on the default-group
leader to exercise self-evacuation, and a forwarded-path failure injection where
a target group’s leadership changes between observation and RPC to assert proper
logging and recovery; wire these into the same demo harness used by
cmd/server/demo.go and the --leaderBalance toggle so they run under the same
convergence/assertion logic and include assertions that the cluster stabilizes
and does not ping-pong after convergence.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md`:
- Line 97: The design doc leaves "caught up / healthy" vague; update the
scheduler section to specify concrete numeric eligibility rules and proxies:
state that a follower is eligible iff Progress.Match >= leader.CommitIndex -
maxLag (define maxLag in index units, e.g., 10 entries) and/or if Progress isn't
available use Status.LastContact within healthThreshold (expressed in ms and
tied to electionTimeout, e.g., LastContact <= electionTimeout/2); explicitly
state learners are filtered out before submission (handleTransferLeadership will
also reject them) and note that submit-time rejections surface as
errLeadershipTransferRejected so the scheduler still logs failures.

---

Nitpick comments:
In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md`:
- Around line 82-87: Add a short paragraph to §3.4 stating that forwarded
TransferLeadership RPCs (sent via RaftAdmin.TransferLeadership after reading
engine.Configuration) may race with actual leadership changes and can return
errLeadershipTransferNotLeader; these failures must be treated as tolerated
(increment leaderbalance_transfers_failed_total{reason="not_leader"}), do not
trigger immediate retries, and are resolved by the balancer observing fresh
state on the next scheduling cycle rather than retrying the RPC inline.
- Around line 102-103: Add a short clarifying sentence to the HLC transfer
paragraph stating that TransferLeadership is intended to complete within one
election timeout (typically <1s for a healthy target), which is safely inside
hlcPhysicalWindowMs (3s) given hlcRenewalInterval (1s); also note that even if
etcd/raft delays TransferLeadership longer, the HLC ceiling safety still holds
because the new leader’s Next() clamps to max(wall, ceiling), and leader loss
triggers lease.invalidate via RegisterLeaderLossCallback so the first read will
take the LinearizableRead path until the lease warms.

ℹ️ Review info

📥 Commits

Reviewing files that changed from the base of the PR and between ed82233 and d2e84be.

📒 Files selected for processing (1)

docs/design/2026_06_11_proposed_leader_balance_scheduler.md

- P1 (codex): add 1.1a topology prerequisite + PR0 milestone. The premise "every node is a member of every group" does not hold for any deployable multi-node topology: resolveBootstrapServers rejects multi-group bootstrap (main.go:742-748), groupSpec is single-address (shard_config.go:14-17), buildRuntimeForGroup gives every group nil peers (multiraft_runtime.go:246-254), and run-jepsen-m5-local.sh documents single-process single-member-per-group only. Scheduler PRs are blocked on PR0. OQ-9 records bootstrap-extension vs. AddVoter-composition.
- gemini (1): add startup grace period (3.1) after acquiring the balancer role to break the transfer-storm loop when the default-group leader (the balancer host) flaps; --leaderBalanceStartupGrace flag + startup_grace skip metric.
- gemini (2): state RaftAdmin security posture (3.4 + Security lens). Corrected mechanism: --adminTokenFile gates only /Admin/, not /RaftAdmin/ (admin_grpc.go:484,498), so forwarded transfers are NOT token-rejected, but RaftAdmin is unauthenticated; require trusted-network boundary. OQ-11.
- gemini (3): make target catch-up validation the executing group leader's job (3.5): Progress is leader-only and absent from exported Status; balancer passes a preference, the leader (case 1 local / case 2 remote) gates health. OQ-10 for ordered preference list.
- Also: LeadTransferee!=0 skip + OQ-5 PendingConfChange recommendation; updated milestones (PR0), test strategy, five-lens table, lifecycle.

Doc-only; no code/tests.

bootjp · 2026-06-11T05:06:53Z

@claude review

claude · 2026-06-11T05:07:09Z

Claude finished @bootjp's task in 3m 41s.

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md`

Gather context and understand the request
Read the design document
Review against existing architecture and prior review comments
Post comprehensive review

Overall: Accept with revisions. The 8b0e094 update addressed most of the first-round issues — the Progress/ownership analysis (§3.5), startup-grace anti-storm (§3.1), security posture of the forward path (§3.4), TOCTOU documentation in §6, deterministic integration test starting state (§5), and the OQ-5/LeadTransferee eligibility additions. Several items remain open and one is a clear contradiction introduced by the new §1.1a.

Must-fix before marking `_partial_`

1. §1.1a self-contradicts on PR1 blocking

§1.1a closes with: "The scheduler PRs (PR1–PR3) are blocked on PR0 — until a group can have voters on more than one node, the balancer is observe-only and has nothing to transfer."

But the milestone table entry for PR1 says: "Pure observability — shippable even before PR0", and §9 says the *_partial_* rename fires "after the first milestone lands (PR0 topology and/or PR1 observation + metrics)".

The contradiction is factual: PR1 has no transfers and is not blocked on PR0. Change the §1.1a sentence to "The scheduler PRs (PR2–PR3) are blocked on PR0" and let PR1 stand as the unblocked observe-only milestone it already is.

2. "may" should be "should" for case-1 Progress pre-filter (§3.5)

"Case 1 (balancer host leads the group): the balancer does have the live Progress map for that group, so it may pre-filter the preferred target by Progress.Match close to commit index plus recent contact."

"may" makes this optional. But the whole safety argument for not stalling writes is that the executing leader validates catch-up. In case 1 the balancer is the leader and holds the live Progress map, so it is the most accurate place to filter; failing to do so exposes the group to write stalls if a lagging target is selected and raft must wait. Change "may" to "should".

3. Concrete catch-up threshold still undefined for case 1

CodeRabbit flagged this as critical and it remains unresolved. §3.5 describes the case-1 filter as "close to commit index plus recent contact" but gives no number. Without a concrete threshold implementers will each pick something different.

A minimal specification: define maxLag in index units and express it relative to election timeout (e.g., maxLag = raftElectionTimeoutTicks × tickInterval entries, so a follower that is at most one election timeout behind in the log is eligible). Status.LastContact is already in the exported Status struct (internal/raftengine/engine.go:77) — state a bound (e.g., LastContact < electionTimeout / 2) as the liveness gate. Both are constants available at scheduler construction time and can be plumbed as flags (--leaderBalanceMaxTargetLag, --leaderBalanceMaxTargetLastContact). If the preference is to rely entirely on the submit-time rejection from etcd/raft even in case 1, say that explicitly: "case 1 also relies on submit-time rejection; no pre-filter is applied". Either way, name the behavior.

Should resolve before PR2 starts

4. OQ-3 needs a recommendation

OQ-3 asks whether the default group is balanced first, last, or pinned by default. §3.5 correctly notes the transient effects (HLC renewal restart, one lease-read blip). Per the design-doc-first workflow, the doc should converge, not enumerate. A concrete recommendation: balance last by default (after all other groups are at or below ⌈groups/nodes⌉, only then include the default group in the eligible set). This minimizes disruption: the most impactful transient effects (HLC + lease blip) are deferred until the low-cost moves are exhausted, and an operator who wants to pin the default group entirely can use --leaderBalancePinGroups. Add this recommendation to §8 OQ-3 and §3.5.
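One possible reading of the "balance last" rule is sketched below. It assumes the leader counts cover the non-default groups only, are seeded with every node (including nodes that lead nothing), and are compared against ⌈groups/nodes⌉ computed over all groups. The function names and that interpretation are assumptions for illustration, not the doc's wording.

```go
package main

import "fmt"

// ceilDiv returns the ceiling of a/b for positive integers.
func ceilDiv(a, b int) int { return (a + b - 1) / b }

// defaultGroupEligible reports whether the default group may join the
// eligible set. leaders maps node ID -> leader count over the non-default
// groups and must contain every node, including zero-leader nodes.
func defaultGroupEligible(leaders map[string]int, totalGroups int) bool {
	if len(leaders) == 0 {
		return false
	}
	ceiling := ceilDiv(totalGroups, len(leaders))
	for _, c := range leaders {
		if c > ceiling {
			return false // cheaper moves remain; keep the default group out
		}
	}
	return true
}

func main() {
	// 3 nodes, 4 groups in total (default group included): ceiling is 2.
	fmt.Println(defaultGroupEligible(map[string]int{"A": 3, "B": 0, "C": 0}, 4))
	fmt.Println(defaultGroupEligible(map[string]int{"A": 1, "B": 1, "C": 1}, 4))
}
```

The first call reports false because node A still leads three non-default groups; the second reports true once the other groups are spread out.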

5. OQ-9 needs a recommendation

OQ-9 asks bootstrap-extension vs. AddVoter-composition for the PR0 topology. This is the blocking prerequisite for PR2/PR3. The doc presents both options neutrally, but landing PR0 requires choosing one. A concrete recommendation: option (a) — extend the bootstrap/flag surface (lift the len(groups)==1 guard, add per-group peer lists to --raftGroups or a companion flag) is the right first step: it makes the topology first-class at startup and avoids a manual AddVoter dance in every test harness. The AddVoter-composition path remains valid for live topology expansion (membership changes after bootstrap), but not as the gating prerequisite. Add the recommendation to §8 OQ-9.

6. Topological assumption unstated as a deployment constraint

§3.2 says "the default-group leader is itself a member (leader or follower) of every group it co-hosts". The phrase "co-hosts" is doing heavy lifting. The observation loop and the case-2 forward path both silently break if some nodes do not participate in every group (e.g., a 5-node cluster where groups G3–G5 have only a 3-node voter set that excludes the default-group leader). Add one explicit sentence under §2.2 Non-Goals (or §3.2): "The scheduler assumes every node that participates in the default group is also a member (voter or learner) of every other Raft group — i.e., homogeneous group membership across nodes. Deployments with heterogeneous group membership (some groups on a subset of nodes) are out of scope for v1."

Lower-priority / polish

7. leaderbalance_unobservable_groups companion gauge missing

§3.2 says no-leader groups are excluded from the leader-count map for the cycle. If multiple groups are mid-election simultaneously, leaderbalance_leaders_per_node silently sums to fewer than the actual group count, making a "balanced" gauge indistinguishable from "partially observed." Add leaderbalance_unobservable_groups (count of groups skipped this cycle due to unknown/no-leader state) to the gauges in §3.6 alongside the existing leaderbalance_leaders_per_node.

8. PR2 test plan missing a forwarded-path case

§5 describes the convergence integration test (force imbalance → enable --leaderBalance → assert spread ≤ ⌈N/3⌉) but the test setup (all leaders on one node = the default-group leader) means every transfer is case 1 (the balancer host leads the groups). The case-2 forwarded path (balancer is a follower) is untested by this scenario. Add an explicit integration test where the forced imbalance starts on a non-default-group-leader node to exercise the RaftAdmin forward path end-to-end.

9. "PR-doc" label in milestone table

The last milestone row labelled "PR-doc" names lifecycle renames that start with this PR. Rename to "Doc lifecycle (this PR → *_partial_* after PR1; → *_implemented_* after PR3)" to make clear no separate PR is needed for the first rename.

What's well done in this revision

The §3.1 startup-grace analysis is thorough and correctly identifies the self-reinforcing storm loop (default-group transfer → fresh balancer with empty cooldowns → immediate next transfer). The security posture analysis in §3.4 (OQ-11) is exactly right: the existing --adminTokenFile does not gate /RaftAdmin/, so v1 is operationally correct but requires trusted-network placement — this is properly documented, not swept under the rug. The §3.5 "Ownership rule" correctly assigns Progress-map validation to the executing leader in case 2, resolving the most complex technical gap from round 1. The five-lens table is now correctly filled in for concurrency (TOCTOU, LeadTransferee, forward-path failure modes).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8b0e09411c

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T05:08:24Z

+
+### 3.2 Observation (decision: local per-group `State()` / `Leader()`, no polling RPC)
+
+Once the PR0 topology (§1.1a) is in place so each group has voters on more than one node, the default-group leader is itself a member (leader or follower) of every group it co-hosts and can read **the local engine's view of who leads each group** with zero network cost: iterate `runtimes`, and for each `rt.snapshotEngine()` read `State()` and `Leader()` (`internal/raftengine/engine.go:131-138`). (In the single-process multi-group demo this still works, but each group has exactly one voter, so the count map is trivially balanced and no transfer is ever issued — the loop is observe-only there.) This is exactly what `publishLeaderTerms` already does for term (`main.go:2126-2143`); the balancer adds a sibling reader for leader identity. From the per-group `Leader().ID` the scheduler builds the **leader-count map** `nodeID → number of groups this node currently leads`.


Seed zero-leader voters in the count map

When the starting state is the one this scheduler is meant to fix (for example A leads every group and B/C lead none), building the leader-count map only from observed Leader().ID values produces {A: N} and omits B/C, so max(count) - min(count) is 0 and the policy never fires. The observation/policy needs to seed all eligible voter node IDs from each group's Configuration() with count 0 before applying the threshold, otherwise the PR2 convergence test that forces all leaders onto one node cannot pass.


Already in §3.2 line 81 ("Seed the count map with every voter at zero before counting leaders."). The doc explicitly seeds the union of voter IDs from each co-hosted group's Configuration() with count 0 before incrementing from observed leaders, calls out the exact worst-case the comment describes ('node A leads every group, nodes B and C lead none ... max(count) - min(count) is N - N = 0 ... never fires'), and gates the PR2 convergence test (§5) on this seed so the policy fires on the very imbalance the scheduler exists to fix. Voters with Suffrage == "voter" are seeded; learners are excluded because handleTransferLeadership rejects them as targets.
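A minimal sketch of the seeding step discussed in this thread, assuming a hypothetical `groupView` snapshot per group (voter IDs already filtered to voters, plus the observed leader ID, empty when unknown). It also returns the number of groups skipped for lack of a known leader, which is the value a `leaderbalance_unobservable_groups` gauge would report. None of these type or function names come from the codebase.

```go
package main

import "fmt"

// groupView is a hypothetical local snapshot of one Raft group.
type groupView struct {
	Voters []string // voter node IDs from the group's configuration
	Leader string   // observed leader ID; empty when no leader is known
}

// buildLeaderCounts seeds every voter at zero, then counts observed leaders.
// It also returns how many groups had no known leader this cycle.
func buildLeaderCounts(groups []groupView) (counts map[string]int, unobservable int) {
	counts = make(map[string]int)
	// Pass 1: seed, so zero-leader nodes appear in the map.
	for _, g := range groups {
		for _, v := range g.Voters {
			if _, seen := counts[v]; !seen {
				counts[v] = 0
			}
		}
	}
	// Pass 2: count leaders; no-leader groups still contributed their voters.
	for _, g := range groups {
		if g.Leader == "" {
			unobservable++
			continue
		}
		counts[g.Leader]++
	}
	return counts, unobservable
}

func main() {
	voters := []string{"A", "B", "C"}
	groups := []groupView{
		{Voters: voters, Leader: "A"},
		{Voters: voters, Leader: "A"},
		{Voters: voters, Leader: "A"},
	}
	counts, unobservable := buildLeaderCounts(groups)
	fmt.Println(counts, unobservable)
}
```

With the seed, the worst case yields a map containing B and C at zero, so the spread is 3 rather than 0 and the threshold check can fire.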

chatgpt-codex-connector · 2026-06-11T05:08:24Z

+
+A group is **eligible** for a transfer this cycle only if **all** of the following hold; otherwise it is skipped (and counted in a skip metric):
+
+- **No conf-change in flight.** Skip a group whose membership is changing — a transfer racing a conf-change can land on a member about to be removed. The cleanest in-tree signal (resolving **OQ-5**) is to **surface the existing per-group pending-config state on the leader through the exported `Status`**: the etcd backend already tracks `pendingConfigs` (`internal/raftengine/etcd/engine.go:390`, `:563`), but it is not on the public `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`). The recommendation is to add a `PendingConfChange bool` field to `Status` populated from `len(e.pendingConfigs) > 0` (mirroring how `LeadTransferee` was added to the struct), rather than leaking etcd internals via `rawNode.BasicStatus().Config.PendingConfIndex` past the engine boundary or adding a dedicated `Admin` method. This is only observable on the group leader (where conf-changes are proposed), which fits the §3.4 ownership rule — in case 2 the executing remote leader sees it, and the balancer's forwarded transfer will be rejected if the leader has a conf-change pending.


Reject forwarded transfers during pending config changes

This says the case-2 forwarded transfer will be rejected when the remote group leader has a pending conf-change, but the current RaftAdmin.TransferLeadership path just calls TransferLeadershipToServer, and handleTransferLeadership only checks leader state/raft acceptance, not pendingConfigs. Surfacing PendingConfChange in Status is not enough for remote-source groups unless the design also adds a server-side guard or an internal RPC check; otherwise the scheduler can still transfer during a pending Add/Remove, violating this safety exclusion.


bootjp · 2026-06-11T05:35:49Z

Addressed review round 2 in c08e7c8 (doc-only). Per-finding dispositions:

codex P1 — "Seed zero-leader voters in the count map" (ACCEPTED). §3.2 now mandates seeding the leader-count map with count 0 for every voter node ID across all co-hosted groups before counting observed leaders. Without the seed, the very state the scheduler exists to fix (A leads all groups, B/C lead none) builds {A: N}, so max-min == 0 and the policy never fires. Voter set is read locally with zero RPC via rt.snapshotEngine().Configuration(ctx) — verified this returns the local cached ConfState-derived Configuration{Servers []Server} on every node (leader or follower) at internal/raftengine/etcd/engine.go:1242-1250; filter Suffrage == "voter" (SuffrageVoter, internal/raftengine/etcd/peers.go:27); learners excluded. No-leader groups are excluded from counting but their voters still seed the map. PR2 scope + §5 convergence-test note updated.

codex P2 — "Reject forwarded transfers during pending config change" (ACCEPTED). Verified the claim: RaftAdmin.TransferLeadership → TransferLeadershipToServer → handleTransferLeadership checks only RaftState != StateLeader and raft's transfer acceptance — it does not consult pendingConfigs (internal/raftengine/etcd/engine.go:1743-1771). So surfacing PendingConfChange in Status (OQ-5) only helps decision-time, and in case 2 the balancer is a follower whose local Status can't see the remote leader's pendingConfigs. §3.5 now splits enforcement into (1) decision-time Status.PendingConfChange observability (OQ-5) and (2) an execution-time guard on the group leader: handleTransferLeadership rejects with a new errLeadershipTransferConfChangePending when len(e.pendingConfigs) > 0, before rawNode.TransferLeader. New OQ-12 records guard placement (engine-loop guard preferred — also protects the case-1 local call and cmd/raftadmin). Case-2 bullet, five-lens concurrency row, and PR2 scope updated.

claude #1 — §1.1a self-contradicts on PR1 blocking (ACCEPTED). PR1 is observation-only and issues no transfers, so it is not blocked on PR0. §1.1a now reads "Only the transfer-issuing scheduler PRs (PR2–PR3) are blocked on PR0; PR1 ships even against today's single-voter topology." Consistent with the PR1 milestone row and §9 lifecycle.

claude #2 — case-1 "may" → "should" (ACCEPTED). §3.5 case-1: in case 1 the balancer is the leader and holds the live Progress map, so it is the most accurate place to filter and should pre-filter (skipping it risks a write stall on a lagging target).

claude #3 / coderabbit Critical — concrete catch-up threshold (ACCEPTED). §3.5 case-1 now gives implementable numbers: caught-up = Progress.Match >= CommitIndex - maxLag (--leaderBalanceMaxTargetLag = 1024 entries); liveness = Status.LastContact < electionTimeout/2 (--leaderBalanceMaxTargetLastContact); learners excluded. Replied to the coderabbit Critical thread with the resolution.

claude #4 — OQ-3 recommendation (ACCEPTED). Adopted: balance the default group LAST — included in the eligible-source set only once every other group's source node is ≤ ⌈groups/nodes⌉; operator pin via --leaderBalancePinGroups. Added to §3.5 and §8 OQ-3.

claude #5 — OQ-9 recommendation (ACCEPTED). Adopted option (a) — extend the bootstrap/flag surface (lift the len(groups)==1 guard at main.go:742-748, per-group peer lists) as PR0's path; AddVoter/PromoteLearner composition stays valid for live expansion but is not the gating prerequisite. Added to §1.1a and §8 OQ-9.

claude #6 — homogeneous-membership constraint (ACCEPTED). New §2.2 Non-Goal: heterogeneous group membership across nodes is out of scope for v1; the scheduler assumes every default-group participant is a member of every other group.

claude #7 — leaderbalance_unobservable_groups gauge (ACCEPTED). Added to §3.6 gauges so a "balanced" sum is distinguishable from "partially observed."

claude #8 — forwarded-path integration test (ACCEPTED). §5 adds a case-2 scenario: force the imbalance on a non-default-group-leader node, assert the balancer forwards over RaftAdmin and converges, and assert the TOCTOU/rejection mappings.

claude #9 — "PR-doc" label (ACCEPTED). Milestone row relabeled "Doc lifecycle (this PR → *_partial_* after PR1; → *_implemented_* after PR3)".

gemini line 55/86 (already addressed at prior HEAD, no change needed). Transfer-storm → §3.1 startup grace; forwarded-auth gap → §3.4 security posture + OQ-11.

Status stays Proposed. Open Questions renumbered consistently (now OQ-1…OQ-12).

@claude review

claude · 2026-06-11T05:36:14Z

Claude finished @bootjp's task in 5m 28s.

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 3 — `c08e7c8`)

Gather context and understand the request
Read the updated design document (c08e7c8)
Verify code citations and implementation accuracy
Review round 2 disposition completeness
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept with targeted corrections. All 13 round-2 dispositions were applied correctly (claude #1–#9, codex P1/P2, gemini items, coderabbit Critical). The document is now substantially complete and well-grounded in the codebase. Two substantive accuracy issues were introduced by the new §3.5 eligibility thresholds, one of which (the LastContact semantics on the leader) will cause the case-1 liveness gate to be a no-op if implemented as written. These need correction before PR2 begins.

Must-fix before PR2 starts

1. Status.LastContact is always 0 on the leader — case-1 liveness gate is ineffective as specified

§3.5 case-1 specifies: "Recent contact (liveness): Status.LastContact < electionTimeout / 2, using the already-exported LastContact time.Duration field on raftengine.Status (internal/raftengine/engine.go:77)."

This is incorrect. lastContactFor returns 0 unconditionally when state == StateLeader (internal/raftengine/etcd/engine.go:3481-3483):

func lastContactFor(state raftengine.State, ...) time.Duration {
    if state == raftengine.StateLeader {
        return 0
    }
    ...
}

In case 1, the balancer host IS the leader of the group. Its own Status().LastContact is therefore always 0. Since 0 < electionTimeout / 2 is trivially true for any positive timeout, the liveness gate never filters anything — every target node appears "recently contacted" regardless of whether it is partitioned or dead.

LastContact measures the time since a follower last heard from its leader — it is not a per-peer view from the leader. The actual per-peer liveness the case-1 filter needs (Progress.RecentActive, or the most recent MsgHeartbeatResp timestamp for a given target node) is not exposed by the current raftengine.Status. The catch-up guard (Progress.Match >= CommitIndex - maxLag) remains valid since rawNode.Status().Progress is available on the leader, but LastContact should be replaced or the statement corrected. Options:

a. Drop the LastContact criterion for case 1 and rely solely on Progress.Match >= CommitIndex - maxLag; note that Progress.RecentActive (from rawNode.Status().Progress[peerID].RecentActive) is the correct per-peer liveness signal on the leader and is accessible if the filter runs inside the engine. Note this as a PR2 implementation note.
b. Expose a new per-peer contact field — add PeerLastContact map[uint64]time.Duration to Status (leader-only; only populated when state == StateLeader, from rawNode.Status().Progress). This pairs naturally with the PendingConfChange addition under OQ-5 and surfaces the signal the doc intends.

Correct §3.5 case-1 before PR2 implementation picks up the spec literally.

2. handleTransferLeadership conf-change guard requires e.pending.Lock() — event-loop consistency claim is incomplete

§3.5 says: "handleTransferLeadership (which runs on the single-threaded event loop, where e.pendingConfigs is consistent)" — and the OQ-12 rationale repeats this.

This is incomplete. cancelPendingConfig is called from a non-event-loop goroutine when the admin context is cancelled (etcd/engine.go:1366-1367):

case <-ctx.Done():
    e.cancelPendingConfig(req.id)   // ← called from the ctx.Done() select arm, not the event loop
    return adminResult{}, ...

cancelPendingConfig takes e.pending.Lock() to delete from e.pendingConfigs. So pendingConfigs CAN be mutated by a goroutine other than the event loop, and reading len(e.pendingConfigs) in handleTransferLeadership without e.pending.Lock() is a data race under -race. The proposed guard must acquire e.pending.Lock():

// Must hold e.pending before reading pendingConfigs — cancelPendingConfig
// can delete entries from outside the event loop on ctx.Done().
e.pending.Lock()
hasPending := len(e.pendingConfigs) > 0
e.pending.Unlock()
if hasPending {
    req.done <- adminResult{err: errors.WithStack(errLeadershipTransferConfChangePending)}
    return
}

Update §3.5 and OQ-12 to note this requirement. The same applies to refreshStatus() when OQ-5's PendingConfChange bool is added: it must also snapshot len(e.pendingConfigs) under e.pending.Lock() before writing the cached e.status.

Should address before PR2

3. Progress.Match exposure path for case-1 filter is deferred without a concrete resolution

§3.5 case-1 says "requires either reading it inside the engine (case 1 runs on the local leader) or a small accessor — tracked under the same Progress-exposure note as case 2." Two concurrent implementors could reach opposite solutions. Give a concrete recommendation now: the "inside-the-engine" option is preferable — instead of exposing raw per-peer progress, add a thin method like TransferLeadershipToServerIfEligible(ctx, targetID, addr string, maxLag uint64) error that embeds the Progress.Match + Progress.RecentActive check on the executing leader's rawNode. This keeps Progress internal, prevents callers from constructing their own (potentially stale) eligibility decisions, and defers the question of whether Progress.RecentActive is a reliable liveness proxy to the engine boundary. The accessor option adds a public contract that is harder to evolve. Add a recommendation to §3.5 or create OQ-13.
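The check such a method would embed might look like the sketch below. `peerProgress` is a local stand-in whose field names follow etcd/raft's `tracker.Progress` (`Match`, `RecentActive`, `IsLearner`); `eligibleTarget` is a hypothetical helper, and how it would be wired into the engine is left open.

```go
package main

import "fmt"

// peerProgress mirrors the few per-peer fields the check needs. The names
// follow etcd/raft's tracker.Progress, but this type is a local stand-in.
type peerProgress struct {
	Match        uint64
	RecentActive bool
	IsLearner    bool
}

// eligibleTarget reports whether a peer is a safe transfer target: a voter
// that was recently active and is at most maxLag entries behind commitIndex.
func eligibleTarget(p peerProgress, commitIndex, maxLag uint64) bool {
	if p.IsLearner || !p.RecentActive {
		return false
	}
	// Compare the gap rather than computing commitIndex-maxLag, which would
	// underflow the unsigned subtraction while the log is shorter than maxLag.
	return commitIndex <= p.Match || commitIndex-p.Match <= maxLag
}

func main() {
	const maxLag = 1024
	fmt.Println(eligibleTarget(peerProgress{Match: 0, RecentActive: true}, 500, maxLag))
	fmt.Println(eligibleTarget(peerProgress{Match: 100, RecentActive: true}, 5000, maxLag))
	fmt.Println(eligibleTarget(peerProgress{Match: 5000, RecentActive: false}, 5000, maxLag))
}
```

The first case is the one a literal `Match >= CommitIndex - maxLag` on `uint64` gets wrong: with a commit index of 500 the subtraction wraps around and every follower looks hopelessly behind.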

4. maxLag = 1024 entries rationale is throughput-dependent

The doc says 1024 entries is "roughly the amount a healthy follower can fall behind within one election timeout under normal append throughput." This derivation is only correct at a specific throughput. Under high write load (say 100k TPS with tiny entries), 1024 entries is a tiny fraction of one election timeout; the gate would reject almost all targets. State instead: "1024 is a conservative default that keeps transfer windows short at moderate throughputs; operators running high-throughput workloads should tune --leaderBalanceMaxTargetLag to raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg entries." The flag exists — just explain the correct derivation for large deployments.

What's well done in this revision

Voter-map seeding (§3.2): the analysis is correct and complete — builds seed from Configuration(ctx) on every node (leader or follower), filters by Suffrage == SuffrageVoter, seeds at zero before incrementing. The consequence (without the seed, the PR2 convergence test cannot pass) is spelled out clearly. Verified against internal/raftengine/etcd/engine.go:1242-1250 and internal/raftengine/etcd/peers.go:27.
Execution-time conf-change guard (OQ-12): the architecture is correct — decision-time Status.PendingConfChange is insufficient for case 2 because the balancer's follower-local view doesn't see the remote leader's pendingConfigs. The two-layer enforcement (decision-time for case 1 observability + execution-time guard in handleTransferLeadership) is the right design. The lock issue above is an implementation detail, not a design flaw.
OQ-3 "balance last" recommendation: well-reasoned. Deferring the default group avoids the recursive balancer-host-churn risk (§3.1) until all cheaper moves are exhausted. The condition (every other group's source node at or below ⌈groups/nodes⌉) is concrete and implementable.
OQ-9 recommendation: option (a) (extend bootstrap flag surface) correctly diagnoses why AddVoter composition is insufficient as the gating prerequisite (every test harness would need a per-test dance). The len(groups)==1 guard location (main.go:742-748) is accurately cited.
Forward-path TOCTOU: the new case-2 integration test scenario (force imbalance on a non-default-group-leader, assert RaftAdmin forward path is exercised and TOCTOU rejections are logged skips) closes the test-coverage gap from round 1.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c08e7c8f55

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T05:36:56Z

+      - **Recent contact (liveness):** `Status.LastContact < electionTimeout / 2`, using the already-exported `LastContact time.Duration` field on `raftengine.Status` (`internal/raftengine/engine.go:77`). Half an election timeout is the standard liveness margin: a follower the leader has heard from within that window is very unlikely to be partitioned or dead, so the transfer will not stall on an unreachable transferee. The bound is operator-tunable via `--leaderBalanceMaxTargetLastContact` (default `electionTimeout / 2`).
+      - **Learners excluded** before submission (and `handleTransferLeadership` rejects them as "not a voter" as a backstop, `internal/raftengine/etcd/engine.go:1765-1768`).
+      Both thresholds are constants available at scheduler construction time and are plumbed as the two flags above. If a future revision prefers to lean entirely on the submit-time rejection even in case 1, that must be stated explicitly; v1 applies the pre-filter in case 1.
+    - **Case 2 (forwarded to a remote leader):** the balancer cannot pre-filter target catch-up *or* see the remote leader's pending conf-change locally — both are enforced on the executing leader. It forwards its chosen target (and, in PR3, optionally an ordered preference list) to the source-group leader; the executing leader rejects (1) a transfer racing a membership change via the conf-change guard of the "No conf-change in flight" bullet above (`errLeadershipTransferConfChangePending`), and (2) an un-caught-up target at submit time (`errLeadershipTransferRejected` when raft drops the transfer because the target has no/insufficient progress, `:1765-1768`); `waitForLeadershipTransfer` fails closed if the transfer later aborts (`:1384-1405`, `:1431-1437`). The balancer maps these outcomes to a logged skip + `leaderbalance_skipped_total{reason=conf_change}` / `leaderbalance_transfers_failed_total{reason=rejected|aborted}` and tries again next cycle.


Add a real remote catch-up guard before transfer

In the forwarded case where the least-loaded target is a lagging voter, this relies on TransferLeadershipToServer returning errLeadershipTransferRejected for insufficient progress, but etcd/raft v3.6 accepts a voter with a Progress entry, sets leadTransferee, and only sends append traffic until the target catches up or times out. Because PR2 only plans Progress exposure for the case-1 local filter, following this design can stall writes on the remote-led group for an election timeout whenever the balancer picks a stale follower; the executing leader needs an explicit Progress/health check before calling TransferLeader.

chatgpt-codex-connector · 2026-06-11T05:36:56Z

+  - **Concretely:**
+    - **Case 1 (balancer host leads the group):** the balancer *does* have the live Progress map for that group, so it **should** pre-filter the preferred target before calling `TransferLeadershipToServer` locally — in case 1 the balancer *is* the leader, so it is the most accurate place to filter, and skipping the filter would expose the group to a write stall if a lagging target were chosen and raft had to wait for it to catch up. The concrete eligibility predicate (no longer "close enough" hand-waving):
+      - **Caught up:** `target.Progress.Match >= leader.CommitIndex - maxLag`, where **`maxLag` is in log-entry units** with default **`--leaderBalanceMaxTargetLag = 1024` entries**. Rationale: this is roughly the amount a healthy follower can fall behind within one election timeout under normal append throughput, so a target at most one election-timeout's worth of log behind is still able to complete the transfer quickly once it becomes leader-transferee; tighter values risk never finding an eligible target under steady write load, looser values risk a longer write stall on the new leader's group while raft waits. `Progress.Match` is the leader-side per-peer matched index already maintained by etcd/raft (used in-tree by `promoteLearner`, `internal/raftengine/etcd/engine.go:1679`); exposing it to the case-1 filter requires either reading it inside the engine (case 1 runs on the local leader) or a small accessor — tracked under the same Progress-exposure note as case 2.
+      - **Recent contact (liveness):** `Status.LastContact < electionTimeout / 2`, using the already-exported `LastContact time.Duration` field on `raftengine.Status` (`internal/raftengine/engine.go:77`). Half an election timeout is the standard liveness margin: a follower the leader has heard from within that window is very unlikely to be partitioned or dead, so the transfer will not stall on an unreachable transferee. The bound is operator-tunable via `--leaderBalanceMaxTargetLastContact` (default `electionTimeout / 2`).


Do not use leader Status.LastContact for target liveness

For the local-leader case, this predicate cannot identify whether the target follower is recently reachable: Status.LastContact is a single field on the local engine, and lastContactFor returns 0 whenever that local engine is StateLeader. In a case-1 transfer the balancer is reading the leader's status, so this liveness check always passes even if the chosen target is partitioned; the design should instead expose per-peer leader-side activity/Progress or drop this as a claimed filter.

- codex P1: etcd/raft v3.6 accepts a transfer to any lagging voter (sets leadTransferee, pauses proposals) rather than rejecting it; add an engine-side catch-up guard (Progress.Match >= Commit - maxLag) inside handleTransferLeadership before rawNode.TransferLeader, exposed via a new TransferLeadershipToServerIfEligible method used by BOTH case 1 (local) and case 2 (remote leader). Progress exposure lands at the guard in PR2.
- codex P2 / claude #1: Status.LastContact is 0 on a leader (lastContactFor returns 0 for StateLeader) and tracks only this node's contact with ITS leader; drop it as the liveness gate and use leader-side per-peer Progress.RecentActive. Remove --leaderBalanceMaxTargetLastContact.
- claude #2: the conf-change guard must read len(e.pendingConfigs) under e.pending.Lock() (cancelPendingConfig mutates off the event loop); same for the OQ-5 PendingConfChange Status snapshot.
- claude #3: resolve the Progress-exposure hand-wave → engine method, not exported Progress (new OQ-13).
- claude #4: maxLag=1024 rationale corrected to throughput-dependent with a tuning formula.

Updated §1.3, §3.4, §3.5, §3.6, §4 (PR2), §6, OQ-5/10/12 and added OQ-13 consistently. Status stays Proposed. Doc-only.

bootjp · 2026-06-11T05:49:01Z

Round 3 addressed — pushed ef5dbd9b (doc-only). Dispositions below; every code claim re-verified at file:line against the in-tree go.etcd.io/raft/v3 v3.6.0 and internal/raftengine/etcd/engine.go.

codex P1 — "Add a real remote catch-up guard before transfer" — ACCEPTED (correct)

Verified: etcd/raft v3.6 (*raft).Step on MsgTransferLeader (raft.go:1631-1662) returns early only for a learner (:1632-1634), self (:1647-1649), or an in-progress transfer to the same node (:1639-1642). For any other voter — including a far-behind one — it sets r.leadTransferee (:1655) and sendAppends (:1660), and while leadTransferee != None the leader drops every proposal with ErrProposalDropped (raft.go:1296-1299). So a lagging target stalls writes for up to the transfer timeout. The in-tree handleTransferLeadership rejection (engine.go:1765-1768) fires only when raft silently dropped the request (learner/self/no-Progress) — not for a merely-behind voter (its LeadTransferee == target check passes). handleTransferLeadership/resolveTransferTarget inspect no Progress.Match (only handlePromoteLearner does, :1674-1688).

Fix: the executing group leader applies the §3.5 catch-up predicate (Progress.Match >= Commit - maxLag + RecentActive) inside handleTransferLeadership before rawNode.TransferLeader, via a new gated method TransferLeadershipToServerIfEligible(ctx, id, addr, maxLag) (mirrors the existing handlePromoteLearner Progress.Match gate), returning a new errLeadershipTransferTargetNotCaughtUp. This guard runs in both case 1 (local engine) and case 2 (remote leader, via the forwarded RaftAdmin call routed to the gated form) — so Progress exposure lands at the guard in PR2, not as a case-1-only local pre-filter. Updated §1.3, §3.4, §3.5, §4 (PR2), §6, OQ-10/OQ-12, added OQ-13.
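
As a rough illustration of the catch-up predicate named in the fix, the sketch below evaluates `Progress.Match >= Commit - maxLag` together with `RecentActive`. The `peerProgress` struct and the `eligibleTransferTarget` function are hypothetical stand-ins, not the in-tree engine code. One detail worth making explicit: the indexes are unsigned, so a literal `Commit - maxLag` wraps around when `Commit < maxLag`; the sketch treats that case as "lag bound cannot be violated".

```go
package main

import "fmt"

// peerProgress is a hypothetical stand-in for the two leader-side fields the
// guard reads from a peer's raft Progress entry (Match and RecentActive).
type peerProgress struct {
	Match        uint64
	RecentActive bool
}

// eligibleTransferTarget reports whether a target may receive leadership.
// It evaluates Match >= commit - maxLag without unsigned underflow: when
// commit <= maxLag no peer can be more than maxLag behind, so only the
// liveness signal decides.
func eligibleTransferTarget(p peerProgress, commit, maxLag uint64) bool {
	if !p.RecentActive {
		return false // leader has not heard from this peer recently
	}
	if commit <= maxLag {
		return true
	}
	return p.Match >= commit-maxLag
}

func main() {
	const maxLag = 1024 // the proposed default, in log entries

	// 1000 entries behind a commit index of 10000: within the bound.
	fmt.Println(eligibleTransferTarget(peerProgress{Match: 9000, RecentActive: true}, 10000, maxLag))
	// 2000 entries behind: rejected before any transfer is attempted.
	fmt.Println(eligibleTransferTarget(peerProgress{Match: 8000, RecentActive: true}, 10000, maxLag))
	// Fully caught up but not recently active: rejected.
	fmt.Println(eligibleTransferTarget(peerProgress{Match: 10000, RecentActive: false}, 10000, maxLag))
	// Young log (commit < maxLag): the subtraction is never evaluated.
	fmt.Println(eligibleTransferTarget(peerProgress{Match: 0, RecentActive: true}, 500, maxLag))
}
```

Whether the real guard should accept a target in the young-log case is a design choice the doc does not spell out; the sketch only shows that the subtraction needs a bounds check.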

codex P2 — "Do not use leader Status.LastContact for target liveness" — ACCEPTED (correct)

Verified: lastContactFor returns 0 when state == StateLeader (engine.go:3481-3483), and it derives from a single field (e.lastLeaderContactFrom / e.lastLeaderContactAt, recordLeaderContact :3540-3546) describing this follower's contact with ITS leader — there is no per-peer leader→follower tracking. A case-1 balancer runs on the leader, so its own Status().LastContact is always 0 and 0 < electionTimeout/2 is trivially true even for a partitioned target.

Fix: drop Status.LastContact as the liveness gate; use leader-side per-peer Progress.RecentActive (go.etcd.io/raft/v3 tracker/progress.go:87-91), which lives in the same leader-only Status.Progress map as Match. Removed the --leaderBalanceMaxTargetLastContact flag. Updated §3.5 and §3.6.
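
To make the failure mode concrete, here is a minimal model of the removed gate. The `lastContactFor` below is a simplified re-implementation of the behaviour described above (leader reports 0), not the in-tree function, and the one-second election timeout is an assumed value.

```go
package main

import (
	"fmt"
	"time"
)

type raftState int

const (
	stateFollower raftState = iota
	stateLeader
)

// assumedElectionTimeout is an illustrative value, not a project default.
const assumedElectionTimeout = time.Second

// lastContactFor models the described behaviour: a leader always reports 0,
// a follower reports the time since it last heard from its own leader.
func lastContactFor(state raftState, sinceLeaderContact time.Duration) time.Duration {
	if state == stateLeader {
		return 0
	}
	return sinceLeaderContact
}

// oldLivenessGate is the removed predicate: LastContact < electionTimeout/2.
func oldLivenessGate(lastContact, electionTimeout time.Duration) bool {
	return lastContact < electionTimeout/2
}

func main() {
	// Case 1: the balancer reads its own (leader) status. The target's
	// reachability never enters the computation, so the gate always passes.
	onLeader := lastContactFor(stateLeader, 10*assumedElectionTimeout)
	fmt.Println("gate on leader passes:", oldLivenessGate(onLeader, assumedElectionTimeout))

	// The same field on a follower measures that follower's contact with its
	// leader, which is a different question from target liveness.
	onFollower := lastContactFor(stateFollower, 10*assumedElectionTimeout)
	fmt.Println("gate on stale follower passes:", oldLivenessGate(onFollower, assumedElectionTimeout))
}
```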

claude review round 3 (job finished, folded in)

#1 (LastContact ineffective on leader): same root cause as codex P2; converged into one fix above.
#2 (conf-change guard needs e.pending.Lock()): ACCEPTED. Verified cancelPendingConfig (:3158-3165) takes e.pending.Lock() and is called from a non-event-loop goroutine (the ctx.Done() arm of submitAdminEx, :1366-1367), so reading len(e.pendingConfigs) on the event loop without the lock races under -race. §3.5, OQ-5, and OQ-12 now require the read under e.pending.Lock() (guard + the Status.PendingConfChange snapshot).
#3 (concrete Progress-exposure path): ACCEPTED. Resolved as the engine method TransferLeadershipToServerIfEligible (keep Progress internal, no exported map); recorded as OQ-13 and threaded through §3.5/§4.
#4 (maxLag=1024 throughput-dependent): ACCEPTED. §3.5 now states 1024 is a conservative moderate-throughput default and gives the high-throughput tuning formula (raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg).
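
The locking requirement in the conf-change guard item above can be sketched as follows. The `engine` type is a hypothetical miniature with only the fields involved, and modelling `e.pending` as a plain `sync.Mutex` is an assumption; the point is only that the length read takes the same lock as the off-loop delete.

```go
package main

import (
	"fmt"
	"sync"
)

// engine is a hypothetical miniature of the state involved: a map of pending
// conf-change requests guarded by a mutex named pending.
type engine struct {
	pending        sync.Mutex
	pendingConfigs map[uint64]struct{}
}

// storePendingConfig records an in-flight conf-change request.
func (e *engine) storePendingConfig(id uint64) {
	e.pending.Lock()
	defer e.pending.Unlock()
	e.pendingConfigs[id] = struct{}{}
}

// cancelPendingConfig removes a request; in the design it can run on a
// goroutine other than the event loop.
func (e *engine) cancelPendingConfig(id uint64) {
	e.pending.Lock()
	defer e.pending.Unlock()
	delete(e.pendingConfigs, id)
}

// hasPendingConfChange snapshots the map length under the lock, so it is
// safe to call while cancelPendingConfig runs concurrently.
func (e *engine) hasPendingConfChange() bool {
	e.pending.Lock()
	defer e.pending.Unlock()
	return len(e.pendingConfigs) > 0
}

func main() {
	e := &engine{pendingConfigs: make(map[uint64]struct{})}
	e.storePendingConfig(7)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // models the cancellation arm running off the event loop
		defer wg.Done()
		e.cancelPendingConfig(7)
	}()

	_ = e.hasPendingConfChange() // may see either state, but never races
	wg.Wait()
	fmt.Println("pending after cancel:", e.hasPendingConfChange())
}
```

Running the sketch with `go run -race` should report no data race; dropping the lock from `hasPendingConfChange` is the unguarded read the review warns about.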

No findings rebutted. Status: Proposed unchanged; doc-only, no Go tests.

@claude review

claude · 2026-06-11T05:49:13Z

Claude finished @bootjp's task in 7m 10s

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 4 — `ef5dbd9`)

Gather context and understand the request
Read the updated design document (ef5dbd9)
Verify code citations in the codebase
Check round 3 disposition correctness
Look for new issues introduced by revisions
Post comprehensive review

Overall: Accept with one targeted fix. All round-3 dispositions were applied correctly and the code citations check out. One substantive gap was introduced by the round-3 additions: the maxLag parameter carried by TransferLeadershipToServerIfEligible has no wire representation in the forwarded-path proto, leaving the case-2 forwarded path without a concrete routing mechanism. This is a must-fix before PR2 can ship the forwarded-path catch-up gate. Everything else is accurate and implementation-ready.

Must-fix before PR2 ships the forwarded-path `IfEligible` routing

1. maxLag has no wire representation in the case-2 forwarded path — OQ-4 must close before PR2

The design introduces TransferLeadershipToServerIfEligible(ctx, id, addr, maxLag) as the engine method that applies the Progress.Match >= Commit - maxLag + Progress.RecentActive guard. For case 2 (forwarded path), this guard must run on the remote source-group leader. But the current RaftAdminTransferLeadershipRequest proto carries only target_id and target_address (confirmed: proto/service.proto:248-251); the server handler maps to un-gated TransferLeadershipToServer (internal/raftadmin/server.go:163). There is no max_lag field on the wire.

Without resolving this, PR2 can only achieve case-2 forwarding via one of:

Un-gated path (safety regression): forward to un-gated TransferLeadershipToServer on the remote leader — no catch-up check, same lagging-target stall risk that codex P1 identified and PR2 is supposed to fix. This contradicts the PR2 scope item: "forwarded path (case 2, single preferred target, leader-validated, §3.5)."
A proto change (requires OQ-4 resolution first): any form of the IfEligible routing requires either extending the existing proto or a new RPC — both require OQ-4 to be closed before the proto/interface work can start.

OQ-4 is recorded as open, but PR2 claims to include the leader-validated forwarded path. These two facts contradict each other. OQ-4 must be resolved (or PR2's forwarded-path scope narrowed) before implementation starts.

Concrete recommendation to add to OQ-4 and §3.4:

Extend RaftAdminTransferLeadershipRequest with an optional uint64 max_lag = 3 field (proto3, backward-compatible: absent/0 means "un-gated"). When max_lag > 0, the server handler routes to TransferLeadershipToServerIfEligible; when 0 (the existing default for all cmd/raftadmin calls), it routes to the un-gated TransferLeadershipToServer. The balancer's case-2 forward client always populates max_lag = --leaderBalanceMaxTargetLag. This keeps cmd/raftadmin ungated for operator use (disaster-recovery force-transfer to a lagging follower remains possible by passing 0 or leaving the field absent), avoids a new RPC method, and avoids the implicit assumption that all cluster nodes share the same --leaderBalanceMaxTargetLag value.

If a different option is preferred (e.g., remote applies its own configured maxLag, or a new RaftAdmin.TransferLeadershipIfEligible RPC), state it in OQ-4 and note the proto/interface implications. What matters is that OQ-4 has a concrete decision before PR2.
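
A minimal sketch of the routing rule in the recommendation above, assuming the proto extension is adopted. `transferRequest` is a hand-written stand-in for the generated request type, the `admin` interface is a hypothetical two-method subset of the engine interface, and the un-gated method's signature is assumed to match the gated one minus `maxLag`.

```go
package main

import (
	"context"
	"fmt"
)

// transferRequest is a hypothetical stand-in for the request message with
// the proposed third field. MaxLag == 0 models "absent", meaning un-gated.
type transferRequest struct {
	TargetID      string
	TargetAddress string
	MaxLag        uint64
}

// admin is a hypothetical subset of the engine's admin surface.
type admin interface {
	TransferLeadershipToServer(ctx context.Context, id, addr string) error
	TransferLeadershipToServerIfEligible(ctx context.Context, id, addr string, maxLag uint64) error
}

// routeTransfer picks the gated form only when the caller supplied max_lag.
func routeTransfer(ctx context.Context, a admin, req transferRequest) error {
	if req.MaxLag > 0 {
		return a.TransferLeadershipToServerIfEligible(ctx, req.TargetID, req.TargetAddress, req.MaxLag)
	}
	return a.TransferLeadershipToServer(ctx, req.TargetID, req.TargetAddress)
}

// recordingAdmin is a test double that remembers which method was reached.
type recordingAdmin struct{ called string }

func (r *recordingAdmin) TransferLeadershipToServer(context.Context, string, string) error {
	r.called = "ungated"
	return nil
}

func (r *recordingAdmin) TransferLeadershipToServerIfEligible(context.Context, string, string, uint64) error {
	r.called = "gated"
	return nil
}

// routedMethod reports which engine method a request with the given max_lag
// would reach. The target ID and address are placeholder values.
func routedMethod(maxLag uint64) string {
	rec := &recordingAdmin{}
	req := transferRequest{TargetID: "n2", TargetAddress: "127.0.0.1:50052", MaxLag: maxLag}
	if err := routeTransfer(context.Background(), rec, req); err != nil {
		return "error"
	}
	return rec.called
}

func main() {
	fmt.Println("operator call (field absent):", routedMethod(0))
	fmt.Println("balancer call (max_lag=1024):", routedMethod(1024))
}
```

Because the gate is selected per request, the sender's `--leaderBalanceMaxTargetLag` applies and the receiving node needs no matching configuration.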

Should resolve to avoid implementation ambiguity

2. "Changing the existing handler" sub-option silently gates cmd/raftadmin

§3.5 offers two sub-options for routing the case-2 forwarded call to the gated form: "(a) changing the existing handler to apply the guard" or "(b) via the OQ-4 purpose-built internal RPC carrying maxLag." The design separately says "The pre-existing TransferLeadershipToServer (no catch-up gate) stays for the manual operator path (cmd/raftadmin)." These are contradictory if option (a) applies the guard unconditionally — a cmd/raftadmin leadership_transfer_to_server call would also get the catch-up gate, blocking a legitimate operator transfer to a lagging follower (e.g. disaster recovery: force leadership to the only surviving node, which may be behind). If the proto-extension (finding #1 above) is adopted, this tension is resolved automatically (0 = ungated for operators, non-zero = gated for the balancer). If any other option is chosen, the cmd/raftadmin behavioral impact must be stated explicitly in §3.4 rather than left implicit.

What's well done in this revision

lastContactFor/leader-zero analysis (codex P2 / claude #1): verified against :3481-3483; the StateLeader → return 0 branch is exactly as described. The substitution of Progress.RecentActive (leader-side per-peer signal) is correct and avoids the need for any new flag.
cancelPendingConfig race / e.pending.Lock() analysis (claude #2): verified against :1366-1367 (non-event-loop ctx.Done() arm calls cancelPendingConfig) and :3158-3165 (cancelPendingConfig takes e.pending.Lock()). The analysis is precise: the event loop alone is not sufficient for pendingConfigs consistency, and the lock requirement correctly propagates to both the handleTransferLeadership guard AND the refreshStatus/Status.PendingConfChange snapshot path (OQ-5).
TransferLeadershipToServerIfEligible engine method (OQ-13): the design mirrors handlePromoteLearner's in-engine Progress.Match gate (:1674-1688) exactly — same goroutine, same rawNode.Status().Progress access pattern, no exported map. The method is the correct abstraction.
maxLag = 1024 tuning note (claude #4): the correction from "one election timeout of log" to "conservative at moderate throughput; scale toward raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg for high-throughput deployments" is accurate and avoids misleading implementers who operate under high write load.
Ownership rule (exhaustive, §3.5): "target catch-up and liveness validation is the responsibility of the node that executes the transfer — which is always the group leader" — the two sub-cases (case 1 local, case 2 remote) now consistently route through the engine-side guard rather than splitting local-pre-filter vs. submit-time-rejection.
leaderbalance_unobservable_groups companion gauge (§3.6): correctly distinguishes a balanced cluster from a partially-observed one. The "voters still seed the map even if the group is excluded from counting" rule (§3.2) is the correct complement.
Milestone table (§4) cleanup: "Doc lifecycle (this PR → *_partial_* after PR1; → *_implemented_* after PR3)" correctly disambiguates that the first rename happens from this PR, not a separate PR.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ef5dbd9b9d

chatgpt-codex-connector · 2026-06-11T05:50:56Z

+
+- **No conf-change in flight — enforced on the executing leader, not only at the balancer's decision time.** Skip a group whose membership is changing — a transfer racing a conf-change can land on a member about to be removed. Two distinct enforcement points are required, because the balancer's decision-time read and the transfer's execution can happen on different nodes (case 2, §3.4):
+  - **Decision-time observability (resolves OQ-5).** Surface the existing per-group pending-config state through the exported `Status`: the etcd backend already tracks `pendingConfigs` (`map[uint64]adminRequest`, guarded by `e.pending`, `internal/raftengine/etcd/engine.go:390`, `:563`, populated in `storePendingConfig` `:3138`), but it is not on the public `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`). Add a `PendingConfChange bool` field to `Status` populated from `len(e.pendingConfigs) > 0` (mirroring how `LeadTransferee` was added to the struct), rather than leaking etcd internals via `rawNode.BasicStatus().Config.PendingConfIndex` past the engine boundary or adding a dedicated `Admin` method. This lets the balancer skip a group whose leader it *can* see has a pending conf-change. In **case 1** (the balancer host leads the group) this read is local and authoritative.
+  - **Execution-time guard on the group leader (required for case 2).** Decision-time `Status` is **not sufficient for the forwarded path**: in case 2 the balancer is a *follower* of the target group, so its local `Status().PendingConfChange` reflects only the local node and not the remote leader's `pendingConfigs`. And the current `RaftAdmin.TransferLeadership → TransferLeadershipToServer → handleTransferLeadership` path checks only leader state and raft's transfer acceptance (`internal/raftengine/etcd/engine.go:1743-1771`) — it does **not** consult `pendingConfigs`. Surfacing the field in `Status` alone therefore does not stop a forwarded transfer from racing an Add/Remove on the remote leader. PR2 must add a **server-side guard on the executing leader**: `handleTransferLeadership` rejects the transfer with a new `errLeadershipTransferConfChangePending` when `len(e.pendingConfigs) > 0`, before calling `rawNode.TransferLeader`. **The read must acquire `e.pending.Lock()` (claude round-3 #2):** running on the event loop is *not* by itself enough to make `e.pendingConfigs` consistent for a reader, because `cancelPendingConfig` deletes entries under `e.pending.Lock()` from a **non-event-loop goroutine** — the `ctx.Done()` arm of `submitAdminEx` (`internal/raftengine/etcd/engine.go:1366-1367`, `cancelPendingConfig` at `:3158-3165`). Reading `len(e.pendingConfigs)` without the lock is a data race under `-race`. The guard therefore snapshots `hasPending := len(e.pendingConfigs) > 0` under `e.pending.Lock()` and rejects when true. The same locking applies to the `PendingConfChange bool` field on `Status` (OQ-5): the `refreshStatus`/`Status()` snapshot path must read `len(e.pendingConfigs)` under `e.pending.Lock()` before writing the cached `e.status`. The balancer maps the rejection to `leaderbalance_skipped_total{reason=conf_change}` and retries next cycle. 
This makes the conf-change exclusion enforceable regardless of which node executes the transfer, which fits the §3.4 ownership rule (the executing leader owns the safety gate). **OQ-12** records the guard placement (in `handleTransferLeadership` vs. a check in the `RaftAdmin.TransferLeadership` server wrapper).


Check raft's pending config, not request-local state

When a conf-change is proposed and leadership changes before that entry is applied, the new leader will not have the old leader's pendingConfigs entry because this map is populated only by the local admin request and resolved by the encoded request id. etcd/raft still carries the pending conf-change across the new leader (pendingConfIndex is set conservatively on leadership), so this guard would allow a transfer during exactly the in-flight membership change it is meant to exclude. The execution-time check needs to inspect raft's pending/unapplied conf-change state inside the engine, not only len(e.pendingConfigs).
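
One possible shape for such a check, sketched under assumptions: etcd/raft itself treats a conf-change as still pending while its internal pending-conf index is greater than the applied index, so a guard could combine that comparison with the request-local map. The `raftView` struct and `confChangeInFlight` function are hypothetical, and how the engine would obtain the two index values is left open here.

```go
package main

import "fmt"

// raftView is a hypothetical snapshot of the two raft-level indexes the
// guard would compare. How the engine reads them is not specified here.
type raftView struct {
	PendingConfIndex uint64 // index of the latest possibly-unapplied conf-change
	Applied          uint64 // highest log index applied to the state machine
}

// confChangeInFlight combines the request-local view (pending admin requests
// on this node) with the raft-level view (an unapplied conf-change entry).
func confChangeInFlight(localPendingRequests int, v raftView) bool {
	return localPendingRequests > 0 || v.PendingConfIndex > v.Applied
}

func main() {
	// Newly elected leader: no local admin request, but a conf-change entry
	// at index 120 has not been applied yet (applied = 118).
	fmt.Println(confChangeInFlight(0, raftView{PendingConfIndex: 120, Applied: 118}))

	// Once the entry is applied, the raft-level signal clears.
	fmt.Println(confChangeInFlight(0, raftView{PendingConfIndex: 120, Applied: 120}))

	// Old leader with a waiting admin request: the local signal alone blocks.
	fmt.Println(confChangeInFlight(1, raftView{PendingConfIndex: 90, Applied: 90}))
}
```

Since raft sets the pending index conservatively on a leadership change, a guard of this shape would also reject transfers briefly after every election until the applied index catches up, which errs on the safe side.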

chatgpt-codex-connector · 2026-06-11T05:50:56Z

+|---|---|---|---|
+| **PR0 (prerequisite, §1.1a)** | **Multi-node multi-group topology.** Either (a) extend the bootstrap/flag surface so a group's voter set can span more than one node at startup (per-group members in `--raftGroups` or a companion `--raftBootstrapMembers`-style flag that lifts the `len(groups)==1` guard, `main.go:742-748`), or (b) document + harness the runtime `AddVoter`/`PromoteLearner` composition path (`internal/raftengine/engine.go:225-233`) as the supported way to build multi-voter groups, with an integration topology (extend `scripts/run-jepsen-m5-local.sh`'s single-member-per-group model, `:5-20`) that stands up groups with voters on ≥2 nodes. **OQ-9** decides (a) vs (b). | Bootstrap/flag parsing tests (or AddVoter-composition harness); integration smoke that a group has voters on ≥2 distinct nodes and `TransferLeadershipToServer` succeeds between them. | Yes — unblocks the scheduler; useful on its own (true multi-node multi-group). |
+| **PR1** | Leader-count **observation + metrics only**: a leader-identity reader sibling to `publishLeaderTerms` (`main.go:2126-2143`) that builds the per-node leader-count map on the default-group leader, and exports the `leaderbalance_leaders_per_node` gauge + `leaderbalance_enabled=0`. **No transfers.** Pure observability — shippable even before PR0 (it just observes a trivially-balanced single-voter topology). | Unit: leader-map construction from a fake set of per-group `State()`/`Leader()`; gauge registration. | Yes — observe-only, zero behavior change. |
+| **PR2** (blocked on PR0) | The pure **policy function** (§3.3) + the **scheduler loop** + **transfer execution** behind `--leaderBalance` (default OFF) + runtime kill switch + startup grace (§3.1) + slog + the transfer-result metrics. Local-leader transfer path (case 1, §3.4) and forwarded path (case 2, single preferred target, leader-validated, §3.5). Eligibility: leader-count map **seeded with all voters at zero** (§3.2) + "no-leader skip" + per-group/global cooldowns + startup grace + conf-change skip + `LeadTransferee != 0` skip. The transfer **executes on the group leader**, which owns **all** target catch-up + liveness validation via a **new engine-side guard** — add `TransferLeadershipToServerIfEligible(ctx, id, addr, maxLag)` to the engine `Admin` interface; its `handleTransferLeadership` body checks `Progress.Match >= Commit - maxLag` **and** `Progress.RecentActive` against the leader-only `e.rawNode.Status().Progress[target]` *before* `rawNode.TransferLeader`, returning `errLeadershipTransferTargetNotCaughtUp` (mirrors `handlePromoteLearner`, §3.5). **Progress exposure lands here at the guard, applied identically to case 1 (local engine) and case 2 (remote leader) — there is no separate case-1-only `Status.Progress` exposure.** The same `handleTransferLeadership` also gets the execution-time conf-change guard (rejects when `len(e.pendingConfigs) > 0`, read **under `e.pending.Lock()`** — `cancelPendingConfig` mutates from a non-event-loop goroutine, §3.5 / OQ-12) plus the `PendingConfChange` field on `Status` (also read under `e.pending.Lock()`, OQ-5). The forwarded path must route `RaftAdmin.TransferLeadership` to the gated `…IfEligible` form (§3.5). | Unit (table-driven): policy decisions over crafted leader maps (imbalance threshold, source/target choice, tie-breaks, strict-spread-decrease guard, cooldown gating, startup-grace gating). 
Engine-level: `TransferLeadershipToServerIfEligible` rejects a lagging / not-`RecentActive` target *before* `rawNode.TransferLeader` (no proposals paused) and the conf-change guard rejects under a pending conf-change — both run under `-race`. Integration: **the PR0 multi-voter-group topology** — force all leaders onto one node, enable `--leaderBalance`, assert convergence to ≤ `⌈groups/nodes⌉` per node. Kill-switch + leader-change-reset + startup-grace tests. | After PR0 — completes count-based balancing. |


Move SQS refusal exclusion into first transfer PR

In deployments with partitioned FIFO groups, PR2 is marked as completing count-based balancing and can be enabled before PR3, but its eligibility list omits the SQS/refusal exclusion that this document says is required to avoid guaranteed ping-pong when a group is transferred to a node that immediately refuses it. Either include the partitioned-FIFO/pinning exclusion in the first transfer-issuing milestone or mark PR2 as not independently usable for clusters with those groups.

bootjp · 2026-06-11T06:33:28Z

@claude review

Round-4 dispositions (pushed 1304cfb3, doc-only). All code claims re-verified at file:line.

Finding #1 — maxLag has no wire representation in the case-2 forwarded path (must-fix, OQ-4): ACCEPTED + resolved.
Verified the gap exactly as reported: RaftAdminTransferLeadershipRequest carries only target_id = 1 / target_address = 2 (proto/service.proto:248-251), and the handler maps unconditionally to the un-gated TransferLeadershipToServer (internal/raftadmin/server.go:163). Next free field number is 3.
Adopted the recommended proto extension as the single resolution:

§3.4 now specifies adding optional uint64 max_lag = 3; the server handler routes max_lag > 0 → gated TransferLeadershipToServerIfEligible, max_lag == 0 → un-gated TransferLeadershipToServer. The balancer's case-2 client always sets max_lag = --leaderBalanceMaxTargetLag. Backward-compatible by proto3 construction (absent ⇒ 0 ⇒ ungated); no cmd/raftadmin change (parseTransferTarget leaves the field unset, cmd/raftadmin/main.go:370-378). Noted the version-pinned regen (proto/Makefile:1-3).
OQ-4 rewritten from open to resolved → extend the proto; the future auth/audit operator-vs-automated split is the only part left open (couples with OQ-11), and it no longer blocks PR2's forwarded path.
OQ-13 / §3.5 / PR2 row updated so case 1 (local engine) and case 2 (remote leader via max_lag = 3) share one engine method.

Finding #2 — "change the existing handler" sub-option silently gates cmd/raftadmin (should-resolve): ACCEPTED.
Removed the contradictory two-sub-option language from §3.5 ("(a) change the handler unconditionally / (b) OQ-4 internal RPC"). There is now exactly one path: the max_lag field selects gated vs. ungated. Added an explicit engine-boundary note: both TransferLeadershipToServer and …IfEligible submit the same adminActionTransferLeadership onto handleTransferLeadership (etcd/engine.go:1325-1331, :1743-1771), and the catch-up/liveness guard runs only when maxLag > 0 — so the operator path is provably unaffected. The §3.6 "manual escape hatch" bullet now states the operator force-transfer to a lagging follower stays un-gated.

Additional consistency item I caught while resolving #1/#2 (not in the review, flagged for transparency):
The conf-change guard (OQ-12) is placed in the same handleTransferLeadership and would otherwise have the same "does it gate the operator path?" ambiguity. I made the asymmetry explicit and deliberate in §3.5 / OQ-12 / PR2: the catch-up gate is maxLag-keyed (balancer-only), but the conf-change guard is unconditional (all callers, including cmd/raftadmin) — there is a legitimate operator reason to force a transfer onto a lagging follower (disaster recovery) but none to force one into an in-flight membership change (correctness hazard for any caller). Added a PR2 test asserting the bare TransferLeadershipToServer (maxLag = 0) still transfers to a lagging target while the conf-change guard rejects both methods.

Positively-verified items from the review: left unchanged (lastContact/leader-zero :3481-3483, cancelPendingConfig race :1366-1367/:3158-3165, TransferLeadershipToServerIfEligible mirroring handlePromoteLearner :1674-1688, maxLag = 1024 tuning note, ownership rule, companion gauge, milestone-table lifecycle).

No new codex/gemini inline findings at HEAD ef5dbd9b. Status stays Proposed; doc-only, no Go tests. New HEAD: 1304cfb3.

claude · 2026-06-11T06:33:40Z

Claude finished @bootjp's task in 8m 36s

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 5 — `1304cfb`)

Gather context and understand the request
Read the updated design document (1304cfb)
Verify round-4 dispositions and check pending codex findings (P1: pendingConfigs, P2: SQS exclusion)
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept with targeted corrections. All round-4 dispositions were applied correctly — the max_lag = 3 proto extension is well-specified and backward-compatible, the operator-path behavioral note is clear, and the conf-change guard asymmetry (catch-up gate keyed on maxLag; conf-change guard unconditional) is now explicitly documented. Two items from the codex review at ef5dbd9b were not addressed in 1304cfb3 and need resolution before marking *_partial_*.

Must-address before marking `_partial_`

1. Codex P1 — len(e.pendingConfigs) is empty on a newly elected leader; the conf-change guard has a correctness gap

The proposed guard — len(e.pendingConfigs) > 0 inside handleTransferLeadership and PendingConfChange bool on Status — detects only conf-changes proposed through this node's admin request path. storePendingConfig is called in proposeMembershipChange (internal/raftengine/etcd/engine.go:1707, :1733) only by the node that is the leader at proposal time. The adminRequest struct is never replicated to other nodes.

When leadership changes:

Node A (old leader) proposes a conf-change → A.pendingConfigs[id] = req → quorum commits the entry.
Leadership moves to Node B before the conf-change is applied.
B.pendingConfigs is empty (no local admin request) — resolveConfigChange (etcd/engine.go:3796-3808) calls popPendingConfig(id) when applying the conf-change entry, finds nothing, and returns early. That is correct behavior for the waiter: B has no waiter to notify. But it means the guard len(B.pendingConfigs) > 0 is false even while the conf-change is committed-but-unapplied on Node B.
If the balancer (case 2 forwarded to Node B) fires at this moment, handleTransferLeadership sees len(pendingConfigs) == 0 and proceeds — skipping exactly the conf-change it is supposed to exclude.

Is this practically dangerous? Less than the codex comment suggests, because:

Uncommitted conf-changes (proposed but not committed by quorum before the leader change): etcd/raft truncates the uncommitted tail when a new leader is elected, so Node B never sees them. pendingConfigs empty is correct here.
Committed-but-unapplied conf-changes: Node B applies all committed log entries in drainReady on the event loop, which runs immediately after becoming leader. The startup grace period (§3.1, default max(globalCooldown, interval) ≈ 30–40 s) ensures the balancer does not fire while the new leader is settling. By the time the grace expires, the conf-change has been applied.

So the gap is real but bounded: it exists only if the startup grace expires before a committed conf-change is applied on the new leader, which requires ≥ 30 s delay in event-loop log application — effectively impossible under normal conditions.

What the document should say: The current §3.5 wording implies the guard is complete: "rejects the transfer with a new errLeadershipTransferConfChangePending when len(e.pendingConfigs) > 0" — no caveat. Add a single explanatory sentence alongside the lock note:

"This guard covers conf-changes initiated on the current leader. Conf-changes committed by a previous leader but not yet applied on the new leader are an implicit gap — pendingConfigs is empty on the new leader for those changes. The startup grace period (§3.1) bounds the risk: all committed log entries (including pending conf-changes) are applied before the grace expires, so the balancer cannot fire while such a gap exists in practice."

The same note belongs in OQ-12. If a reviewer later wants a zero-false-negative guard, rawNode.BasicStatus().Commit > e.applied (any unapplied committed entries) is conservative but would fire spuriously under write load and is unnecessary for v1.
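The difference between the two guards discussed in this finding can be sketched as follows. This is a minimal illustration under stated assumptions, not the engine's code: `leaderView`, `localGuard`, and `conservativeGuard` are hypothetical names, and the struct only models the three values the guards read.

```go
package main

import "fmt"

// leaderView is a hypothetical snapshot of what a group leader can read locally.
type leaderView struct {
	pendingConfigs int    // conf-changes proposed through THIS node's admin path
	commit         uint64 // highest committed log index
	applied        uint64 // highest applied log index
}

// localGuard models the proposed check: len(pendingConfigs) > 0.
func localGuard(v leaderView) bool { return v.pendingConfigs > 0 }

// conservativeGuard models the zero-false-negative alternative:
// block while any committed entry is still unapplied.
func conservativeGuard(v leaderView) bool { return v.commit > v.applied }

func main() {
	// New leader B: the previous leader's conf-change is committed at index 42,
	// B has applied up to 41, and B never saw the admin request.
	b := leaderView{pendingConfigs: 0, commit: 42, applied: 41}
	fmt.Println(localGuard(b), conservativeGuard(b)) // false true
}
```

The conservative form also returns true for any ordinary unapplied write, which is why the review treats it as unnecessary for v1.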

2. Codex P2 — PR2 ships without SQS-refusal exclusion; --leaderBalance is unsafe on clusters with partitioned FIFO groups before PR3

PR2 "completes count-based balancing" and is marked "independently shippable after PR0." But the SQS-refusal exclusion — which prevents guaranteed ping-pong when the balancer transfers a partitioned-FIFO group to a node that immediately refuses leadership of it (main_sqs_leadership_refusal.go:69-121) — is in PR3. On any cluster with --sqsFifoPartitionMap set and a non-htfifo-capable node in the voter set, enabling --leaderBalance at PR2 level causes a permanent oscillation loop: balancer transfers the group → refusal hook transfers it away → balancer sees imbalance next cycle (after cooldown) and transfers again.

The §3.5 "Not policy-pinned" bullet correctly states that "v1 excludes partitioned-FIFO groups from balancing entirely" — but this exclusion is not yet implemented in PR2. There is no "independently shippable on non-SQS clusters only" caveat on the PR2 row.

Two acceptable resolutions:

1. Move the SQS group exclusion into PR2. The implementation is trivially small: check partitionedGroupSet[groupID] in the eligibility filter (this set is already constructed at startup from --sqsFifoPartitionMap + shard_config.go:174-196). OQ-2's more ambitious "balance among htfifo-capable nodes" stays in PR3. This is strongly preferred: it matches what §3.5 says v1 does, and it prevents a correctness hazard for any operator who enables PR2 early on an SQS deployment.
2. Add an explicit caveat to PR2's "independently shippable?" column: "After PR0: completes count-based balancing on non-SQS-FIFO clusters only; enabling before PR3 on clusters with --sqsFifoPartitionMap causes ping-pong."

Option (1) is the right call given that the check is a one-liner and §3.5 already commits to the exclusion as a v1 property.
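A minimal sketch of the exclusion in option (1), assuming group IDs are `uint64` and the partitioned-FIFO set is a plain map. `eligibleGroups` is a hypothetical helper; the real filter and the type behind partitionedGroupSet may differ.

```go
package main

import "fmt"

// eligibleGroups returns the candidate groups the balancer may move,
// dropping every group present in the partitioned-FIFO set.
// Both the function name and the uint64 group-ID type are assumptions.
func eligibleGroups(candidates []uint64, partitioned map[uint64]struct{}) []uint64 {
	out := make([]uint64, 0, len(candidates))
	for _, id := range candidates {
		if _, skip := partitioned[id]; skip {
			continue // never balance a group the SQS refusal hook may hand back
		}
		out = append(out, id)
	}
	return out
}

func main() {
	partitioned := map[uint64]struct{}{2: {}}
	fmt.Println(eligibleGroups([]uint64{1, 2, 3}, partitioned)) // [1 3]
}
```

Because the excluded group never enters the candidate list, the transfer-then-refuse loop described above cannot start.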

What's well done in this revision

OQ-4 resolution (max_lag = 3 proto extension): the single-routing-rule server handler (route on field value, not on method name), the backward-compatibility argument (proto3 absent = 0 = ungated), and the explicit statement that cmd/raftadmin leaves the field unset are all correct.
Behavioral asymmetry note (catch-up gate maxLag-keyed vs. conf-change guard unconditional) is now explicit in §3.5, §3.6, and §4 PR2 test plan. The justification ("no legitimate reason to force a transfer into an in-flight conf-change, but there IS a legitimate reason to force onto a lagging follower") is correct.
--leaderBalanceMaxTargetLastContact removed: Progress.RecentActive is the right per-peer leader-side liveness signal and avoids the lastContactFor-always-0-on-leader trap (etcd/engine.go:3481-3483).
Forward-path client note ("mirror dial/credentials in cmd/raftadmin/main.go:72-105; reuse admin connection cache from startAdminFromFlags") gives implementers a concrete starting point rather than a blank canvas.
maxLag = 1024 throughput note is now correctly framed as "conservative at moderate throughput" rather than "one election timeout of log" — the derivation for high-throughput (raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg) is accurate.
PR2 test plan completeness: the three-way routing test (operator maxLag=0 → ungated, balancer maxLag>0 → gated, conf-change guard → both methods) captures the key behavioral invariants.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1304cfb394

chatgpt-codex-connector · 2026-06-11T06:34:59Z

+  - **`Status.LastContact` is the WRONG signal for target liveness and is identically 0 on a leader (codex P2 / claude #1).** Round 2 specified a case-1 liveness gate of `Status.LastContact < electionTimeout/2`. That is ineffective: `lastContactFor` returns `0` unconditionally when the local engine is `StateLeader` (`internal/raftengine/etcd/engine.go:3481-3483`), and the field it derives from is a **single** value (`e.lastLeaderContactFrom` / `e.lastLeaderContactAt`, set in `recordLeaderContact`, `:3540-3546`) describing *this follower's contact with ITS leader* — there is no per-peer leader→follower tracking. In a case-1 transfer the balancer host **is** the leader, so its own `Status().LastContact` is always `0`, `0 < electionTimeout/2` is trivially true, and the gate would pass even for a partitioned/dead target. The correct per-peer leader-side liveness signal is **`Progress.RecentActive`** (`go.etcd.io/raft/v3 tracker/progress.go:87-91` — "true if the progress is recently active … reset to false after an election timeout"), which is in the leader-only `Progress` map alongside `Match`.
+  - **Ownership rule (unchanged, now exhaustive):** target catch-up *and* liveness validation is the **responsibility of the node that executes the transfer — which is always the group leader** (case 1: the balancer host itself; case 2: the remote source-group leader the request is forwarded to, §3.4). The group leader is the only node that holds the live `Progress` map. The balancer host does **not** assert "target X is caught up / live"; it expresses a **load preference** (the least-loaded eligible voter, or an ordered preference list least-loaded→most-loaded) and the executing leader applies catch-up + liveness as the final, authoritative gate. This holds for **both** sub-cases — there is no longer a separate "case-1 pre-filter on `Status`" path, because the only place the signal exists is inside the engine on the leader.
+  - **Concrete eligibility predicate, enforced by a new engine-side guard `handleTransferLeadership` runs *before* `rawNode.TransferLeader`** (mirroring how `handlePromoteLearner` already gates on `Progress.Match`, `internal/raftengine/etcd/engine.go:1674-1688`). For the resolved `target.NodeID`, read `e.rawNode.Status().Progress[target.NodeID]` on the event loop and require **all** of:
+      - **Caught up:** `progress.Match >= status.Commit - maxLag` (entry units; default **`--leaderBalanceMaxTargetLag = 1024` entries**). On the executing leader both `progress.Match` and the commit index are local reads. A transfer to a target more than `maxLag` behind is rejected before it can pause proposals. **`maxLag = 1024` is a conservative default that keeps the transfer window short at moderate append throughput; it is NOT a fixed "one election timeout" of log — at high write rates 1024 entries is a small fraction of an election timeout (and would reject most targets), at low rates it is many.** Operators running high-throughput workloads should raise it toward roughly `raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg` entries (the log a healthy follower can fall behind within one election timeout at that throughput); tighter values risk never finding an eligible target under steady write load, looser values risk a longer write stall on the new leader's group while raft catches the target up.


Saturate the commit-minus-lag check

With the proposed uint64 predicate, clusters whose commit index is below the default lag budget hit unsigned underflow (status.Commit - maxLag becomes a huge value), so even fully caught-up followers are rejected and the PR2 convergence test can never move leaders in a fresh/low-traffic cluster. Make the threshold saturating, e.g. compare against zero when Commit <= maxLag, before relying on this guard.
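The wraparound this comment describes can be reproduced in a few lines. The function names below are illustrative only; the arithmetic is the point.

```go
package main

import "fmt"

// naiveEligible is the wrapping form from the quoted diff:
// commit-maxLag underflows whenever commit < maxLag.
func naiveEligible(match, commit, maxLag uint64) bool {
	return match >= commit-maxLag
}

// saturatingEligible clamps the threshold at zero instead of wrapping.
func saturatingEligible(match, commit, maxLag uint64) bool {
	var threshold uint64
	if commit > maxLag {
		threshold = commit - maxLag
	}
	return match >= threshold
}

func main() {
	const maxLag = 1024
	commit, match := uint64(5), uint64(5) // young group, fully caught-up target
	fmt.Println(naiveEligible(match, commit, maxLag))      // false
	fmt.Println(saturatingEligible(match, commit, maxLag)) // true
}
```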

chatgpt-codex-connector · 2026-06-11T06:34:59Z

+2. **The over-loaded source node is a *different* node.** The balancer host is a follower of that group and cannot initiate the transfer locally. It **forwards** the request to the source node's leader of that group, using the in-tree per-group `RaftAdmin` gRPC service: dial the source node's group listener (`rt.spec.address` for that group is in the group's `Configuration`, available via `engine.Configuration(ctx)`, `internal/raftengine/engine.go:213-215`) and call `RaftAdmin.TransferLeadership` with `TargetId`/`TargetAddress` set to the chosen target **and the new `max_lag` field set to `--leaderBalanceMaxTargetLag`** (see the proto extension below). The receiving node's engine is the leader of that group, so `handleTransferLeadership` accepts it — and because `max_lag > 0` the handler routes to the **gated** `TransferLeadershipToServerIfEligible` (§3.5), so the remote leader's engine-side catch-up/liveness guard runs there.
+
+**Forward-path routing mechanism — decision: extend `RaftAdminTransferLeadershipRequest` with an optional `max_lag` field (resolves OQ-4).** Today `RaftAdminTransferLeadershipRequest` carries only `target_id = 1` and `target_address = 2` (`proto/service.proto:248-251`), and the handler maps unconditionally to the **un-gated** `TransferLeadershipToServer` (`internal/raftadmin/server.go:163`). There is no field on the wire that can carry the balancer's `maxLag` to the remote leader, so the case-2 forwarded path has **no way to reach the gated `…IfEligible` engine method** without a wire change. PR2 adds an optional `uint64 max_lag = 3` field (the next free field number) to `RaftAdminTransferLeadershipRequest`:
+   - **Backward compatible by proto3 construction.** An absent field decodes to the zero value, so any existing client — every `cmd/raftadmin` call, which builds the request with only `TargetId`/`TargetAddress` set (`parseTransferTarget`, `cmd/raftadmin/main.go:370-378`) — sends `max_lag = 0` implicitly. Old and new binaries interoperate in both directions; no `cmd/raftadmin` change is required.


Gate mixed-version forwards before claiming safety

This compatibility claim is unsafe for rolling upgrades: an old RaftAdmin server ignores the new unknown max_lag field and its current handler unconditionally calls the ungated TransferLeadershipToServer (internal/raftadmin/server.go:155-172), so a new balancer forwarding to an old group leader can still transfer to a lagging target and stall writes. The design should require all group leaders to advertise/support the gated handler before enabling the balancer, or otherwise fall back to no-op/retry instead of treating old/new interop as safe.
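The mixed-version hazard can be modelled with two stand-in handlers. This is a sketch under assumptions: `request`, `oldHandler`, and `newHandler` are hypothetical, and the returned strings stand in for the un-gated and gated engine calls.

```go
package main

import "fmt"

// request is a hypothetical stand-in for RaftAdminTransferLeadershipRequest.
type request struct {
	TargetID string
	MaxLag   uint64
}

// oldHandler models a server built before max_lag existed: the field is an
// unknown proto3 field on decode, so the handler never sees it and always
// takes the un-gated path.
func oldHandler(req request) string {
	return "ungated"
}

// newHandler models the proposed routing rule: route on the field value.
func newHandler(req request) string {
	if req.MaxLag > 0 {
		return "gated"
	}
	return "ungated"
}

func main() {
	fromBalancer := request{TargetID: "n2", MaxLag: 1024}
	fmt.Println(oldHandler(fromBalancer)) // ungated
	fmt.Println(newHandler(fromBalancer)) // gated
}
```

The same request yields different semantics depending on the receiver's version, which is why wire-level compatibility alone does not make the forward safe.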

P1 (codex): fix the §3.5 catch-up predicate uint64 underflow. The naive 'Progress.Match >= Commit - maxLag' wraps when Commit < maxLag (young / fresh / low-traffic group), rejecting every target and stalling the balancer. Specify the underflow-safe 'Commit - Match <= maxLag' form (Match <= Commit holds for any non-self follower) and the equivalent saturating-threshold reformulation. Propagate the corrected form to all references (§3.4, §3.5, §4 PR2, §6, OQ-13) and add the young-group test case (Commit < maxLag, caught-up target accepted) to §5 + PR2 tests.

P2 (codex): gate mixed-version forwards. An old RaftAdmin server drops the unknown max_lag proto3 field and executes the forwarded transfer ungated. Add (1) structural ordering: the gated handler and the only emitter (balancer) ship in the same PR2 binary, so single-binary skew is impossible; (2) an operator rollout rule (do not enable --leaderBalance until all nodes run the gated binary), mirroring the encryption mutator rolling-restart discipline; (3) an optional transfer_gate capability pre-check (OQ-14) reusing the in-tree GetCapability fan-out precedent, with a peer_ungated skip metric. Folded into §3.4, §3.6, §4 PR2, §6.

Status stays Proposed; doc-only.

bootjp · 2026-06-11T06:42:50Z

@claude review

Round 5 — addressed both codex findings (pushed 16710a56, doc-only, Status: Proposed).

codex P1 "Saturate the commit-minus-lag check" — ACCEPTED (verified).
The §3.5 catch-up predicate progress.Match >= status.Commit - maxLag is uint64 arithmetic; when Commit < maxLag (default --leaderBalanceMaxTargetLag = 1024) — the steady state of any young/fresh/low-traffic group — Commit - maxLag wraps to ≈2^64, so even a fully caught-up follower (Match == Commit) is rejected and the balancer is permanently stuck (the PR2 convergence test can never move a leader on a fresh cluster). Fixed:

§3.5 now specifies the underflow-safe Commit - Match <= maxLag form, with the proof that Match <= Commit holds for any non-self follower (a leader advances commit only after a quorum has the entry; a follower's Match is bounded by what the leader holds; the one value that can momentarily exceed commit is the leader's own Match, but the guard never runs against self — resolveTransferTarget rejects self). The equivalent explicit saturating-threshold reformulation is given as the alternative; the gap-on-the-left form is preferred (single branch-free expression, self-documenting).
Propagated the corrected form to every other reference (§3.4, §3.5 OQ-13 mention, §4 PR2 row, §6 concurrency lens, OQ-13), each contrasting against the wrong wrapping form so it can't be re-introduced.
Added the young-group test case to §5 (and the PR2 test column): Commit = 5, maxLag = 1024, target Match = 5 → accepted (the case the naïve form rejected); plus Match = 0 accepted, a long-log Commit = 100000/Match = 90000 rejected / Match = 99500 accepted, and a rapid property variant asserting the decision equals Commit - Match <= maxLag and never depends on wraparound.

codex P2 "Gate mixed-version forwards" — ACCEPTED (verified).
Confirmed against internal/raftadmin/server.go:155-172 (handler routes unconditionally to the un-gated TransferLeadershipToServer) and proto/service.proto (syntax = "proto3"). An OLD RaftAdmin server treats the new max_lag = 3 field as an unknown proto3 field, silently drops it, and executes the forwarded transfer ungated — so a new balancer forwarding max_lag > 0 to an old leader gets an ungated transfer while believing it requested a gated one. The "old/new interoperate in both directions" claim was true only for the wire decode, not the gating semantics. Fixed with three layers in §3.4 / §3.6 / §4 / §6 / new OQ-14:

Structural ordering (primary): the gated handler + the max_lag field + the only emitter (the balancer, behind --leaderBalance default OFF) all ship in the same PR2 binary — a binary that can emit max_lag > 0 necessarily honors it on receive, so single-binary skew is impossible.
Operator rollout rule: do not set --leaderBalance=true until every node runs the PR2 gated binary — mirroring the in-tree encryption mutator rolling-restart discipline ("don't enable any mutator RPC until every member reports capable", main_encryption_admin.go:42-49; docs/design/2026_04_29_partial_data_at_rest_encryption.md 6A caveat).
Optional capability pre-check (defense-in-depth, OQ-14): reusing the in-tree GetCapability fan-out precedent (adapter/encryption_admin.go:304-359, main_encryption_fanout.go) and the SQS htfifo poller (adapter/sqs_capability_poller.go:120) — advertise a transfer_gate capability bit; the balancer skips the forward (leaderbalance_skipped_total{reason=peer_ungated}) to any leader not advertising it. Recommendation: structural ordering + rollout rule in PR2 (sufficient when followed), capability pre-check in PR3 coupled with OQ-2's per-node capability publication.
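The capability pre-check in the third layer amounts to a fail-closed lookup before forwarding. The sketch below assumes the balancer holds a per-node map of advertised capabilities; `decision`, `decideForward`, and the map shape are hypothetical, not the repo's API.

```go
package main

import "fmt"

// decision is a hypothetical result type for the balancer's forward step.
type decision string

const (
	forward         decision = "forward"
	skipPeerUngated decision = "skip:peer_ungated"
)

// decideForward forwards only to leaders known to advertise the transfer gate.
// A node missing from the map is treated as not capable (fail closed).
func decideForward(leader string, advertisesGate map[string]bool) decision {
	if advertisesGate[leader] {
		return forward
	}
	return skipPeerUngated
}

func main() {
	caps := map[string]bool{"n1": true, "n2": false}
	fmt.Println(decideForward("n1", caps)) // forward
	fmt.Println(decideForward("n2", caps)) // skip:peer_ungated
	fmt.Println(decideForward("n3", caps)) // skip:peer_ungated
}
```

Treating an unknown node as ungated keeps the balancer idle during a rolling upgrade instead of risking an un-gated transfer.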

Verification notes: proto3 unknown-field semantics confirmed against the in-tree syntax = "proto3" declaration; the field is implicit/optional (absent ⇒ 0 zero-value). Reused the existing capability-gate precedents in the repo rather than inventing a new mechanism — encryption GetCapability Voters∪Learners fan-out + documented rolling-restart discipline, and the SQS htfifo per-peer poller. No code change; Status: Proposed unchanged.

The latest claude review job (round 5, 1304cfb) was still in progress at push time with no findings posted; nothing to fold in. Re-requesting review on 16710a56 above.

claude · 2026-06-11T06:43:06Z

Claude finished @bootjp's task in 6m 3s

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 6 — `16710a5`)

Gather context and understand the request
Read the updated design document (16710a5)
Verify round-5 dispositions and check codex findings (P1: saturating lag check, P2: mixed-version forwards)
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept with one targeted correction. Both round-5 codex findings (P1 saturation, P2 mixed-version) are addressed correctly at the high level, and the mixed-version safety analysis (structural ordering + rollout rule + deferred capability pre-check) is well-reasoned. One substantive flaw was introduced by the P1 fix: the safety argument for the preferred Commit - Match <= maxLag form is incorrect, and the doc incorrectly claims the two forms are equivalent. The explicit saturating form — already presented in the doc as an alternative — is the correct implementation target and should be promoted as the recommended form before PR2.

Must fix before PR2

1. Commit - Match can underflow when Match > Commit — the safety argument is incorrect and the two forms are not equivalent

§3.5 states:

"This subtraction is safe because progress.Match <= status.Commit always holds for any follower — a leader advances commit only after a quorum (including itself) has the entry, and a follower's Match is the highest index the leader has confirmed replicated to that follower, which is bounded above by what the leader itself holds and therefore by the commit index."

The invariant Match <= Commit is false for clusters with quorum > 2. In etcd/raft v3.6, Progress[i].Match is updated when the leader receives a MsgAppResp acknowledging index N from peer i (leader's stepLeader calls pr.MaybeUpdate(m.Index) then r.maybeCommit()). In a 5-node cluster (quorum = 3):

If the leader and peer A have both replicated entry N=100 (so Progress["A"].Match = 100), but peers B, C, D have not yet acknowledged N, then quorum of 3 is not reached and CommitIndex stays at its previous value (say 80).
At this stable intermediate state: Status().Progress["A"].Match = 100 > Status().Commit = 80.
In Go, with both operands held in uint64 variables, 80 - 100 wraps to exactly 2^64 - 20.
(2^64 - 20) <= maxLag (1024) is false → the guard incorrectly rejects a follower that is actually more caught up than the committed state and is an excellent transfer target.
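The 5-node state above can be checked with a simplified commit rule. This sketch assumes the commit index is just the highest index held by a majority and ignores raft's current-term restriction; `quorumCommit` is an illustrative helper, not an etcd/raft function, and it expects a non-empty slice.

```go
package main

import (
	"fmt"
	"sort"
)

// quorumCommit returns the highest index replicated on a majority of voters,
// given every voter's match index (the leader's own included).
func quorumCommit(matches []uint64) uint64 {
	sorted := append([]uint64(nil), matches...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	return sorted[quorum-1]
}

func main() {
	// leader=100, A=100, B=80, C=80, D=80
	matches := []uint64{100, 100, 80, 80, 80}
	commit := quorumCommit(matches)
	matchA := matches[1]
	fmt.Println(commit, matchA > commit) // 80 true
	fmt.Println(commit - matchA)         // 18446744073709551596
}
```

With three voters and matches {100, 100, 80} the same function returns 100, which matches the observation that the state is hard to hold stably on a 3-node cluster.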

The net effect is conservative (rejects a healthy candidate rather than accepting a lagging one), but:

In a 5-node cluster with active writes, this condition is stably observable and not transient — A's Match > Commit persists until two more followers catch up to entry N. During that window, A is always rejected even though it is the best possible transfer target.
If every candidate follower happens to have Match > Commit simultaneously (common during sustained high write throughput on a 5-node cluster), the guard rejects all candidates and the balancer skips the group entirely for a full cycle.
The doc explicitly says the two forms "yield the same result" — this is only true if Match <= Commit holds universally, which it does not.
The property test constraint Match ∈ [0, Commit] deliberately excludes this case; an implementer following the test will miss it.

In a 3-node cluster (quorum = 2), the issue is nearly transient: the leader plus any one follower forms a quorum, so CommitIndex advances as soon as any follower's MsgAppResp is processed (in the same event-loop step as MaybeUpdate). The problematic stable state doesn't arise. But PR0's topology and the §5 integration tests are written generically, and the guard spec should be correct for all supported cluster sizes.

Fix: Swap the recommendation — make the explicit saturating form the primary and demote the Commit - Match form to "equivalent only when Match ≤ Commit":

"Eligible iff progress.Match >= threshold where threshold = status.Commit - maxLag if status.Commit > maxLag, else threshold = 0. Implementation: var threshold uint64; if status.Commit > maxLag { threshold = status.Commit - maxLag }; eligible = progress.Match >= threshold. Do not use status.Commit - progress.Match <= maxLag directly — this is equivalent only when progress.Match <= status.Commit, which is not guaranteed in clusters with quorum > 2."

Also update the rapid property invariant to cover the Match > Commit case: for any Commit, any Match ∈ [0, Commit + maxLag], and any maxLag, the guard should accept exactly when Match >= max(0, Commit - maxLag).
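A brute-force check over a small grid shows where the two forms diverge. This is a sketch of the widened property, not the rapid test itself; the function names and the grid bounds are arbitrary choices for illustration.

```go
package main

import "fmt"

// gapEligible is the "gap on the left" form; commit-match wraps when match > commit.
func gapEligible(match, commit, maxLag uint64) bool {
	return commit-match <= maxLag
}

// saturatingEligible is the explicit saturating-threshold form.
func saturatingEligible(match, commit, maxLag uint64) bool {
	var threshold uint64
	if commit > maxLag {
		threshold = commit - maxLag
	}
	return match >= threshold
}

// disagreements walks commit in [0, limit] and match in [0, commit+maxLag],
// counting pairs where the two forms differ and reporting whether every
// such pair has match > commit.
func disagreements(limit, maxLag uint64) (count int, onlyAhead bool) {
	onlyAhead = true
	for commit := uint64(0); commit <= limit; commit++ {
		for match := uint64(0); match <= commit+maxLag; match++ {
			if gapEligible(match, commit, maxLag) != saturatingEligible(match, commit, maxLag) {
				count++
				if match <= commit {
					onlyAhead = false
				}
			}
		}
	}
	return count, onlyAhead
}

func main() {
	count, onlyAhead := disagreements(50, 8)
	fmt.Println(count, onlyAhead) // 408 true
}
```

Every disagreement falls in the Match > Commit region, where the gap form rejects and the saturating form accepts, so a property test restricted to Match ∈ [0, Commit] cannot tell the two apart.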

What's well done in this revision

codex P2 / mixed-version safety. The three-layer defense (structural ordering + rollout rule + deferred capability pre-check) is correct and well-grounded. Specifically: the structural-ordering argument ("a binary that can emit max_lag > 0 is by construction the same binary that honors it on receive — the emitter and the handler ship in the same PR2 binary") is the right primary defense, and it correctly parallels the encryption mutator rolling-restart discipline (main_encryption_admin.go:42-49). The old-leader-ignores-unknown-field failure mode (internal/raftadmin/server.go:155-172 routes to un-gated TransferLeadershipToServer when MaxLag is absent, confirmed against proto/service.proto:248-251 and proto/service.pb.go:2074-2119) is accurately characterized. OQ-14 is the right deferral point.

Commit - Match <= maxLag fixes the young-group regression. The P1 fix correctly identifies that Match >= Commit - maxLag underflows when Commit < maxLag on a fresh/low-traffic group, and the alternative form avoids that specific failure mode. The young-group test cases (Commit = 5, maxLag = 1024, Match ∈ {0, 5} → accepted; Commit = 100000, Match = 90000 → rejected) are correct for those values and the right regression anchors. The issue above is orthogonal: a different underflow direction (Match > Commit), not covered by those test cases.

cancelPendingConfig race / lock requirement. The doc correctly carries the lock requirement from round 3: submitAdminEx:1366-1367 (cancelPendingConfig called from a non-event-loop ctx.Done() arm) vs. cancelPendingConfig:3158-3165 (e.pending.Lock() before delete) confirms reading len(e.pendingConfigs) in handleTransferLeadership without the lock races under -race. The note is in the right places (§3.5, OQ-12) and the PR2 test plan asserts race-freedom.

maxLag = 1024 tuning note. The corrected framing ("conservative at moderate throughput; at high write rates 1024 entries is a small fraction of an election timeout — tune toward raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg") is accurate and avoids misleading operators who size their cluster for high write throughput.

OQ-14 scope. Correctly defers the capability pre-check to PR3 as defense-in-depth and notes it can subsume OQ-2's htfifo per-node publication, collapsing two mechanisms into one.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 16710a56a0

chatgpt-codex-connector · 2026-06-11T06:44:27Z

+  - **`Status.LastContact` is the WRONG signal for target liveness and is identically 0 on a leader (codex P2 / claude #1).** Round 2 specified a case-1 liveness gate of `Status.LastContact < electionTimeout/2`. That is ineffective: `lastContactFor` returns `0` unconditionally when the local engine is `StateLeader` (`internal/raftengine/etcd/engine.go:3481-3483`), and the field it derives from is a **single** value (`e.lastLeaderContactFrom` / `e.lastLeaderContactAt`, set in `recordLeaderContact`, `:3540-3546`) describing *this follower's contact with ITS leader* — there is no per-peer leader→follower tracking. In a case-1 transfer the balancer host **is** the leader, so its own `Status().LastContact` is always `0`, `0 < electionTimeout/2` is trivially true, and the gate would pass even for a partitioned/dead target. The correct per-peer leader-side liveness signal is **`Progress.RecentActive`** (`go.etcd.io/raft/v3 tracker/progress.go:87-91` — "true if the progress is recently active … reset to false after an election timeout"), which is in the leader-only `Progress` map alongside `Match`.
+  - **Ownership rule (unchanged, now exhaustive):** target catch-up *and* liveness validation is the **responsibility of the node that executes the transfer — which is always the group leader** (case 1: the balancer host itself; case 2: the remote source-group leader the request is forwarded to, §3.4). The group leader is the only node that holds the live `Progress` map. The balancer host does **not** assert "target X is caught up / live"; it expresses a **load preference** (the least-loaded eligible voter, or an ordered preference list least-loaded→most-loaded) and the executing leader applies catch-up + liveness as the final, authoritative gate. This holds for **both** sub-cases — there is no longer a separate "case-1 pre-filter on `Status`" path, because the only place the signal exists is inside the engine on the leader.
+  - **Concrete eligibility predicate, enforced by a new engine-side guard `handleTransferLeadership` runs *before* `rawNode.TransferLeader`** (mirroring how `handlePromoteLearner` already gates on `Progress.Match`, `internal/raftengine/etcd/engine.go:1674-1688`). For the resolved `target.NodeID`, read `e.rawNode.Status().Progress[target.NodeID]` on the event loop and require **all** of:
+      - **Caught up — saturating / underflow-safe comparison (resolves codex round-5 P1):** the lag check must be written so it never underflows `uint64`. The naïve form `progress.Match >= status.Commit - maxLag` is **wrong**: `status.Commit` and `maxLag` are both `uint64`, so when `status.Commit < maxLag` — the steady state of any *young* group whose commit index has not yet reached the default `1024`-entry budget (a fresh cluster, a freshly-split group, a low-traffic group) — `status.Commit - maxLag` wraps to a value near `2^64`. The predicate then rejects **every** target, including a follower that is perfectly caught up (`Match == Commit`), so the balancer is permanently stuck and the PR2 convergence test can never move a leader on a fresh/low-traffic cluster. The guard must instead compare the **gap on the left**, which cannot underflow: `status.Commit - progress.Match <= maxLag`. This subtraction is safe because **`progress.Match <= status.Commit` always holds for any follower** — a leader only advances its own commit index after a quorum (including itself) has the entry, and a follower's `Match` is the highest index the leader has confirmed *replicated to that follower*, which is bounded above by what the leader itself holds and therefore by the commit index; etcd/raft never records a `Match` ahead of the leader's last index, let alone its commit. (The one value that can momentarily exceed the leader's commit is the leader's *own* `Match`, but the guard never runs against self — `resolveTransferTarget` rejects self, §3.5 — so the subtrahend is always a real follower's `Match ≤ Commit`.) 
Equivalently, an explicit saturating threshold `threshold := status.Commit; if status.Commit > maxLag { threshold = status.Commit - maxLag } else { threshold = 0 }` then `progress.Match >= threshold` yields the same result without relying on the `Match ≤ Commit` argument; **the `Commit - Match <= maxLag` form is preferred** because it is a single branch-free expression and self-documents the "gap ≤ budget" intent. Units are log entries; default **`--leaderBalanceMaxTargetLag = 1024` entries**. On the executing leader both `progress.Match` and the commit index are local reads. A transfer to a target more than `maxLag` behind is rejected before it can pause proposals. **`maxLag = 1024` is a conservative default that keeps the transfer window short at moderate append throughput; it is NOT a fixed "one election timeout" of log — at high write rates 1024 entries is a small fraction of an election timeout (and would reject most targets), at low rates it is many.** Operators running high-throughput workloads should raise it toward roughly `raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg` entries (the log a healthy follower can fall behind within one election timeout at that throughput); tighter values risk never finding an eligible target under steady write load, looser values risk a longer write stall on the new leader's group while raft catches the target up. Note the young-group case makes the gate **permissive** on a fresh cluster (`Commit - Match` is small, usually 0), which is correct: there is no large backlog to catch up on, so any live voter is a safe target — the danger the gate exists to prevent (a target tens of thousands of entries behind) only arises once the log is long.


Avoid underflow when Match exceeds Commit

In a leader with uncommitted entries, a follower's Progress.Match can be ahead of the leader's committed index (for example, the target has acknowledged entries that have not reached quorum or are from a prior term), so status.Commit - progress.Match can still underflow. In that scenario the proposed branch-free gate rejects a target that is at least caught up to the commit point, which can make balancing fail spuriously; use a saturating gap such as zero when Match >= Commit rather than relying on Match <= Commit.


chatgpt-codex-connector · 2026-06-11T06:44:27Z

+
+**Forward-path routing mechanism — decision: extend `RaftAdminTransferLeadershipRequest` with an optional `max_lag` field (resolves OQ-4).** Today `RaftAdminTransferLeadershipRequest` carries only `target_id = 1` and `target_address = 2` (`proto/service.proto:248-251`), and the handler maps unconditionally to the **un-gated** `TransferLeadershipToServer` (`internal/raftadmin/server.go:163`). There is no field on the wire that can carry the balancer's `maxLag` to the remote leader, so the case-2 forwarded path has **no way to reach the gated `…IfEligible` engine method** without a wire change. PR2 adds an optional `uint64 max_lag = 3` field (the next free field number) to `RaftAdminTransferLeadershipRequest`:
+   - **Backward compatible by proto3 construction.** An absent field decodes to the zero value, so any existing client — every `cmd/raftadmin` call, which builds the request with only `TargetId`/`TargetAddress` set (`parseTransferTarget`, `cmd/raftadmin/main.go:370-378`) — sends `max_lag = 0` implicitly. Old and new binaries interoperate in both directions; no `cmd/raftadmin` change is required.
+   - **Single routing rule in the server handler.** `RaftAdmin.TransferLeadership` (`internal/raftadmin/server.go:155-172`) routes on the field: when `req.MaxLag > 0` it calls the **gated** `TransferLeadershipToServerIfEligible(ctx, target.ID, target.Address, req.MaxLag)`; when `req.MaxLag == 0` (the existing default for every operator call) it keeps calling the **un-gated** `TransferLeadershipToServer` exactly as today.


Preserve gated routing when max lag is zero

Routing req.MaxLag == 0 to the ungated method makes --leaderBalanceMaxTargetLag=0 a dangerous footgun: an operator trying to require an exactly caught-up target would instead disable the balancer's catch-up/liveness guard and allow transfers to lagging followers that can stall writes. Since zero is a valid strict lag budget, distinguish automated-vs-operator requests by field presence or an explicit mode bit rather than by the numeric value alone.


Round-6 review found the round-5 P1 fix introduced an incorrect safety argument: the preferred `Commit - Match <= maxLag` form underflows uint64 when Match > Commit, which CAN stably occur for a real follower in any cluster with quorum > 2 (e.g. a 5-node group: a follower acks entry N and Progress.Match advances on the MsgAppResp, but CommitIndex lags until a quorum holds N). The doc also wrongly claimed the two subtraction forms are equivalent and that Match <= Commit is a universal invariant. Promote the explicit saturating form as THE recommended implementation target across every place the predicate appears (3.4/3.5/4 PR2/5/6/OQ-13): threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold. Mark BOTH bare subtraction one-liners as incorrect, each with its underflow direction, and remove the equivalence/invariant claim. Extend the §5 test list to cover the Match > Commit counterexample (Commit=80, Match=100 -> accepted) and widen the rapid property invariant range from Match in [0, Commit] to Match in [0, Commit + maxLag] so the underflow case is actually generated. Doc-only. Status: Proposed.

bootjp · 2026-06-11T06:56:24Z

@claude review

Round-6 targeted correction (a19dc1f, doc-only, Status: Proposed). You were right: the round-5 P1 fix introduced an incorrect safety argument. The preferred Commit - Match <= maxLag form underflows uint64 when Match > Commit, which is a stable state for a real follower in any cluster with quorum > 2 — exactly your 5-node counterexample (Progress[A].Match = 100 while Commit = 80 because B/C/D have not yet acked entry N, so uint64(80) - uint64(100) ≈ 2^64 - 20 > maxLag and A is wrongly rejected). I've promoted the explicit saturating form (threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold) as THE recommended implementation target everywhere the predicate appears (§3.4, §3.5, §4 PR2 row, §5, §6, OQ-13), removed the false equivalence claim and the incorrect Match <= Commit invariant, and marked both bare subtraction one-liners as incorrect with their respective underflow directions. The §5 engine test list now covers the Match > Commit case (Commit=80, Match=100 → accepted) and the rapid property range is widened from Match ∈ [0, Commit] to Match ∈ [0, Commit + maxLag] so the underflow regime is actually generated. The tail items you verified positive (cancelPendingConfig lock note, maxLag tuning, OQ-14) are unchanged.
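
A minimal Go sketch of the predicate described above, next to the two bare subtraction forms it replaces. The function names (`eligibleTarget`, `naiveThreshold`, `naiveGap`) are hypothetical and not taken from the elastickv tree; the values come from the 5-node counterexample and the young-group case discussed in this thread.

```go
package main

import "fmt"

// eligibleTarget reports whether a follower is within maxLag entries of the
// leader's commit index. The threshold uses an explicit saturating branch,
// so neither subtraction direction can wrap uint64.
func eligibleTarget(commit, match, maxLag uint64) bool {
	var threshold uint64 // stays 0 when commit <= maxLag (young group)
	if commit > maxLag {
		threshold = commit - maxLag
	}
	return match >= threshold
}

// naiveThreshold is the first incorrect form: commit - maxLag wraps when
// commit < maxLag.
func naiveThreshold(commit, match, maxLag uint64) bool {
	return match >= commit-maxLag
}

// naiveGap is the second incorrect form: commit - match wraps when
// match > commit.
func naiveGap(commit, match, maxLag uint64) bool {
	return commit-match <= maxLag
}

func main() {
	const maxLag = 1024
	// Match > Commit: the saturating form accepts, the gap form wrongly rejects.
	fmt.Println(eligibleTarget(80, 100, maxLag), naiveGap(80, 100, maxLag)) // true false
	// Young group, fully caught up: the threshold form wrongly rejects.
	fmt.Println(eligibleTarget(80, 80, maxLag), naiveThreshold(80, 80, maxLag)) // true false
	// Genuinely lagging follower on a long log is still rejected.
	fmt.Println(eligibleTarget(100000, 90000, maxLag)) // false
}
```

The sketch only illustrates the arithmetic; where the real check lives and how `Match` and `Commit` are read are decided by the design doc, not by this example.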

claude · 2026-06-11T06:56:38Z

Claude finished @bootjp's task in 5m 10s —— View job

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 7 — `a19dc1f`)

Gather context and understand the request
Read the updated design document (a19dc1f)
Verify round-6 disposition correctness — saturating predicate
Check codex P2 finding (max_lag == 0 footgun)
Verify §5 test case arithmetic
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept. The round-5 P1 saturating-predicate fix is correctly and completely implemented at a19dc1f3. All prior-round issues remain correctly addressed. Two items remain: one open codex P2 finding from round 6, and a minor documentation-clarity note on the property-test reference notation.

Should address before PR2 starts

1. Codex round-6 P2 ("Preserve gated routing when max lag is zero") — still unresolved

The codex finding on 16710a56 was not addressed in a19dc1f3. The current routing rule (§3.4, line 111), "when req.MaxLag == 0 (the existing default for every operator call) it keeps calling the un-gated TransferLeadershipToServer", creates a semantic discontinuity for --leaderBalanceMaxTargetLag:

The flag is named "max target lag." An operator (or future-maintainer) reading the flag name will expect --leaderBalanceMaxTargetLag=0 to mean "require the target to be fully caught up — zero lag allowed" (the strictest possible gate). Instead, 0 is the proto3 zero-value sentinel that routes to the un-gated path, disabling the catch-up gate entirely. So --leaderBalanceMaxTargetLag=0 silently has the opposite of its apparent meaning.

This is unlikely to matter in practice (the default is 1024 and operators would have to explicitly choose 0), but the doc commits PR2 to a specific flag contract that is surprising at the boundary. Two acceptable resolutions:

Option A — startup validation (recommended). At PR2 startup, if --leaderBalance=true && --leaderBalanceMaxTargetLag == 0, log a fatal: "leaderBalanceMaxTargetLag must be ≥ 1 when leaderBalance is enabled; use 1 to require the target to be within one entry of the commit index; 0 is reserved for the operator force-transfer path (max_lag absent ⇒ ungated)." Add one sentence to §3.6's tuning flags entry.
Option B — documentation note only. Add an explicit sentence: "--leaderBalanceMaxTargetLag=0 is not a valid balancer configuration — proto3 zero-value is the absent-field sentinel that routes to the un-gated operator path; operators who want the strictest catch-up gate should use 1 (requires Match ≥ Commit - 1)." This is weaker than A but removes the semantic ambiguity.

Option A is preferable because it makes the mis-configuration fail loudly, but either closes the codex finding.
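
Option A could look roughly like the following Go sketch. `validateBalancerFlags` and `errZeroMaxLag` are hypothetical names; only the rule itself (reject 0 when the balancer is enabled) comes from the option as described, and it applies only while routing keys on the numeric value of `max_lag`.

```go
package main

import (
	"errors"
	"fmt"
)

// errZeroMaxLag is a hypothetical error for the Option A startup check.
var errZeroMaxLag = errors.New(
	"leaderBalanceMaxTargetLag must be >= 1 when leaderBalance is enabled")

// validateBalancerFlags rejects the one configuration that would silently
// select the un-gated path when routing keys on max_lag == 0.
func validateBalancerFlags(enabled bool, maxTargetLag uint64) error {
	if enabled && maxTargetLag == 0 {
		return errZeroMaxLag
	}
	return nil
}

func main() {
	fmt.Println(validateBalancerFlags(true, 0))    // the error above
	fmt.Println(validateBalancerFlags(true, 1024)) // <nil>
	fmt.Println(validateBalancerFlags(false, 0))   // <nil>: balancer disabled
}
```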

Minor note (no doc change required, just awareness)

2. Property-test reference expression uses mathematical notation that cannot be written literally in Go uint64

Line 197: "the gate decision equals Match >= max(0, Commit - maxLag) and never panics / never depends on uint64 wraparound."

The max(0, Commit - maxLag) expression uses mathematical notation. In Go, max(uint64(0), Commit - maxLag) would evaluate the subtraction first (wrapping to a huge number when Commit < maxLag) and then max(0, huge) = huge — the same bug the test is supposed to guard against. The property test's reference implementation must use the same saturating conditional from §3.5:

var threshold uint64
if Commit > maxLag { threshold = Commit - maxLag }
expected := Match >= threshold

This is covered by the existing §3.5 text ("Implement it with the explicit saturating branch… and do not write either subtraction form directly"), so no doc change is required. Just a heads-up for the PR2 test author.
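
For the PR2 test author, one way to cross-check the saturating gate is against an oracle that cannot wrap at all. The sketch below is an assumption-laden illustration: it uses `math/rand` and `math/big` from the standard library rather than the rapid property-testing library the doc refers to, and `gate`, `oracle`, and `checkProperty` are hypothetical names. The oracle states the same condition as `Match + maxLag >= Commit` in arbitrary precision.

```go
package main

import (
	"fmt"
	"math/big"
	"math/rand"
)

// gate is the predicate under test, written with the explicit saturating branch.
func gate(commit, match, maxLag uint64) bool {
	var threshold uint64
	if commit > maxLag {
		threshold = commit - maxLag
	}
	return match >= threshold
}

// oracle evaluates match + maxLag >= commit in arbitrary precision, so it
// cannot wrap regardless of the inputs.
func oracle(commit, match, maxLag uint64) bool {
	lhs := new(big.Int).SetUint64(match)
	lhs.Add(lhs, new(big.Int).SetUint64(maxLag))
	return lhs.Cmp(new(big.Int).SetUint64(commit)) >= 0
}

// checkProperty draws match from [0, commit+maxLag] so that the
// match > commit regime is actually generated.
func checkProperty(seed int64, iterations int) error {
	rng := rand.New(rand.NewSource(seed))
	for i := 0; i < iterations; i++ {
		commit := uint64(rng.Int63n(1 << 20))
		maxLag := uint64(rng.Int63n(1 << 12))
		match := uint64(rng.Int63n(int64(commit+maxLag) + 1))
		if gate(commit, match, maxLag) != oracle(commit, match, maxLag) {
			return fmt.Errorf("mismatch: commit=%d match=%d maxLag=%d",
				commit, match, maxLag)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkProperty(1, 100000)) // <nil> when gate and oracle agree
}
```

The value ranges (commit below 2^20, maxLag below 2^12) are arbitrary choices for the sketch; a real property test would presumably also probe values near the uint64 limits.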

Verification of round-5 P1 fix (saturating predicate)

The §5 test arithmetic is now correct:

Commit = 80, Match = 100, maxLag = 1024 → threshold = max(0, 80 - 1024) = 0 → 100 >= 0 → accepted ✓
Commit = 100000, Match = 100050, maxLag = 1024 → threshold = 98976 → 100050 >= 98976 → accepted ✓
Commit = 100000, Match = 90000, maxLag = 1024 → threshold = 98976 → 90000 < 98976 → rejected ✓
Commit = 100000, Match = 99500, maxLag = 1024 → threshold = 98976 → 99500 >= 98976 → accepted ✓

The two-direction underflow analysis in §3.5 (lines 143–144) is correct and clearly motivated: Match >= Commit - maxLag wraps when Commit < maxLag (young group); Commit - Match <= maxLag wraps when Match > Commit (quorum > 2 stable intermediate state). The Match > Commit example (Progress[A].Match = 100, Commit = 80, quorum not yet reached at B/C/D) is accurately verified against etcd/raft v3.6 MaybeUpdate + maybeCommit sequencing. The rapid property range Match ∈ [0, Commit + maxLag] correctly includes the Match > Commit regime.

The saturating form is consistent across every site in the document (§3.4, §3.5, §4 PR2 row, §5, §6, OQ-13). All prior-round issues (LastContact-always-0-on-leader, cancelPendingConfig lock, OQ-4 proto extension, mixed-version rollout rule, voter map seeding, SQS exclusion in PR2) remain correctly addressed.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a19dc1f325

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T06:59:18Z

+  - **Source node** = the node with the **most** leaders (tie-break: lexicographically smallest node ID, so the decision is reproducible in tests).
+  - **Group to move** = among the groups led by the source node that are *eligible* (§3.5), pick deterministically (tie-break by group ID ascending).
+  - **Target node** = among nodes that are **voters of that group**, **healthy** (§3.5), and **not policy-excluded for that group** (§3.5), the one with the **fewest** leaders (tie-break: smallest node ID). The transfer is skipped if moving the leader to the target would not strictly reduce the spread (i.e. `targetCount + 1 > sourceCount - 1`), so a transfer never makes things worse or merely shuffles a tie.


Fall back to the next eligible source node

When the most-loaded node has no transferable groups because its leaders are pinned, in cooldown, mid-conf-change, or otherwise ineligible, this fixed source choice makes the scheduler skip even if the second-most-loaded node can still reduce the spread. For example, with counts A=5, B=4, C=0 and all A-led groups pinned, moving one eligible B-led group to C reduces the spread from 5 to 4, but this policy never considers B after selecting A. The decision function should rank candidate source nodes by count and choose the first eligible move that strictly reduces spread, rather than stopping at the absolute max node.


…r balance

bootjp · 2026-06-11T07:25:36Z

@claude review

Round 8 (e789c20) — doc-only, addresses the two remaining codex P2 items and the claude clarity note. Status stays Proposed.

Chosen resolution for codex P2 #3393836081 ("Preserve gated routing when max lag is zero"): Option (b) — a separate bool gated = 3 flag plus uint64 max_lag = 4 (not Option (a) proto3 explicit presence).

Why (b) over (a): the pinned toolchain (libprotoc 29.3 / protoc-gen-go v1.36.11) does support proto3 explicit presence (optional, stable since protobuf 3.15 / protoc-gen-go 1.27), but no in-tree .proto uses an optional field today (verified across proto/*.proto), so the Has_*/pointer-deref ergonomics would be a first; an explicit bool gated reads unambiguously at the handler and in tests. Recorded the (a) alternative and the toolchain-support fact in §3.4 / OQ-4 for the record.
Semantics now: routing keys on gated, not on the numeric value. gated = false (field absence — the default every existing cmd/raftadmin client sends) ⇒ ungated operator path; gated = true ⇒ gated path with max_lag as the budget, where max_lag = 0 is now the strictest gate (threshold = Commit, require Match >= Commit), exactly matching the flag name. So --leaderBalanceMaxTargetLag = 0 means strict gating; only field absence is ungated. Threaded through §3.4, §3.5 (engine-side gate now keyed on the method/"apply gate" marker, not maxLag > 0), §3.6 flag table (added the explicit "= 0 is a valid strict setting" note), §4 PR2 row + tests, §6, OQ-4, OQ-13. Verified against proto/service.proto:248-251 (only target_id=1/target_address=2 today, syntax = "proto3" at line 1) and the un-gated mapping at internal/raftadmin/server.go:155-172,163.
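
The routing rule above can be sketched in Go as follows. `transferRequest` is a hypothetical stand-in for the generated `RaftAdminTransferLeadershipRequest` after the proposed `gated = 3` / `max_lag = 4` fields are added, and `chooseRoute` is an illustrative helper, not the actual handler.

```go
package main

import "fmt"

// transferRequest is a hypothetical stand-in for the generated request type
// once PR2 adds the two proposed fields.
type transferRequest struct {
	TargetID      string
	TargetAddress string
	Gated         bool   // proposed field 3; absent decodes to false
	MaxLag        uint64 // proposed field 4; only meaningful when Gated
}

type route int

const (
	routeUngated route = iota // operator force-transfer path
	routeGated                // balancer path with a catch-up budget
)

// chooseRoute keys on Gated alone, so MaxLag == 0 under Gated means the
// strictest budget (Match >= Commit), never "disable the gate".
func chooseRoute(req transferRequest) (route, uint64) {
	if req.Gated {
		return routeGated, req.MaxLag
	}
	return routeUngated, 0
}

func main() {
	r, lag := chooseRoute(transferRequest{Gated: true, MaxLag: 0})
	fmt.Println(r == routeGated, lag) // true 0
	r, _ = chooseRoute(transferRequest{TargetID: "n2"})
	fmt.Println(r == routeUngated) // true
}
```

A request built the way existing `cmd/raftadmin` clients build it (neither new field set) lands on the un-gated route, which is the backward-compatibility property the comment relies on.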

codex P2 #3393909075 ("Fall back to the next eligible source node"): fixed in §3.3. The policy no longer stops at the single most-loaded node — it iterates source candidates in descending leader-count order (tie-break smallest node ID) and takes the first with an eligible (group, target) pair that strictly reduces the spread; falls through to the next source if the max node has no eligible move (all pinned / in cooldown / mid-conf-change). One-transfer-per-cycle and full determinism preserved (only the source-try order is added; all tie-breaks unchanged); the default-group "balance last" rule still applies on top. Added the A=5-all-pinned / B=4 / C=0 → B → C case to §5 unit tests (plus three companion cases: stay-on-max, no-move-anywhere, fall-through-past-B-when-target-wouldn't-reduce) and a rapid invariant ("if any eligible strict-reducing pair exists, a transfer is returned").
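
A simplified Go sketch of the fall-through policy described above. It is illustrative only: the `group` and `transfer` types and `pickTransfer` are hypothetical, health and policy-exclusion filters are folded into a single `Eligible` flag, cooldowns and the default-group "balance last" rule are omitted, and "strictly reduces the spread" is checked by recomputing the global max-minus-min, which is one reading of the rule.

```go
package main

import (
	"fmt"
	"sort"
)

// group is a hypothetical, simplified view of one Raft group.
type group struct {
	ID       uint64
	Leader   string
	Voters   []string // assumed to be a subset of the node list
	Eligible bool     // false when pinned, in cooldown, mid-conf-change, ...
}

type transfer struct {
	Group    uint64
	From, To string
}

// spread returns the max-minus-min leader count.
func spread(counts map[string]int) int {
	lo, hi, first := 0, 0, true
	for _, c := range counts {
		if first || c < lo {
			lo = c
		}
		if first || c > hi {
			hi = c
		}
		first = false
	}
	return hi - lo
}

// pickTransfer tries sources from most to fewest leaders and returns the
// first move that strictly reduces the spread.
func pickTransfer(nodes []string, groups []group) (transfer, bool) {
	counts := make(map[string]int, len(nodes))
	for _, n := range nodes {
		counts[n] = 0 // zero-seed so nodes leading nothing are still counted
	}
	for _, g := range groups {
		counts[g.Leader]++
	}
	before := spread(counts)

	sources := append([]string(nil), nodes...)
	sort.Slice(sources, func(i, j int) bool {
		if counts[sources[i]] != counts[sources[j]] {
			return counts[sources[i]] > counts[sources[j]]
		}
		return sources[i] < sources[j]
	})
	byID := append([]group(nil), groups...)
	sort.Slice(byID, func(i, j int) bool { return byID[i].ID < byID[j].ID })

	for _, src := range sources {
		for _, g := range byID {
			if g.Leader != src || !g.Eligible {
				continue
			}
			target := ""
			for _, v := range g.Voters {
				if v == src {
					continue
				}
				if target == "" || counts[v] < counts[target] ||
					(counts[v] == counts[target] && v < target) {
					target = v
				}
			}
			if target == "" {
				continue
			}
			// Trial-apply the move, measure the spread, then undo it.
			counts[src]--
			counts[target]++
			after := spread(counts)
			counts[src]++
			counts[target]--
			if after < before {
				return transfer{Group: g.ID, From: src, To: target}, true
			}
		}
	}
	return transfer{}, false
}

func main() {
	nodes := []string{"A", "B", "C"}
	var groups []group
	for id := uint64(1); id <= 9; id++ {
		g := group{ID: id, Leader: "A", Voters: nodes}
		if id > 5 {
			g.Leader, g.Eligible = "B", true // groups 6..9: led by B, movable
		}
		groups = append(groups, g)
	}
	fmt.Println(pickTransfer(nodes, groups)) // {6 B C} true
}
```

In the A=5-all-pinned / B=4 / C=0 case the sketch skips A, then moves the lowest-ID eligible B-led group to C, reducing the spread from 5 to 4.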

claude clarity note (property-test reference notation): clarified §5 line — the max(0, Commit - maxLag) notation is mathematical; the property oracle must use the explicit saturating branch (var threshold uint64; if Commit > maxLag { threshold = Commit - maxLag }; expected := Match >= threshold), never Go max(uint64(0), Commit - maxLag) (which evaluates the wrapping subtraction first). Stated as Match ≥ saturating_sub(Commit, maxLag) with the caveat inline.

No newer top-level review comments after 2026-06-11T07:00:00Z. Doc-only; pushed e789c204.

claude · 2026-06-11T07:25:53Z

Claude encountered an error after 6m 48s —— View job

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 8 — `e789c20`)

Gather context and understand the request
Read the updated design document (e789c20)
Verify code citations for round-8 changes
Check correctness of bool gated = 3 / uint64 max_lag = 4 proto design
Check correctness of multi-source fallback policy (§3.3)
Review for new issues introduced by revisions
Post comprehensive review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e789c2046b


chatgpt-codex-connector · 2026-06-11T07:28:08Z

+- **No conf-change in flight — enforced on the executing leader, not only at the balancer's decision time.** Skip a group whose membership is changing — a transfer racing a conf-change can land on a member about to be removed. Two distinct enforcement points are required, because the balancer's decision-time read and the transfer's execution can happen on different nodes (case 2, §3.4):
+  - **Decision-time observability (resolves OQ-5).** Surface the existing per-group pending-config state through the exported `Status`: the etcd backend already tracks `pendingConfigs` (`map[uint64]adminRequest`, guarded by `e.pending`, `internal/raftengine/etcd/engine.go:390`, `:563`, populated in `storePendingConfig` `:3138`), but it is not on the public `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`). Add a `PendingConfChange bool` field to `Status` populated from `len(e.pendingConfigs) > 0` (mirroring how `LeadTransferee` was added to the struct), rather than leaking etcd internals via `rawNode.BasicStatus().Config.PendingConfIndex` past the engine boundary or adding a dedicated `Admin` method. This lets the balancer skip a group whose leader it *can* see has a pending conf-change. In **case 1** (the balancer host leads the group) this read is local and authoritative.
+  - **Execution-time guard on the group leader (required for case 2).** Decision-time `Status` is **not sufficient for the forwarded path**: in case 2 the balancer is a *follower* of the target group, so its local `Status().PendingConfChange` reflects only the local node and not the remote leader's `pendingConfigs`. And the current `RaftAdmin.TransferLeadership → TransferLeadershipToServer → handleTransferLeadership` path checks only leader state and raft's transfer acceptance (`internal/raftengine/etcd/engine.go:1743-1771`) — it does **not** consult `pendingConfigs`. Surfacing the field in `Status` alone therefore does not stop a forwarded transfer from racing an Add/Remove on the remote leader. PR2 must add a **server-side guard on the executing leader**: `handleTransferLeadership` rejects the transfer with a new `errLeadershipTransferConfChangePending` when `len(e.pendingConfigs) > 0`, before calling `rawNode.TransferLeader`. **The read must acquire `e.pending.Lock()` (claude round-3 #2):** running on the event loop is *not* by itself enough to make `e.pendingConfigs` consistent for a reader, because `cancelPendingConfig` deletes entries under `e.pending.Lock()` from a **non-event-loop goroutine** — the `ctx.Done()` arm of `submitAdminEx` (`internal/raftengine/etcd/engine.go:1366-1367`, `cancelPendingConfig` at `:3158-3165`). Reading `len(e.pendingConfigs)` without the lock is a data race under `-race`. The guard therefore snapshots `hasPending := len(e.pendingConfigs) > 0` under `e.pending.Lock()` and rejects when true. The same locking applies to the `PendingConfChange bool` field on `Status` (OQ-5): the `refreshStatus`/`Status()` snapshot path must read `len(e.pendingConfigs)` under `e.pending.Lock()` before writing the cached `e.status`. The balancer maps the rejection to `leaderbalance_skipped_total{reason=conf_change}` and retries next cycle. 
This makes the conf-change exclusion enforceable regardless of which node executes the transfer, which fits the §3.4 ownership rule (the executing leader owns the safety gate). **Unlike the catch-up/liveness gate (which is selected by the gated engine method — i.e. by `gated = true` on the wire — so it touches only the balancer path), the conf-change guard is *unconditional* — it applies to every transfer regardless of method or `maxLag`, including the operator `cmd/raftadmin` path.** That asymmetry is deliberate and is *not* a re-run of the catch-up-gate contradiction: a transfer racing a membership change can land leadership on a member about to be removed, which is a correctness hazard for *any* caller — there is no legitimate operator reason to force a transfer into an in-flight conf-change, whereas there *is* a legitimate operator reason to force a transfer onto a lagging follower (disaster recovery). So the conf-change guard correctly has no opt-out, while the catch-up gate correctly does (via `gated = false` / the un-gated method — **not** via `maxLag = 0`, which under the gated method is the strictest budget). **OQ-12** records the guard placement (in `handleTransferLeadership` vs. a check in the `RaftAdmin.TransferLeadership` server wrapper).
+- **No transfer already in flight for the group.** Skip a group whose leader already reports `Status().LeadTransferee != 0` (`internal/raftengine/engine.go:80-82`, populated on the leader, `etcd/engine.go:1431`). Issuing a second `TransferLeadershipToServer` while the first is still settling in `waitForLeadershipTransfer` would hit `errLeadershipTransferRejected` (`etcd/engine.go:1765-1768`) and log a spurious failure. In case 1 the balancer reads this locally; in case 2 the remote leader enforces it at submit time. Counted under `leaderbalance_skipped_total{reason=transfer_in_flight}`.
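
The locking requirement in the conf-change bullet can be illustrated with a reduced Go sketch. The `engine` type here is a hypothetical stand-in that keeps only a mutex and a map; the field names echo the ones cited above, but nothing else about the real etcd engine is modelled, and `errConfChangePending` is an illustrative name.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errConfChangePending = errors.New("leadership transfer rejected: conf change pending")

// engine is a hypothetical, heavily reduced stand-in for the etcd engine.
type engine struct {
	pending        sync.Mutex
	pendingConfigs map[uint64]struct{}
}

func newEngine() *engine {
	return &engine{pendingConfigs: make(map[uint64]struct{})}
}

func (e *engine) storePendingConfig(id uint64) {
	e.pending.Lock()
	defer e.pending.Unlock()
	e.pendingConfigs[id] = struct{}{}
}

// cancelPendingConfig may run on a goroutine other than the event loop,
// which is why readers must also take the lock.
func (e *engine) cancelPendingConfig(id uint64) {
	e.pending.Lock()
	defer e.pending.Unlock()
	delete(e.pendingConfigs, id)
}

// guardTransfer snapshots the pending state under the lock and rejects the
// transfer when any conf-change is in flight.
func (e *engine) guardTransfer() error {
	e.pending.Lock()
	hasPending := len(e.pendingConfigs) > 0
	e.pending.Unlock()
	if hasPending {
		return errConfChangePending
	}
	return nil
}

func main() {
	e := newEngine()
	fmt.Println(e.guardTransfer()) // <nil>
	e.storePendingConfig(7)
	fmt.Println(e.guardTransfer()) // rejected
	e.cancelPendingConfig(7)
	fmt.Println(e.guardTransfer()) // <nil>
}
```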


Add an execution-time in-flight-transfer guard

In the forwarded case where the balancer is only a follower of this group, its local Status().LeadTransferee cannot prove the remote leader is idle, and etcd/raft does not reject a second transfer to a different target: in v3.6 MsgTransferLeader aborts the previous transfer and starts the new one (raft.go:1637-1645). Unless PR2 also adds an explicit BasicStatus().LeadTransferee != 0 rejection inside handleTransferLeadership before rawNode.TransferLeader, this “remote leader enforces it at submit time” statement is false and the scheduler can cancel an existing/manual transfer and extend the write-stall window.


bootjp · 2026-06-11T08:50:45Z

@claude review

claude · 2026-06-11T08:51:03Z

Claude finished @bootjp's task in 6m 47s —— View job

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 9 — `e789c20`)

Gather context and understand the request
Read the updated design document (e789c20)
Verify codex P2 finding at line 142 — in-flight-transfer guard
Verify round-8 changes: bool gated = 3 / uint64 max_lag = 4 proto split
Verify round-8 multi-source fallback policy (§3.3)
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept with one must-fix. All round-8 changes are correct — the bool gated/uint64 max_lag proto split is well-specified, the multi-source fallback policy is right, and the property-test notation caveat is accurate. The new codex P2 finding at line 142 (in-flight-transfer guard) is substantive and correct: the design's claim "in case 2 the remote leader enforces it at submit time" is false, and a second transfer to a different target silently cancels the in-flight one rather than returning an error. This is the only issue to resolve before PR2.

Must-fix before PR2 starts

1. Codex P2 at line 142 — "remote leader enforces it at submit time" is incorrect; an execution-time in-flight-transfer guard is missing

The design at line 142 states: "Issuing a second TransferLeadershipToServer while the first is still settling in waitForLeadershipTransfer would hit errLeadershipTransferRejected (etcd/engine.go:1765-1768)" and "In case 1 the balancer reads this locally; in case 2 the remote leader enforces it at submit time." Both claims are incorrect. Verified against the in-tree code:

What handleTransferLeadership actually does when LeadTransferee = A and a new transfer to valid voter B arrives:

e.rawNode.TransferLeader(B) is called (:1758).
According to etcd/raft v3.6 raft.go:1637-1645: if leadTransferee != 0 and the new target differs from the current one, raft aborts the in-flight transfer to A and starts a new one to B, setting r.leadTransferee = B.
The post-call check at :1765, e.rawNode.BasicStatus().LeadTransferee != target.NodeID, evaluates to B != B, which is false, so no error is returned.
The second transfer proceeds; the first caller's waitForLeadershipTransfer eventually sees LeadTransferee != A, and since it set *sawPending = true on an earlier poll (:1431-1432), it returns errLeadershipTransferAborted (:1435-1436).

So errLeadershipTransferRejected at :1765-1768 fires only when raft silently drops the request outright (target is a learner, self, or has no Progress entry) — not when it aborts one transfer to start another. The claim that the error fires for the in-flight case is wrong.

Why case 2 is uniquely exposed: Status.LeadTransferee is populated from basic.LeadTransferee (etcd/engine.go:2908), which is rawNode.BasicStatus().LeadTransferee. That value is the leader's r.leadTransferee — non-zero on the leader only, always 0 on followers (per the raftengine.Status comment at internal/raftengine/engine.go:78-82: "zero otherwise, including on followers"). In case 2 the balancer is a follower of the group, so its local Status().LeadTransferee is always 0. The observation-time skip ("skip a group whose leader already reports LeadTransferee != 0") is therefore a no-op for case 2 — the balancer always sees 0 locally regardless of what the remote leader's state is. And the remote handleTransferLeadership has no guard for this either. The net result: the balancer in case 2 can silently cancel an operator's cmd/raftadmin leadership_transfer_to_server or another system-initiated transfer.

Fix: Add an execution-time guard in handleTransferLeadership that reads BasicStatus().LeadTransferee before calling rawNode.TransferLeader (:1758) and rejects when non-zero:

// Between the StateLeader check and rawNode.TransferLeader:
if e.rawNode.BasicStatus().LeadTransferee != 0 {
    req.done <- adminResult{err: errors.WithStack(errLeadershipTransferAlreadyInFlight)}
    return
}

BasicStatus() is already called on the event loop here (:1754), so no new locking is needed. Unlike the conf-change guard (which needs e.pending.Lock() because cancelPendingConfig mutates from off the loop), this read is race-free on the single-threaded event loop.

Should this guard be unconditional (like conf-change) or keyed on gated? The conf-change guard is unconditional because "there is no legitimate operator reason to force a transfer into an in-flight membership change." The in-flight-transfer guard is the same: canceling another caller's in-flight transfer is almost never intentional, and the cost (wait for the current transfer to complete or timeout, then retry) is low. Making it unconditional also protects case 1 against the TOCTOU window between observation and execution. Recommend unconditional; note it explicitly (same asymmetry note as conf-change: "unlike the catch-up/liveness gate which is keyed on the gated method marker, this guard has no opt-out — there is no legitimate caller reason to cancel a peer's in-flight transfer").

Concretely, the design should correct line 142:

Replace "would hit errLeadershipTransferRejected" with: "raft would abort the in-flight transfer to the first target and start the new one — rawNode.BasicStatus().LeadTransferee would be updated to the new target, errLeadershipTransferRejected would NOT be returned (that error fires only when raft silently drops the request: learner/self/no-Progress target, not for an in-flight-cancel). The balancer would thus have silently canceled an operator's or system's transfer."
Replace "in case 2 the remote leader enforces it at submit time" with the correct execution-time guard description (paralleling the conf-change guard).
Add errLeadershipTransferAlreadyInFlight to the PR2 scope in §4 (alongside errLeadershipTransferConfChangePending).
Add to the §6 concurrency lens: "Verify the execution-time in-flight-transfer guard (LeadTransferee != 0 before rawNode.TransferLeader) is present in handleTransferLeadership and unconditional, so the balancer's case-2 forward path cannot silently cancel an operator's cmd/raftadmin transfer."


What's well done in this revision

bool gated = 3 / uint64 max_lag = 4 proto split (round-8). The semantics are correct and fully specified: gated = false (proto3 default, all existing clients) → un-gated operator path; gated = true + max_lag → gated, where max_lag = 0 is the strictest gate (require Match >= Commit, not a disable signal). The behavioral impact on cmd/raftadmin (leaves both fields unset → ungated force-transfer preserved) is stated explicitly, as is the backward-compat argument (absent gated ⇒ false ⇒ ungated). The chosen form over optional uint64 max_lag with Has_max_lag is correctly justified by the absence of optional fields in the in-tree .proto files.

Multi-source fallback (§3.3, round-8). The descending-count iteration with fall-through is correct. The four companion test cases (§5) are all correct and together cover the boundary conditions: (a) A-has-eligible-group stays on A; (b) all-pinned no-decision; (c) second-source target wouldn't reduce spread → fall through; (d) A=5-pinned B=4-eligible-C=0 → B→C. The rapid property invariant ("if any eligible strict-reducing pair exists, a transfer is returned") is the right specification and closes the case where the single max-node is pinned.

Property-test notation caveat (round-8 / claude round-7). The warning that max(0, Commit - maxLag) is mathematical notation, not Go code, and that the reference oracle must use the explicit saturating branch (var threshold uint64; if Commit > maxLag { threshold = Commit - maxLag }) is correct and concretely prevents the most likely mistranscription (Go max(uint64(0), Commit - maxLag) evaluates the wrapping subtraction first).

All prior-round fixes remain intact: saturating-threshold predicate (both underflow directions documented and tested), voter-map zero-seed, conf-change guard with e.pending.Lock(), LastContact-always-0-on-leader correction, TransferLeadershipToServerIfEligible engine method, OQ-4 resolved, mixed-version rollout rule, --leaderBalanceMaxTargetLag = 0 footgun fix, SQS exclusion in PR2.


bootjp · 2026-06-11T09:05:08Z

@claude review

Disposition: must-fix resolved — codex round-9 P2 (execution-time in-flight-transfer guard). Doc-only, Status: Proposed. Pushed 14222f27.

Root cause confirmed against the tree. handleTransferLeadership (internal/raftengine/etcd/engine.go:1743-1771) checks only leader-state and the post-TransferLeader accept (:1765-1768); it has no in-flight-transfer guard. Verified in vendored go.etcd.io/raft/v3@v3.6.0 raft.go:1631-1662: a second MsgTransferLeader to a different target calls r.abortLeaderTransfer() and restarts to the new target (r.leadTransferee = leadTransferee) — it does not reject. So after the abort BasicStatus().LeadTransferee == target.NodeID, the :1765 != check is false, no error is returned, and the first caller's waitForLeadershipTransfer returns errLeadershipTransferAborted (:1424/:1435-1436). Both round-8 claims — "would hit errLeadershipTransferRejected" and "in case 2 the remote leader enforces it at submit time" — were false. The decision-time Status().LeadTransferee skip is a no-op in case 2 (LeadTransferee is 0 on a follower, internal/raftengine/engine.go:78-82).

Fix (§3.5 'No transfer already in flight' bullet, rewritten): PR2 adds an execution-time guard in handleTransferLeadership that reads e.rawNode.BasicStatus().LeadTransferee before rawNode.TransferLeader and rejects non-zero with a new errLeadershipTransferInFlight. Race-free on the event loop with no new lock (the same read is already used at :1567/:1593), unlike the conf-change guard.
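
The ordering of the guard can be shown with a small Go model. `fakeRaft` is a hypothetical stand-in that reproduces only the one behaviour discussed here (a second transfer request replaces the first target); `handleTransfer` and the error values are illustrative names, not the engine's.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotLeader        = errors.New("not leader")
	errTransferInFlight = errors.New("leadership transfer already in flight")
)

// fakeRaft models only the behaviour discussed above: a second transfer
// request aborts the first and restarts toward the new target.
type fakeRaft struct {
	isLeader       bool
	leadTransferee uint64
}

func (r *fakeRaft) TransferLeader(target uint64) { r.leadTransferee = target }

// handleTransfer checks for an in-flight transfer before asking raft to
// start a new one, so an existing transfer is never silently replaced.
func handleTransfer(r *fakeRaft, target uint64) error {
	if !r.isLeader {
		return errNotLeader
	}
	if r.leadTransferee != 0 {
		return errTransferInFlight
	}
	r.TransferLeader(target)
	return nil
}

func main() {
	r := &fakeRaft{isLeader: true}
	fmt.Println(handleTransfer(r, 2)) // <nil>: first transfer starts
	fmt.Println(handleTransfer(r, 3)) // rejected: transfer to 2 in flight
	fmt.Println(r.leadTransferee)     // 2: the first transfer was not aborted
}
```

Without the guard, the second call would reach `TransferLeader(3)` and the model's `leadTransferee` would become 3, which is the silent cancel the fix is meant to prevent.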

Scope decision (stated + justified): unconditional — applies to BOTH the gated balancer path and the ungated cmd/raftadmin operator path, mirroring the conf-change guard's asymmetry vs. the gated-keyed catch-up gate. Justification: silently canceling a peer's in-flight transfer is never legitimate; an operator's intentional override is the kill-switch + manual-retry path, not abort-by-force; unconditional also closes the case-1 TOCTOU window. The rejected gated-only alternative would still let cmd/raftadmin silently cancel a balancer transfer.

Consistent edits: §3.5 bullet (false claims corrected + decision-time/execution-time split + scope justification), §3.5 case-2 sub-case, skip-metric reason transfer_in_flight (already enumerated), §4 PR2 scope row + in-flight rejection test case (asserts guard fires before TransferLeader, first transfer not aborted, both methods, + companion documenting the v3.6 abort-and-restart), §6 concurrency lens, new OQ-15 (placement + unconditional-vs-gated scope).

Five-lens (doc-only): data-loss — none (no FSM/write change). Concurrency — the change closes a silent-abort race; guard is event-loop-local, no new lock. Performance — one extra BasicStatus() field read already on the loop. Consistency — no semantic change to MVCC/HLC/routing. Test coverage — §4 PR2 row now carries the in-flight rejection test case for implementation.

claude · 2026-06-11T09:05:27Z

Claude finished @bootjp's task in 3m 5s

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 10 — `14222f2`)

Gather context and understand the request
Read the updated design document (14222f2)
Verify code citations for the in-flight-transfer guard additions
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept — no must-fix items. The round-9 changes correctly and completely address codex P2 (execution-time in-flight-transfer guard). All code citations check out, the root-cause analysis is accurate, and no new issues were introduced. The document is ready for implementation.

Verification of round-9 claims (all confirmed)

Root-cause claim — handleTransferLeadership has no in-flight guard today. Confirmed at internal/raftengine/etcd/engine.go:1743-1771: the handler checks only RaftState != StateLeader (:1754-1757), calls rawNode.TransferLeader (:1758), then checks BasicStatus().LeadTransferee != target.NodeID (:1765) — no pre-rawNode.TransferLeader guard for an existing in-flight transfer.

Root-cause claim — errLeadershipTransferRejected does not fire after an abort-and-restart. Confirmed: the error constant is "etcd raft leadership transfer was rejected by raft (target is not a voter)" (:122). After raft aborts the transfer to A and restarts to B, BasicStatus().LeadTransferee == B == target.NodeID, so the :1765 check (LeadTransferee != target.NodeID) is false — no error is returned. errLeadershipTransferRejected fires only on a full raft-level drop (learner/self/no-Progress entry). ✓

Root-cause claim — waitForLeadershipTransfer returns errLeadershipTransferAborted when the transferee changes. Confirmed at :1431-1436: the function sets *sawPending = true when LeadTransferee == target.NodeID and returns errLeadershipTransferAborted the next time it sees LeadTransferee != target.NodeID. A silent abort-and-restart to B would fire this path on the first caller's poll. ✓

Proposed guard — race-free on the event loop. Confirmed: e.rawNode.BasicStatus().LeadTransferee != 0 is already used race-free on the event loop at :1567 (proposals) and :1593 (reads). Unlike pendingConfigs (mutated from off the loop by cancelPendingConfig, requiring e.pending.Lock()), rawNode.BasicStatus() reads raft state that is mutated only by the single-threaded event loop — no new lock is needed. ✓

Naming consistency. The proposed errLeadershipTransferInFlight fits cleanly into the existing error namespace (errLeadershipTransferAborted, errLeadershipTransferRejected, errLeadershipTransferNotLeader, errLeadershipTransferInProgress). None of the existing constants overlap with the new one's semantics. ✓

One minor observation (no doc change required — implementation awareness only)

Same-target edge case: the unconditional guard over-rejects when the in-flight transfer is already to the chosen target.

If LeadTransferee == B (a transfer to B is in flight), and the balancer computes B as the best target for the same group again, the guard (LeadTransferee != 0) rejects the same-target request. etcd/raft would have handled this gracefully (raft.go:1639-1642 returns early without aborting when lastLeadTransferee == leadTransferee), but the guard prevents that. The result is conservative: the balancer counts it as a transfer_in_flight skip, the original transfer to B completes normally, and the balancer observes a stabilized state next cycle (global cooldown covers it). This is correct and acceptable behavior; it just means the same-target case is an unnecessary skip rather than a no-op. No doc change is needed: the design correctly prioritizes safety over throughput here, and the "no legitimate reason to silently cancel a peer's in-flight transfer" rationale still holds for the general unconditional case.

What's well done in this revision

Root-cause precision. Both false round-8 claims are corrected with specific evidence:

"would hit errLeadershipTransferRejected" → now correctly states the :1765 check is false after abort-and-restart (B == target.NodeID)
"in case 2 the remote leader enforces it at submit time" → now correctly states Status().LeadTransferee is always 0 on a follower (internal/raftengine/engine.go:78-82) so the decision-time skip is a no-op in case 2

Two-enforcement-point pattern is consistent with the conf-change guard. Decision-time observability (authoritative for case 1, no-op for case 2) + execution-time guard on the group leader (required for case 2). The structural analogy between the two guards is explicit and makes the design easy to audit.

Unconditional scope justification is complete. The asymmetry table is now clear: catch-up/liveness gate is gated-method-keyed (because there IS a legitimate operator reason to force-transfer onto a lagging follower — disaster recovery); conf-change guard and in-flight-transfer guard are unconditional (because there is NO legitimate reason to force a transfer into either condition). Three cases, two rule tiers, zero exceptions.

Test plan for the guard is comprehensive (§4 PR2): the in-flight rejection test case (i) asserts the guard fires before rawNode.TransferLeader, (ii) asserts the first transfer is not aborted (LeadTransferee stays on A), (iii) covers both gated and ungated methods, and (iv) includes a companion case documenting the etcd/raft abort-and-restart behavior without the guard. The test design correctly documents why errLeadershipTransferRejected is not the mechanism (B == target.NodeID after the abort). ✓

OQ-15 is resolved. The placement (inside handleTransferLeadership, co-located with conf-change + catch-up guards) and scope (unconditional) are decided with full justification. The rejected alternative (gate only gated = true) correctly identifies why it would still leave the balancer's transfers cancelable by a racing operator.

All prior-round fixes remain intact: saturating-threshold predicate (both underflow directions, both subtraction forms documented), voter-map zero-seed, conf-change guard with e.pending.Lock(), LastContact-always-0-on-leader correction, TransferLeadershipToServerIfEligible engine method, OQ-4 resolved (bool gated = 3 + uint64 max_lag = 4), mixed-version rollout rule, --leaderBalanceMaxTargetLag = 0 footgun fix (routing on gated, not on max_lag == 0), SQS exclusion in PR2, source fall-through policy (§3.3), leaderbalance_unobservable_groups companion gauge (§3.6).

The document is in a ready-for-implementation state. All OQs are either resolved with a concrete recommendation or explicitly deferred to a later PR with justification.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 14222f2744

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-11T09:07:06Z

+- **Default group is balanceable, with a called-out blip.** The default group is *not* excluded — leaving it pinned to one node defeats half the point (that node also carries HLC/catalog work). But moving the default group's leadership has two transient effects the operator must understand:
+  - **HLC ceiling renewal** restarts on the new leader. The ceiling is proposed every `hlcRenewalInterval` (1 s) with a `hlcPhysicalWindowMs` (3 s) window (`kv/coordinator.go:37-46`, `:644-669`); because the window (3 s) exceeds the renewal interval (1 s) and a newly elected leader clamps `Next()` to `max(wall, ceiling)`, a clean transfer does **not** let the new leader issue a timestamp inside the old leader's window — the safety invariant holds across a transfer exactly as it does across a natural election.
+  - **Lease reads:** a transfer invalidates the lease on the old leader (`RegisterLeaderLossCallback → lease.invalidate`, `kv/coordinator.go:131`, `kv/sharded_coordinator.go:584`), so the new leader's *first* read takes the slow `LinearizableRead` path (one ReadIndex round-trip) before its lease warms again — a single read-latency blip, not a correctness issue. This is identical to the blip on any natural election. The cooldowns keep these blips rare.
+  - **Recommendation (resolving OQ-3): balance the default group LAST.** The default group is only added to the eligible-source set for a cycle when **every other group's source node is already at or below `⌈groups / nodes⌉`** — i.e. the cheap, side-effect-free moves are exhausted first and the default group is touched only if it is still the thing keeping the cluster imbalanced. This defers the two transient effects above (HLC-renewal restart + one lease-read blip, plus the recursive balancer-host churn risk of §3.1, since the balancer rides the default-group leader) until they are the *only* remaining way to reduce the spread. An operator who wants the default group pinned entirely uses `--leaderBalancePinGroups` (§3.6) to exclude it. This is strictly safer than "balance first" and strictly more useful than "pin by default" (which would leave the HLC/catalog node permanently over-loaded). See §8 OQ-3.


Allow default group after ineligible groups block progress

With this rule, a cluster can remain permanently imbalanced when the non-default groups on the overloaded node are pinned/refused/in cooldown/mid-conf-change but the default group is the only eligible move. For example, if A leads the default group plus several pinned non-default groups and B/C lead none, “every other group's source node” is still above ceil, so the default group is never admitted even though moving it would strictly reduce the spread. The “balance last” policy should trigger after there is no eligible non-default reducing move, not based only on the other groups’ current source counts.


Fixed in commit 67aeb05. §3.5 line 174 is now: pass 1 considers non-default groups only; pass 2 admits the default group only if pass 1 found no eligible reducing move. So the worked example (A={default, P1..P4 all pinned}, B=∅, C=∅, ceil=2) now issues a default-group transfer instead of deadlocking — eligibility-based admission, not a raw-count gate that pinned groups can never satisfy. §3.3 cross-reference updated to point at the two-pass shape; the inner deterministic-iteration semantics, the strict-spread-reducing guard, and "one transfer per cycle" are unchanged.
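The two-pass shape can be sketched as follows. This is a simplified model under stated assumptions, not the design's decision function: it ignores the imbalance threshold, cooldowns, voter membership, and health, treats "pinned" as the only ineligibility, uses the source/target guard quoted in the thread at this commit, and all type and function names (`group`, `move`, `decide`, `fewest`) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type group struct {
	id        int
	leader    string
	pinned    bool
	isDefault bool
}

type move struct {
	groupID int
	from    string
	to      string
}

// decide returns at most one move. Pass 1 considers non-default groups only;
// pass 2 also admits the default group and is reached only if pass 1 found nothing.
func decide(nodes []string, groups []group) (move, bool) {
	counts := make(map[string]int, len(nodes))
	for _, n := range nodes {
		counts[n] = 0 // seed every node at zero so leaderless nodes are visible
	}
	for _, g := range groups {
		counts[g.leader]++
	}
	// Sources in descending leader count, ties broken by smallest node ID.
	sources := append([]string(nil), nodes...)
	sort.Slice(sources, func(i, j int) bool {
		if counts[sources[i]] != counts[sources[j]] {
			return counts[sources[i]] > counts[sources[j]]
		}
		return sources[i] < sources[j]
	})
	byID := append([]group(nil), groups...)
	sort.Slice(byID, func(i, j int) bool { return byID[i].id < byID[j].id })

	for _, allowDefault := range []bool{false, true} {
		for _, s := range sources {
			for _, g := range byID {
				if g.leader != s || g.pinned || (g.isDefault && !allowDefault) {
					continue
				}
				t, ok := fewest(nodes, counts, s)
				// Accept only when the guard "targetCount+1 > sourceCount-1" does not reject.
				if ok && counts[t]+1 <= counts[s]-1 {
					return move{groupID: g.id, from: s, to: t}, true
				}
			}
		}
	}
	return move{}, false
}

// fewest picks the node other than source with the fewest leaders,
// ties broken by smallest node ID.
func fewest(nodes []string, counts map[string]int, source string) (string, bool) {
	best, found := "", false
	for _, n := range nodes {
		if n == source {
			continue
		}
		if !found || counts[n] < counts[best] || (counts[n] == counts[best] && n < best) {
			best, found = n, true
		}
	}
	return best, found
}

func main() {
	// Worked example from the thread: A leads default plus four pinned groups.
	groups := []group{{id: 0, leader: "A", isDefault: true}}
	for id := 1; id <= 4; id++ {
		groups = append(groups, group{id: id, leader: "A", pinned: true})
	}
	fmt.Println(decide([]string{"A", "B", "C"}, groups))
}
```

With the worked example, pass 1 finds nothing and pass 2 moves the default group (ID 0 in this sketch) from A to B.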

…w counts

Codex round-7 P2 on PR #953 (review 4475297729, line 174): the round-6 admission rule for the default group ("every other group's source node is at or below ceil") is computed on raw leader counts, which include groups that are categorically ineligible to move this cycle (pinned via --leaderBalancePinGroups, refusing leadership via the SQS htfifo exclusion, in per-group cooldown, in a conf-change, or with a transfer already in flight). When the over-loaded node leads a mix of "default + pinned non-default", the raw counts never drop and the default group is never admitted even though it is the only remaining eligible reducing move: the cluster deadlocks under heavy pinning.

Replace the raw-count gate with a two-pass iteration of §3.3:

pass 1: non-default candidates only
pass 2: default group admitted, but only if pass 1 found nothing

This preserves "balance last" (default is touched only after the non-default candidate set is exhausted this cycle) while making the gate progress-preserving under arbitrary pinning patterns. The worked example A={default,P1..P4 all pinned}, B=∅, C=∅, ceil=2 now correctly issues a default-group transfer instead of skipping. §3.3 cross-reference updated to point at the new two-pass shape; the inner deterministic-iteration semantics, the strict-spread-reducing guard, and "one transfer per cycle" are unchanged.

No code change required (doc-only, this is a proposal).

bootjp · 2026-06-11T17:01:26Z

@claude review — addressed PR #953 round 7 review (3 findings, all from HEAD-commit 14222f2):

gemini medium line 73 (transfer storms on default-group flap) — already covered by §3.1 line 75 ("Startup grace period"), pointed out in inline reply 3397718855.
codex P1 line 79 (seed zero-leader voters) — already covered by §3.2 line 81 ("Seed the count map with every voter at zero"), inline reply 3397718968.
codex P2 line 174 (default group balance-last vs. eligibility deadlock) — fixed in commit 67aeb05: §3.5 now does a two-pass iteration (pass 1 non-default; pass 2 default if pass 1 found no eligible move) so heavy pinning no longer wedges the cluster, and §3.3 cross-references the new shape. Inline reply 3397719092.

claude · 2026-06-11T17:01:46Z

Claude finished @bootjp's task in 2m 52s

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 11 — `67aeb05`)

Gather context and understand the request
Read the updated design document (67aeb05 — two-pass default-group eligibility)
Verify codex P2 line 174 fix — two-pass iteration correctness
Verify round-10 prior findings remain intact
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept with two targeted corrections. The two-pass default-group admission fix (67aeb05) is architecturally correct and resolves the codex P2 round-7 deadlock case. All prior-round findings remain intact. Two documentation consistency issues were introduced by this revision: the no_eligible_move skip reason is referenced in §3.3/§3.5 but absent from the §3.6 metric enum, and §8 OQ-3 still describes the superseded raw-count gate rather than the new eligibility-based rule. Both are straightforward text corrections.

Should fix before PR2 starts

1. leaderbalance_skipped_total enum in §3.6 is missing no_eligible_move

§3.5 line 176 explicitly states: "issue no transfer (counted under leaderbalance_skipped_total{reason=no_eligible_move})" — the label used when both passes exhaust the eligible source set without finding a strict-spread-reducing move. §3.3 line 94 also routes this outcome to "the appropriate skip metric" at §3.6. But §3.6's leaderbalance_skipped_total{reason} enum enumerates:

{below_threshold, cooldown, conf_change, transfer_in_flight, no_healthy_target, pinned, sqs_refused, in_cooldown_global, startup_grace, peer_ungated}

no_eligible_move is absent. An implementer following the §3.6 spec literally would either omit the label (gap in observability) or invent a name that doesn't match the §3.5 description. Add no_eligible_move to the §3.6 enum alongside the existing entries. (If the preferred label for this case is below_threshold, note that distinction explicitly, since "no eligible move" is semantically distinct from "spread is below the imbalance threshold": the former means imbalance ≥ 2 but all sources are pinned/ineligible, the latter means imbalance < 2.)

2. §8 OQ-3 still describes the superseded raw-count gate

The §3.5 two-pass fix (67aeb05) changed the default-group admission rule from "include it once every other group's source node is at or below ⌈groups/nodes⌉" (raw-count gate) to "run pass 2 only when pass 1 finds no eligible non-default reducing move" (eligibility gate). §3.5 lines 174–178 and §3.3 line 94 were updated, but §8 OQ-3 (line 241) still contains the old description:

"include it in the eligible-source set only once every other group's source node is at or below ⌈groups / nodes⌉"

This is the rule that was replaced, and the motivation for replacing it is exactly the pinned-group deadlock §3.5 now calls out. Update OQ-3 to say "include the default group only when pass 1 (non-default groups only, §3.5) found no eligible reducing move this cycle", matching the §3.5 language. OQ-3 is flagged "Confirm before PR2," so having it describe the wrong gate risks an implementer building PR2 against a stale spec.

Minor note (no doc change required — implementation awareness only)

3. §5 test strategy has no explicit unit case for the two-pass default-group admission

The §3.5 worked example (A = {default, P1, P2, P3, P4} all pinned, B = ∅, C = ∅ → pass 2 moves default to B) appears only in the §3.5 prose, not in the §5 unit test table. The source-fall-through cases in §5 cover the B=4 → C scenario from codex round-6, but not the two-pass shape. This is not a must-fix (the convergence integration test will exercise it in practice), but adding "A={default, P1..P4 all pinned}, B=∅ → pass 1 finds nothing, pass 2 issues default → B" as an explicit named test case in §5 would lock down the new rule and prevent a regression where pass 2 is accidentally skipped when all sources on non-default groups are ineligible.

Verification of `67aeb05` two-pass fix

The architectural correctness of the two-pass rule was verified by tracing the §3.3/§3.5 interaction:

Orthogonality: The §3.3 descending-source-count iteration with source fall-through is unchanged. The two-pass wrapper (§3.5) just restricts the eligible group set on pass 1. The source order and tie-breaks are identical in both passes; only the eligibility predicate for group_id == defaultGroupID differs.
Single-transfer preservation: Pass 1 returns as soon as it finds one eligible move (one transfer per cycle is preserved). Pass 2 only executes when pass 1 returns "no decision" — so the wrapper can issue at most one transfer total across both passes.
Deadlock resolution: The worked example is correctly handled: A={default, P1..P4 pinned}, B=∅, C=∅. Pass 1 tries A (pinned non-default groups, no eligible move), B (∅ — no groups to consider as source), C (∅). Pass 1 returns no decision. Pass 2 admits the default group: A leads default and default is eligible. A → default → B passes the strict-spread-decrease guard. Transfer issued.
"Balance last" semantics preserved: In the steady state where non-default groups are available (un-pinned, not in cooldown), pass 1 always finds an eligible move and pass 2 never runs. HLC-renewal restart + lease-read blip are still deferred.

All prior-round findings remain intact (underflow-safe saturating threshold in both directions, e.pending.Lock() for conf-change guard, errLeadershipTransferInFlight unconditional guard, bool gated = 3 / uint64 max_lag = 4 proto extension, voter-map zero-seed, LeadTransferee follower-zero issue, Progress.RecentActive replacing LastContact, TransferLeadershipToServerIfEligible engine method, SQS exclusion in PR2, source fall-through policy, mixed-version rollout rule).


…-3 rule update

Claude round-10 review on PR #953 (commit 67aeb05): two doc consistency issues introduced by the two-pass default-group admission fix:

1. §3.6 leaderbalance_skipped_total{reason} enum was missing no_eligible_move, the label §3.5 line 176 emits when both passes exhaust the eligible source set without finding a strict-spread-reducing move (imbalance >= threshold but every candidate is pinned / in cooldown / mid-conf-change / unhealthy / refused). Distinct from below_threshold (spread < imbalanceThreshold). Adding to the enum so the metric registration test catches it and operators can configure dashboards against the complete enumerated list.
2. §8 OQ-3 still described the superseded raw-count gate; updated to the eligibility-based two-pass rule that 67aeb05 introduced, with a backpointer to the round-7 P2 motivation (raw counts can be permanently above ceil under pinning, deadlocking default-group admission).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5e4e779d97


chatgpt-codex-connector · 2026-06-11T17:40:16Z

+- **Imbalance trigger (hysteresis):** act only when `max(count) - min(count) >= imbalanceThreshold` (default **2**). A spread of 1 is the unavoidable remainder when `groups` is not divisible by `nodes` and must never trigger a transfer — otherwise the scheduler ping-pongs forever against the arithmetic. A threshold of 2 is the smallest value that is provably stable at the optimum.
+- **Source / group / target choice (deterministic, with fall-through to the next eligible source — resolves codex round-6 P2 "fall back to the next eligible source node"):** the policy does **not** stop at the single most-loaded node. It **iterates candidate source nodes in descending leader-count order** (tie-break: lexicographically smallest node ID, so the order is reproducible in tests) and, for each, looks for an eligible `(group, target)` pair that **strictly reduces the spread**; it takes the **first** such pair and stops. This is the fix for the case where the absolute-max node has *no* transferable group — all its leaderships are pinned (§3.5), in per-group cooldown (§3.3), or mid-conf-change (§3.5) — yet the **second**-most-loaded node has an eligible move that still reduces the spread. The earlier "pick the single max node and skip the cycle if it has no eligible group" rule wrongly left the cluster imbalanced in exactly that case (e.g. `A=5` all-pinned, `B=4`, `C=0` → moving a `B`-led group to `C` drops the spread from 5 to 4, but the max-only rule never considers `B`). Concretely, for each candidate source `s` taken in descending-count order:
+  - **Group to move** = among the groups led by `s` that are *eligible* (§3.5: not pinned, not in per-group cooldown, no conf-change in flight, no transfer already in flight), pick deterministically (tie-break by group ID ascending).
+  - **Target node** = among nodes that are **voters of that group**, **healthy** (§3.5), and **not policy-excluded for that group** (§3.5), the one with the **fewest** leaders (tie-break: smallest node ID). The candidate move is accepted only if it **strictly reduces the spread** (`targetCount + 1 > sourceCount - 1` ⇒ reject), so a transfer never makes things worse or merely shuffles a tie.


Compute the post-move global spread

When there are multiple least-loaded nodes, this source/target-only predicate can accept a move that does not strictly reduce the cluster spread. For example, with counts A=5 (no eligible moves), B=4, C=0, D=0, moving one B group to C passes this check (1 > 3 is false) but leaves D at 0, so the global spread remains 5 instead of decreasing. Since the policy and tests rely on the “strictly reduces the spread” invariant, the decision function should compare the full leader-count map before/after the candidate move, not only sourceCount and targetCount.


Fixed in commit 99d8ea5. §3.3 now requires strict reduction of the GLOBAL spread, computed from the full post-move count map (m' = m with [source]--, [target]++), not just source/target counts. The worked example A=5 (pinned), B=4, C=0, D=0 is now correctly rejected — the post-move map {A=5, B=3, C=1, D=0} still has spread=5 (A-D). The §5 convergence test gains this exact regression case.
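A minimal sketch of the full-map check, assuming leader counts are held in a plain map keyed by node ID. The function names `spread` and `reducesGlobalSpread` are hypothetical; the rule itself (build m' with [source]-- and [target]++, accept only if the spread strictly drops) is the one stated above.

```go
package main

import "fmt"

// spread returns max(count) - min(count) over all nodes, or 0 for an empty map.
func spread(counts map[string]int) int {
	first := true
	lo, hi := 0, 0
	for _, c := range counts {
		if first {
			lo, hi, first = c, c, false
			continue
		}
		if c < lo {
			lo = c
		}
		if c > hi {
			hi = c
		}
	}
	return hi - lo
}

// reducesGlobalSpread applies the candidate move to a copy of the map and
// accepts it only if the spread of the whole map strictly decreases.
func reducesGlobalSpread(counts map[string]int, source, target string) bool {
	after := make(map[string]int, len(counts))
	for n, c := range counts {
		after[n] = c
	}
	after[source]--
	after[target]++
	return spread(after) < spread(counts)
}

func main() {
	// Four nodes: D stays at 0, so moving B -> C leaves the spread at 5.
	four := map[string]int{"A": 5, "B": 4, "C": 0, "D": 0}
	fmt.Println("4-node B->C accepted:", reducesGlobalSpread(four, "B", "C"))

	// Three nodes: the same move drops the spread from 5 to 4.
	three := map[string]int{"A": 5, "B": 4, "C": 0}
	fmt.Println("3-node B->C accepted:", reducesGlobalSpread(three, "B", "C"))
}
```

The copy keeps the check side-effect free, so the caller's count map is unchanged whether or not the move is accepted.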

bootjp · 2026-06-11T17:40:45Z

@claude review — round 10 (2 corrections + minor note):

Issue 1: §3.6 enum missing no_eligible_move — fixed in commit 5e4e779. Added to the leaderbalance_skipped_total{reason} enum with an explicit clarification that it is semantically distinct from below_threshold (the former: imbalance ≥ threshold but every candidate is pinned / in cooldown / mid-conf-change / unhealthy / refused; the latter: spread < imbalanceThreshold so no transfer is considered).
Issue 2: §8 OQ-3 still described the superseded raw-count gate — fixed in commit 5e4e779. Updated to the eligibility-based two-pass rule from §3.5 (pass 1 = non-default only; pass 2 admits default only if pass 1 found no eligible reducing move), with a backpointer to the round-7 P2 motivation (pinned non-default groups can keep raw counts above ceil indefinitely → deadlock).
Minor note 3: §5 unit case for two-pass admission — noted; will add to the §5 test table in the PR2 commit that lands the convergence test. Not blocking for the proposal merge.

claude · 2026-06-11T17:41:09Z

Claude finished @bootjp's task in 4m 21s

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 12 — `5e4e779`)

Gather context and understand the request
Read the updated design document (5e4e779)
Verify round-11 fixes: no_eligible_move enum addition and OQ-3 update
Check new codex P2 finding at commit 5e4e779
Post comprehensive review

Overall: Accept with one targeted correction. Both round-11 fixes are correctly applied. The new codex P2 finding ("Compute the post-move global spread") is valid but narrower than it appears: the guard is architecturally sound and the balancer converges correctly. However, the property test claim is stated too strongly and must be corrected before PR2 writes a test against it.

Verification of round-11 fixes (both confirmed)

no_eligible_move in §3.6 enum (Issue 1): Confirmed present in the updated §3.6 (line 188): reason ∈ {below_threshold, cooldown, conf_change, transfer_in_flight, no_healthy_target, pinned, sqs_refused, in_cooldown_global, startup_grace, peer_ungated, no_eligible_move} with explicit clarification that it is semantically distinct from below_threshold (former: imbalance ≥ threshold but every candidate is pinned/ineligible; latter: spread < threshold so no transfer is considered). ✓
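The distinction between the two labels can be sketched in code. The label strings are copied from the enum quoted above; the `classify` helper and its parameters are hypothetical and cover only these two reasons.

```go
package main

import "fmt"

// skipReasons lists the reason label values from the §3.6 enum as quoted in this review.
var skipReasons = []string{
	"below_threshold", "cooldown", "conf_change", "transfer_in_flight",
	"no_healthy_target", "pinned", "sqs_refused", "in_cooldown_global",
	"startup_grace", "peer_ungated", "no_eligible_move",
}

// classify returns the skip reason for a cycle that issued no transfer,
// or "" when a move was found. Only the two labels discussed here are modeled.
func classify(spread, threshold int, foundMove bool) string {
	if spread < threshold {
		// The spread itself is acceptable, so no transfer is considered at all.
		return "below_threshold"
	}
	if !foundMove {
		// The cluster is imbalanced, but every candidate was blocked.
		return "no_eligible_move"
	}
	return ""
}

func main() {
	fmt.Println(len(skipReasons), "reasons")
	fmt.Println(classify(1, 2, false)) // spread is fine
	fmt.Println(classify(5, 2, false)) // spread is bad, nothing can move
}
```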

§8 OQ-3 updated to eligibility-based two-pass rule (Issue 2): Confirmed updated in line 241: the old raw-count gate ("every other group's source node is at or below ⌈groups / nodes⌉") has been replaced with the correct rule ("admit it to the eligible-source set only when the §3.5 first pass found no eligible reducing move this cycle"), with backpointer to the codex round-7 P2 motivation. ✓

Must fix before PR2 writes the property test

1. The rapid property invariant "a transfer always strictly decreases the spread" is false — and would cause false CI failures

§5 line 210 states: "a transfer always strictly decreases the spread (the §3.3 strict-spread guard means it never merely keeps or increases it)" — and the rapid property test is built on this claim.

The claim is false when ineligible nodes (pinned, SQS-refused, mid-conf-change) hold the global extremes. Consider the codex P2 counterexample: A=5 (all groups pinned), B=4, C=0, D=0. Global spread = 5.

The policy correctly falls through A (no eligible groups) and selects B→C.
Guard check: targetCount(0)+1=1 > sourceCount(4)-1=3 → 1>3 is false → accepted.
Post-move state: A=5, B=3, C=1, D=0. Global spread = max(5,3,1,0) - min(5,3,1,0) = 5 — unchanged.
A rapid test asserting new_spread < old_spread on this input fails — a false failure against correct behavior.

The guard's actual guarantee is weaker than claimed, but it is still sound: it prevents spread from increasing. Formal argument:

Global max never increases: after the move, source → source-1 and target → target+1. New max = max(old_max, source-1, target+1). Since source ≤ old_max and target+1 ≤ source ≤ old_max, new max ≤ old max. ✓
Global min never decreases: after the move, target → target+1. If target was the unique global min, new min = min(old_min+1, source-1, other nodes) ≥ old min. If target was not the unique global min (another node also has old min count), new min = old min. ✓
Therefore: global spread never increases. ✓ But it CAN stay the same (when the source is not the unique global max, or when a third node also holds the global min — D=0 in the example).
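The argument above can be checked by brute force on small inputs. The sketch below enumerates every count vector for a few nodes and tallies what the source/target-only guard does to the global spread; `checkGuard` and `spreadOf` are hypothetical names, and the bounds are arbitrary.

```go
package main

import "fmt"

// spreadOf returns max - min of a non-empty slice of leader counts.
func spreadOf(c []int) int {
	lo, hi := c[0], c[0]
	for _, v := range c[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	return hi - lo
}

// checkGuard enumerates all vectors of n counts in [0, maxCount] and, for
// every (source, target) pair the guard accepts, records whether the global
// spread increased, stayed the same, or decreased.
func checkGuard(n, maxCount int) (increased, unchanged, decreased int) {
	counts := make([]int, n)
	var rec func(i int)
	rec = func(i int) {
		if i < n {
			for v := 0; v <= maxCount; v++ {
				counts[i] = v
				rec(i + 1)
			}
			return
		}
		before := spreadOf(counts)
		for s := 0; s < n; s++ {
			for t := 0; t < n; t++ {
				if s == t || counts[t]+1 > counts[s]-1 {
					continue // the guard rejects this pair
				}
				counts[s]--
				counts[t]++
				after := spreadOf(counts)
				counts[s]++
				counts[t]--
				switch {
				case after > before:
					increased++
				case after == before:
					unchanged++
				default:
					decreased++
				}
			}
		}
	}
	rec(0)
	return increased, unchanged, decreased
}

func main() {
	inc, same, dec := checkGuard(4, 5)
	fmt.Println("increased:", inc, "unchanged:", same, "decreased:", dec)
}
```

An "increased" tally of zero is consistent with the non-increasing claim, and a non-zero "unchanged" tally shows why a strict `<` assertion would fail.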

Convergence is still correct. Tracing the full example:

A=5, B=4, C=0, D=0 → B→C (spread 5→5, guard passes: 1>3 false)
A=5, B=3, C=1, D=0 → B→D (spread 5→5, guard passes: 1>2 false)
A=5, B=2, C=1, D=1 → B→C? Guard: 1+1=2 > 2-1=1 → 2>1 TRUE → reject. B→D? Same. No eligible B source. Fall through: C=1 → B=2? targetCount(2)+1=3 > sourceCount(1)-1=0 → 3>0 TRUE → reject. No eligible move anywhere. Emit no_eligible_move. ✓

The balancer correctly reaches the achievable optimum {A=5, B=2, C=1, D=1} (the best distributable given A's ineligibility), then stops with no_eligible_move each subsequent cycle. No ping-pong.
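The trace above can be replayed with a small simulation. It is a simplified model: node-level ineligibility stands in for "all of A's groups are pinned", only the least-loaded node is tried as a target, the guard is the source/target-only form discussed in this round, and `step` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// step applies one accepted move in place and reports whether it found one.
// Sources are tried in descending leader count (ties: smallest node ID);
// nodes marked ineligible are never used as a source.
func step(counts map[string]int, ineligible map[string]bool) bool {
	nodes := make([]string, 0, len(counts))
	for n := range counts {
		nodes = append(nodes, n)
	}
	sort.Slice(nodes, func(i, j int) bool {
		if counts[nodes[i]] != counts[nodes[j]] {
			return counts[nodes[i]] > counts[nodes[j]]
		}
		return nodes[i] < nodes[j]
	})
	for _, s := range nodes {
		if ineligible[s] {
			continue
		}
		// Target: fewest leaders among the other nodes, ties by smallest ID.
		t := ""
		for _, n := range nodes {
			if n == s {
				continue
			}
			if t == "" || counts[n] < counts[t] || (counts[n] == counts[t] && n < t) {
				t = n
			}
		}
		if t == "" || counts[t]+1 > counts[s]-1 {
			continue // guard rejects; fall through to the next source
		}
		counts[s]--
		counts[t]++
		return true
	}
	return false
}

func main() {
	counts := map[string]int{"A": 5, "B": 4, "C": 0, "D": 0}
	pinned := map[string]bool{"A": true}
	moves := 0
	for step(counts, pinned) {
		moves++
	}
	fmt.Println("moves:", moves, "final:", counts)
}
```

The loop terminates because every accepted move brings two counts closer together, so the same state cannot recur.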

Fix: Correct the property test claim in §5 and the §3.3 description from "strictly decreases the spread" to the actual guarantee:

"a transfer never increases the global spread (global max never increases after a valid move; global min never decreases — so spread is monotonically non-increasing under any sequence of valid moves). In the presence of ineligible nodes holding the global extremes (pinned groups, SQS-refused groups), a valid transfer may leave global spread unchanged while making progress toward the achievable optimum for the eligible subset. This is correct behavior: the balancer reports leaderbalance_skipped_total{reason=no_eligible_move} once the eligible subset is balanced, rather than thrashing. The anti-ping-pong guarantee still holds: spread is non-increasing, so the balancer cannot oscillate back to a worse state."

The rapid property test should assert new_spread <= old_spread (not <), and §5's engine-test list should add the counterexample as an explicit named case: A=5 (ineligible), B=4, C=0, D=0 → B→C gives new_spread = old_spread = 5, which is accepted, not a failure.

What's well done in this revision

no_eligible_move distinction from below_threshold: The parenthetical clarification "(semantically distinct from below_threshold which means the spread itself is < imbalanceThreshold so no transfer is even considered)" is exactly right. Without this, operators debugging "why isn't the balancer balancing?" would not know whether they're looking at "spread is fine" vs. "spread is bad but every candidate is blocked."

OQ-3 update: The replacement of the raw-count gate with the eligibility-based two-pass description precisely matches the §3.5 semantics and correctly explains the motivation (pinned groups falsify the raw count). The round-7 P2 backpointer makes the design history traceable.

All prior-round findings remain intact: The e.pending.Lock() requirement, saturating-threshold in both underflow directions, bool gated/uint64 max_lag proto split, in-flight-transfer guard (unconditional), voter-map zero-seed, errLeadershipTransferInFlight, source fall-through, leaderbalance_unobservable_groups gauge, OQ-4/OQ-12/OQ-13/OQ-14/OQ-15 resolutions — all still correctly stated. The document remains in a ready-for-implementation state for PR1 and (post-PR0) PR2, modulo this one property-test claim correction.


Codex round-8 P2 on PR #953 (commit 5e4e779, line 93): the source/target-only spread check 'targetCount + 1 > sourceCount - 1' misses cases where a third node holds the min or max of the leader-count map. Example: A=5 (pinned), B=4, C=0, D=0. Moving a B-led group to C is admitted by the source/target-only rule (1 > 3 is false), but the post-move map {A=5, B=3, C=1, D=0} still has spread = 5 (A - D), unchanged. The policy claims to strictly reduce the spread but doesn't.

Fix: compute the full post-move count map and reject when spread(m') >= spread(m). The implementation cost is trivial (one slice copy plus two integer mutations on the same leaderCountMap the policy already builds in §3.2), but it correctly catches the class of cases where a non-source / non-target node holds the spread's min or max, which is exactly the regression the §5 convergence test now exercises as A=5 pinned, B=4, C=D=0.

bootjp · 2026-06-11T17:50:16Z

@claude review — round 8 (1 codex P2 on commit 5e4e779):

line 93 (post-move global spread) — fixed in 99d8ea5. §3.3 now compares the full post-move count map's spread, catching the regression where a third node holds min/max. §5 convergence test adds the A=5 pinned, B=4, C=D=0 case as regression.

claude · 2026-06-11T17:50:37Z

Claude finished @bootjp's task in 6m 10s —— View job

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 13 — `99d8ea5`)

Gather context and understand the request
Read the updated design document (99d8ea5)
Verify the global-spread fix (§3.3 and §5)
Verify arithmetic on both the 4-node and 3-node examples
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept with one targeted correction. The round-8 full-map spread check is architecturally correct and properly closes the codex P2 concern. The spread(m') >= spread(m) gate means any issued transfer provably decreases the global spread — including the 4-node A=5-pinned, B=4, C=0, D=0 case where the old source/target-only guard admitted a spread-unchanged B→C move. One documentation error was introduced: the §3.3 parenthetical says "the policy must instead pick D as the target," but B→D also fails the full-map check (post-move spread stays 5 because C remains at 0), and the correct expected outcome for this scenario is no_eligible_move — not B→D.

Must fix before PR2 starts

1. §3.3 parenthetical "the policy must instead pick D as the target" is wrong — D also fails the full-map check

The 4-node example in §3.3 (line 93) reads:

"The full-map rule rejects this move because it does not strictly reduce the spread (the policy must instead pick D as the target — both C and D are tied for min, but the deterministic tie-break by node ID picks whichever sorts smaller, and the strict-spread check then catches that the other tied min is the regression)."

Verified arithmetic:

| Candidate | Pre-spread | Post-move map | Post-spread | Accepted? |
|---|---|---|---|---|
| B → C | 5 (A=5 − D=0) | {A=5, B=3, C=1, D=0} | 5 (A=5 − D=0) | Rejected |
| B → D | 5 (A=5 − D=0) | {A=5, B=3, C=0, D=1} | 5 (A=5 − C=0) | Also rejected |

Both candidates fail the full-map check. The phrase "must instead pick D as the target" incorrectly implies D is the correct, chosen answer. An implementer following this literally would return B→D — a transfer the full-map rule itself forbids, because C remains at 0 and keeps the global minimum pinned at 0.

Correct behavior: the algorithm iterates targets in ascending node-ID order, tries C (rejected — D at 0 keeps global min), then tries D (also rejected — C at 0 keeps global min), and falls through to no_eligible_move for source B. No transfer is issued this cycle.

Replace the parenthetical with one that states the expected outcome unambiguously:

"(Both candidates fail: B→D gives {A=5, B=3, C=0, D=1}, spread = 5 — C stays at 0, keeping the global minimum unchanged. B has no eligible target in this scenario; the iteration falls through to no_eligible_move. The two-tied-minimum shape is the case the full-map check exists for: with a pinned node holding the global max and two nodes at 0, no single transfer can reduce the global spread — the scheduler correctly defers until conditions change or the default-group pass (§3.5) offers an alternative.)"

Also add an explicit unit test entry to §5 for this shape:

"4-node case — both tied-min targets fail (codex round-8 P2): A=5 (all A-led groups pinned), B=4, C=0, D=0, an eligible B-led group whose voter set includes C and D → try B→C (spread stays 5, rejected), try B→D (spread stays 5, rejected), no eligible B target → policy emits no_eligible_move. Contrast with the 3-node shape A=5-pinned, B=4, C=0 (same test at §5 line 209): B→C gives {A=5, B=3, C=1}, spread = 4, which IS accepted — the 3-node case has no D at 0 to keep the global min pinned."

The 3-node A=5, B=4, C=0 → B→C: spread 5→4 case remains correct and tests a different (valid-transfer) shape. The 4-node case tests the no_eligible_move branch and is the critical regression introduced by the round-8 change.

What's well done in this revision

Full-map check is correctly specified. The rule, reject when spread(m') >= spread(m) with spread = max(m) - min(m) computed over the full post-move count map m', is exactly the right gate. Any transfer that passes has spread(m') < spread(m), a strict decrease from the pre-move spread by construction. The guard costs O(nodes) per candidate (one slice copy plus two integer mutations on the map built in §3.2) and needs no new data structure.

§5 property test "a transfer always strictly decreases the spread" is now correct. The round-12 review recommended changing this to "never increases" because the old source/target-only guard could admit spread-unchanged moves. With the full-map guard, that concern is resolved: the policy ONLY issues a transfer when spread(m') < spread(m), so any issued transfer strictly decreases global spread. The stronger "strictly decreases" claim holds, and the property test need not be weakened.

The 3-node source-fallthrough case in §5 (line 209) is still correct. A=5(pinned), B=4, C=0 (3 nodes) → B→C: post-move {A=5, B=3, C=1} → spread = 5 − 1 = 4. Strict decrease. Admitted. ✓ This is the scenario where the source-fallthrough (round-6) + full-map check (round-8) interact correctly: A has no eligible groups (pinned), B is the next source, and B→C passes the full-map check because there is no 4th node holding the global minimum at 0.

Companion case (c) in §5 ("A=5 all-pinned, B=4 eligible but its only candidate target D=4 would not strictly reduce the spread → fall through past B") correctly exercises the case where the single target is at the same count as the source. This is distinct from the 4-node tied-minimum case (two nodes at 0, both failing the gate) — together they cover the two ways B can have no eligible target.

The no_eligible_move metric label (§3.6, line 188) is correctly distinguished from below_threshold — the former fires when imbalance ≥ threshold but no strict-reducing move exists (e.g. the 4-node pinned-max case); the latter fires when the spread itself is below the threshold and no transfer is even considered. The label round-11 added is the right observability tool here.
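The distinction between the two labels can be sketched as a three-way classification. The threshold value of 2 comes from the PR summary ("act only when the max-minus-min leader count is >= 2"); the function name `outcome` and the `"transfer"` label are assumptions made for illustration, not names from the design doc.

```go
package main

import "fmt"

// imbalanceThreshold follows the PR summary: act only when spread >= 2.
const imbalanceThreshold = 2

// outcome classifies one balancer cycle. foundMove reports whether the
// policy found a move that strictly reduces the global spread.
func outcome(spread int, foundMove bool) string {
	switch {
	case spread < imbalanceThreshold:
		// The spread is acceptable; no transfer is even considered.
		return "below_threshold"
	case !foundMove:
		// The spread is bad, but every candidate move is blocked.
		return "no_eligible_move"
	default:
		return "transfer" // hypothetical label for the issued-transfer case
	}
}

func main() {
	fmt.Println(outcome(1, false)) // spread fine
	fmt.Println(outcome(5, false)) // e.g. the 4-node pinned-max case
	fmt.Println(outcome(5, true))  // e.g. the 3-node B->C case
}
```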

All prior-round findings remain intact: saturating-threshold predicate (both underflow directions), voter-map zero-seed, conf-change guard under e.pending.Lock(), errLeadershipTransferInFlight unconditional guard, bool gated = 3/uint64 max_lag = 4 proto extension, mixed-version rollout rule, Progress.RecentActive replacing LastContact, TransferLeadershipToServerIfEligible engine method, SQS exclusion in PR2, eligibility-based two-pass default-group admission.

…-12) Claude round-12 review on PR #953 (commit 99d8ea5): the §3.3 parenthetical claimed 'the policy must instead pick D as the target,' but D also fails the full-map spread check (post-move {A=5, B=3, C=0, D=1} has spread = 5 because C stays at 0). The correct outcome for the 4-node A=5-pinned, B=4, C=D=0 shape is no_eligible_move — neither C nor D can break the global spread when the other tied-min stays at 0. Updated the parenthetical to state the no_eligible_move outcome unambiguously and added the 3-node A=5-pinned, B=4, C=0 contrast (where B→C IS accepted, spread 5→4) so the test plan covers both the eligible-transfer and the no_eligible_move branches of the full-map gate.

bootjp · 2026-06-11T18:04:30Z

@claude review — round 12 (1 doc correction on commit 99d8ea5):

§3.3 parenthetical wrong outcome — fixed in 23c5a07. The 4-node A=5-pinned, B=4, C=D=0 case now correctly states both B→C and B→D fail the full-map check (D fails because C stays at 0 keeping global min pinned); the iteration falls through to no_eligible_move. §5 test plan covers both this case and the 3-node A=5-pinned, B=4, C=0 contrast where B→C IS accepted.

claude · 2026-06-11T18:04:47Z

Claude finished @bootjp's task in 10m 37s —— View job

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 13 — `23c5a07`)

Gather context and understand the request
Read the updated design document (23c5a07)
Verify §3.3 parenthetical fix — both B→C and B→D arithmetic
Verify §5 test plan additions (3-node contrast + 4-node claim)
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept. The §3.3 parenthetical correction is accurate — both B→C and B→D fail the full-map check in the 4-node A=5-pinned, B=4, C=D=0 scenario, and the correct outcome (no_eligible_move) is now stated. One minor documentation inconsistency was introduced: §3.3 makes a forward claim about §5 that is not yet true.

Verification of the round-12 fix (confirmed correct)

§3.3 parenthetical — arithmetic verified:

| Candidate | Pre-spread | Post-move map | Post-spread | Accepted? |
|---|---|---|---|---|
| B → C | 5 (`A=5` − `D=0`) | `{A=5, B=3, C=1, D=0}` | 5 (`A=5` − `D=0`) | Rejected ✓ |
| B → D | 5 (`A=5` − `C=0`) | `{A=5, B=3, C=0, D=1}` | 5 (`A=5` − `C=0`) | Rejected ✓ |

Both candidates fail the full-map check (spread(m') >= spread(m)). The iteration exhausts B's targets and emits no_eligible_move. The explanation in §3.3 — "no single transfer can reduce the global spread because both bottom nodes need to be lifted but only one transfer per cycle is allowed" — is correct.

3-node contrast verified (§5 line 209): A=5 pinned, B=4, C=0 → B→C: post-move {A=5, B=3, C=1}, spread = 5 − 1 = 4 < 5 → accepted ✓. No D at 0 to keep the global minimum pinned; the full-map check passes.

Minor note: §3.3 forward reference to §5 is not yet true

§3.3 line 93 states: "The §5 convergence test now exercises this exact A=5 pinned, B=4, C=D=0 shape as the no_eligible_move regression case AND the 3-node A=5-pinned, B=4, C=0 contrast where B→C IS accepted."

Verified against §5 (lines 208–218):

3-node contrast (B→C accepted): IS in §5 at line 209 as the main case of the "Source fall-through (codex round-6 P2)" bullet. ✓
4-node tied-minimum case (B→C AND B→D rejected → no_eligible_move): is NOT in §5. A search for C=D=0, D=0, and no_eligible_move.*regress returns only line 93 (§3.3), not any line in the §5 unit or integration test sections. The companion cases listed at §5 line 209 are (a), (b), (c) — there is no (d) for the 4-node shape.

The §5 "Source fall-through" companion cases do not include this shape. The integration "convergence test" (§5 line 216) is a 3-node end-to-end test (⌈N/3⌉ convergence) and cannot exercise a 4-node policy scenario as a named unit test case.

Two options to fix:

Add companion case (d) to the "Source fall-through" bullet in §5 (preferred — locks down the invariant for PR2 implementers reading §5 as their test spec):

"(d) 4-node tied-minimum (codex round-8 P2 + claude round-12 correction): A=5 (all-pinned), B=4, C=0, D=0, an eligible B-led group with voters C and D → try B→C: {A=5,B=3,C=1,D=0} spread=5 (rejected — D stays at 0); try B→D: {A=5,B=3,C=0,D=1} spread=5 (rejected — C stays at 0); no eligible B target → no_eligible_move. Contrast: the 3-node main case (above) has no second bottom node to pin the global minimum, so B→C reduces spread 5→4 and is accepted."
Update §3.3 line 93 to say "the PR2 unit tests will cover this exact shape" rather than claiming §5 currently covers it.

This is non-blocking for the proposal merge (tests are for PR2, and the §3.3 parenthetical itself is correct). But the forward reference should be corrected before PR2 starts so the test spec is accurate.

What's well done in this revision

§3.3 full-map check text is now complete and correct. The updated parenthetical states both failure cases (B→C leaves D=0 as global min; B→D leaves C=0 as global min), explains why neither reduces the global spread (two bottom nodes, one transfer per cycle), and gives the correct outcome (no_eligible_move). The "precisely the case the full-map check exists for" characterization is exactly right — the full-map check over the source/target-only guard is motivated by exactly this two-tied-minimum-with-pinned-max shape.

All prior-round findings remain intact. The saturating-threshold predicate (both underflow directions, both subtraction forms correctly rejected), voter-map zero-seed, conf-change guard under e.pending.Lock(), errLeadershipTransferInFlight unconditional guard, bool gated = 3/uint64 max_lag = 4 proto extension, mixed-version rollout rule, Progress.RecentActive replacing LastContact, TransferLeadershipToServerIfEligible engine method, SQS exclusion in PR2, eligibility-based two-pass default-group admission, and leaderbalance_unobservable_groups gauge all remain correctly stated.

…minimum Claude round-13 review on PR #953 (commit 23c5a07) — minor forward-reference correction: §3.3 line 93 claimed §5 covers the 4-node tied-minimum case (A=5 pinned, B=4, C=D=0 → no_eligible_move), but §5's 'Source fall-through' companion cases only listed (a), (b), (c). Implementer reading §5 as the PR2 test spec would have missed this regression case. Added companion case (d) to the Source fall-through bullet: 4-node tied-minimum (A=5 all-pinned, B=4, C=0, D=0) → try B→C ({A=5,B=3,C=1,D=0} spread=5 rejected), try B→D ({A=5,B=3,C=0,D=1} spread=5 rejected), no eligible B target → no_eligible_move. Explicit contrast with the 3-node main case (no second bottom node, B→C reduces 5→4 accepted) so the test spec exercises both halves of the full-map check's two-tied-minimum-with-pinned-max regression.

bootjp · 2026-06-11T18:23:14Z

@claude review — round 14 (1 non-blocking minor on commit 23c5a07):

§5 missing companion case (d) for 4-node tied-minimum — fixed in c7ab367. The Source fall-through bullet now lists case (d): A=5 all-pinned, B=4, C=0, D=0 → both B→C and B→D rejected by the full-map check → no_eligible_move, with explicit contrast to the 3-node main case where B→C IS accepted. §3.3's forward reference now matches reality.

All prior findings remain resolved per round-13 verification.

claude · 2026-06-11T18:23:28Z

Claude finished @bootjp's task in 5m 4s —— View job

Review: `docs/design/2026_06_11_proposed_leader_balance_scheduler.md` (round 14 — `c7ab367`)

Gather context and understand the request
Read the updated design document (c7ab367)
Verify §5 companion case (d) addition and §3.3 forward reference
Verify arithmetic for all 4-node and 3-node examples
Check for new issues introduced by revisions
Post comprehensive review

Overall: Accept. The round-13 non-blocking minor (companion case (d) missing from §5) is correctly resolved. All arithmetic checks out. No new issues were introduced. The document is in a ready-for-implementation state.

Verification of round-13 fix (confirmed correct)

Companion case (d) added to §5 "Source fall-through" bullet (line 209):

| Candidate | Pre-spread | Post-move map | Post-spread | Full-map check | Result |
|---|---|---|---|---|---|
| B → C | 5 (`A=5` − `D=0`) | `{A=5, B=3, C=1, D=0}` | 5 (`A=5` − `D=0`) | `5 >= 5` → reject | Rejected ✓ |
| B → D | 5 (`A=5` − `C=0`) | `{A=5, B=3, C=0, D=1}` | 5 (`A=5` − `C=0`) | `5 >= 5` → reject | Rejected ✓ |

Both targets exhaust the search on source B → no_eligible_move ✓

3-node contrast (main case in §5 line 209, not modified):
A=5 (pinned), B=4, C=0 → B→C: {A=5, B=3, C=1}, spread = 5 − 1 = 4 < 5 → accepted ✓. No D at 0 to keep the global minimum pinned; the 4-node shape is precisely the counterexample where the difference matters.

§3.3 forward reference now matches §5. The clause "The §5 convergence test now exercises this exact A=5 pinned, B=4, C=D=0 shape as the no_eligible_move regression case AND the 3-node A=5-pinned, B=4, C=0 contrast where B→C IS accepted" (line 93) is now satisfied: case (d) covers the former (via the unit test table) and the main Source fall-through case covers the latter. ✓

Attribution in case (d) — "(codex round-8 P2 + claude round-12/13 correction)": codex round-8 identified the need for a full post-move spread check; claude round-12 corrected the §3.3 parenthetical that wrongly said "the policy must instead pick D as the target" (D also fails); claude round-13 identified that the case was still missing from §5. All three attributions are accurate. ✓

Minor observation (no doc change required)

§3.3 labels the case as "§5 convergence test" when it is in the unit test section.

Line 93: "The §5 convergence test now exercises this exact A=5 pinned, B=4, C=D=0 shape..."

The 4-node tied-minimum case is in the unit test bullet (§5 first bullet, "Unit (table-driven, co-located *_test.go)"), not in the integration/convergence test (§5 third bullet, "Integration (the PR0 multi-voter-group topology, §1.1a)"). This label was already present before round 14 and is not a factual correctness issue (the test is in §5 and will exercise the shape). An implementer reading §5 will find the case regardless of which sub-section the cross-reference implies. Non-blocking; noting for completeness.

What's well done in this revision

Case (d) description is complete and precise. It states both rejection reasons explicitly ("D stays at 0, full-map check fails" for B→C; "C stays at 0" for B→D), names the correct outcome (no_eligible_move), and explicitly contrasts with the 3-node main case (which is accepted because there is no second bottom node). The closing sentence — "the 4-node shape exists precisely to exercise the full-map check's two-tied-minimum-with-pinned-max regression that the source/target-only guard would have admitted" — correctly states what the test guards against.

Property test claim "a transfer always strictly decreases the spread" (§5 line 210) is confirmed correct. The full-map check (reject when spread(m') >= spread(m)) guarantees that any issued transfer has spread(m') < spread(m) by definition. The stronger "strictly decreases" claim holds, the property test asserts new_spread < old_spread, and case (d) is consistent with this: the transfer is not issued (no_eligible_move), so the "strict decrease" invariant is never violated. The round-12 correction (promoting "never increases" to "strictly decreases" once the full-map check was added) remains accurate. ✓

All prior-round findings remain intact. Verified key elements:

Saturating threshold (§3.5): threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold — not either bare subtraction form. ✓
Both underflow directions documented (young-group Commit < maxLag; quorum->2 Match > Commit). ✓
Voter-map zero-seed before counting (§3.2). ✓
conf-change guard under e.pending.Lock() (§3.5, OQ-12). ✓
errLeadershipTransferInFlight unconditional guard before rawNode.TransferLeader (§3.5, OQ-15). ✓
bool gated = 3 / uint64 max_lag = 4 proto extension with gated-based routing (§3.4, OQ-4). ✓
--leaderBalanceMaxTargetLag = 0 is valid strictest gate, not a disable (§3.4, §3.6). ✓
Mixed-version rollout rule (§3.4, §3.6). ✓
Progress.RecentActive replacing Status.LastContact (§3.5). ✓
TransferLeadershipToServerIfEligible engine method keeping Progress internal (§3.5, OQ-13). ✓
Two-pass default-group eligibility (pass 1 non-default; pass 2 default only if pass 1 found no eligible move, §3.5 / OQ-3). ✓
leaderbalance_unobservable_groups companion gauge (§3.6). ✓
no_eligible_move label in §3.6 metric enum, semantically distinct from below_threshold. ✓
OQ-3 updated to eligibility-based two-pass rule (§8). ✓
Source fall-through policy in §3.3 (iterate in descending-count order, first eligible strict-reducing pair wins). ✓
Forwarded-path case-2 integration test scenario (§5). ✓
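The saturating-threshold item in the list above can be sketched in Go as follows. The function name `eligible` and the sample values are illustrative; the formula itself is the one quoted in the list. The point of the saturating form is that both bare subtractions wrap around on `uint64`: `commit - maxLag` when a young group has `commit < maxLag`, and (as one reading of the second underflow direction) `commit - match` when `match > commit`.

```go
package main

import "fmt"

// eligible applies the saturating threshold:
//   threshold = commit > maxLag ? commit - maxLag : 0
//   eligible  = match >= threshold
// No subtraction is performed unless it is known not to underflow.
func eligible(match, commit, maxLag uint64) bool {
	var threshold uint64
	if commit > maxLag {
		threshold = commit - maxLag
	}
	return match >= threshold
}

func main() {
	// Young group: commit < maxLag. A bare commit-maxLag would wrap to a
	// huge uint64 and wrongly reject every follower.
	fmt.Println(eligible(0, 3, 10))

	// Caught-up and lagging followers on a mature group.
	fmt.Println(eligible(95, 100, 10))
	fmt.Println(eligible(80, 100, 10))

	// maxLag = 0 is the strictest gate, not a disable switch.
	fmt.Println(eligible(100, 100, 0))
	fmt.Println(eligible(99, 100, 0))

	// match > commit must not underflow either.
	fmt.Println(eligible(101, 100, 10))
}
```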

The document is ready for implementation. All 15 open questions are resolved with concrete recommendations or explicitly deferred with justification.

Issue 1: correct stale line anchors in kv/sharded_coordinator.go
- keyviz sampler observeMutation: 1795-1824 -> 1841-1846
- RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961

Issue 2: soften companion-doc references to in-flight PR #955 form, matching the #945/#951/#953 branch-reference style (3 sites + map line).

Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition.

Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader, caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7 for the single-node->multi-node live cutover / rolling-upgrade strategy.

docs(design): propose leader balance scheduler

d2e84be

gemini-code-assist Bot reviewed Jun 11, 2026

chatgpt-codex-connector Bot reviewed Jun 11, 2026

coderabbitai Bot reviewed Jun 11, 2026

Comment thread docs/design/2026_06_11_proposed_leader_balance_scheduler.md Outdated

chatgpt-codex-connector Bot reviewed Jun 11, 2026

docs(design): address review round 2 — leader balance scheduler

c08e7c8

chatgpt-codex-connector Bot reviewed Jun 11, 2026

docs(design): address review round 4 — leader balance scheduler

1304cfb

chatgpt-codex-connector Bot reviewed Jun 11, 2026

docs(design): strict-gate max_lag semantics + source fallback — leader balance

e789c20

chatgpt-codex-connector Bot reviewed Jun 11, 2026

docs(design): execution-time in-flight-transfer guard — leader balance

14222f2

chatgpt-codex-connector Bot reviewed Jun 11, 2026

This was referenced Jun 12, 2026

docs(design): propose scaling roadmap #954

Open

docs(design): propose multi-node multi-group bootstrap #955

Open


		### 1.1 The problem

		elastickv runs multiple Raft groups in one process (`--raftGroups id=addr,id=addr,…`, `shard_config.go:61-99`; default group is the lowest ID, `shard_config.go:386-397`). Every node is a member of every group: `buildShardGroups` iterates the parsed `groups` once per process and constructs a `raftGroupRuntime` per group (`main.go:786-889`), each with its own engine and its own gRPC listener at `rt.spec.address` (`main.go:1606-1620`). Leadership of each group is elected independently by etcd/raft, so there is no mechanism today that spreads leaderships across nodes. After a rolling restart, a partition heal, or simply unlucky election timing, one node can end up leading every group while its peers lead none. That node then carries all the leader-only work — write proposals, HLC ceiling renewal, lease reads, OCC timestamp issuance, route-catalog proposes — while the rest of the cluster sits idle. The user's explicit goal: "TiKVのようにリーダー以外のノードに別のRaftグループのリーダーをおいて負荷を均等化したい" — put other groups' leaders on the non-leader nodes so the load is even.


		### 3.2 Observation (decision: local per-group `State()` / `Leader()`, no polling RPC)

		Once the PR0 topology (§1.1a) is in place so each group has voters on more than one node, the default-group leader is itself a member (leader or follower) of every group it co-hosts and can read the local engine's view of who leads each group with zero network cost: iterate `runtimes`, and for each `rt.snapshotEngine()` read `State()` and `Leader()` (`internal/raftengine/engine.go:131-138`). (In the single-process multi-group demo this still works, but each group has exactly one voter, so the count map is trivially balanced and no transfer is ever issued — the loop is observe-only there.) This is exactly what `publishLeaderTerms` already does for term (`main.go:2126-2143`); the balancer adds a sibling reader for leader identity. From the per-group `Leader().ID` the scheduler builds the leader-count map `nodeID → number of groups this node currently leads`.
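A minimal sketch of how the leader-count map might be built from local observations is shown below. The `observation` type, its fields, and the string node IDs are hypothetical stand-ins for whatever the per-group `State()` / `Leader()` reads return; the zero-seeding of every voter and the skipping of no-leader groups follow the behavior described elsewhere in this PR.

```go
package main

import "fmt"

// observation is a hypothetical local view of one group, as read from
// the node's own engine for that group.
type observation struct {
	groupID  uint64
	leaderID string   // empty when the group currently has no leader
	voters   []string // node IDs of the group's voters
}

// leaderCounts builds nodeID -> number of groups led. Every voter is
// seeded with 0 first so that a node leading nothing still appears in
// the map and can be chosen as the minimum.
func leaderCounts(obs []observation) map[string]int {
	counts := make(map[string]int)
	for _, o := range obs {
		for _, v := range o.voters {
			if _, ok := counts[v]; !ok {
				counts[v] = 0
			}
		}
	}
	for _, o := range obs {
		if o.leaderID == "" {
			continue // no-leader groups are skipped
		}
		counts[o.leaderID]++
	}
	return counts
}

func main() {
	voters := []string{"n1", "n2", "n3"}
	obs := []observation{
		{groupID: 1, leaderID: "n1", voters: voters},
		{groupID: 2, leaderID: "n1", voters: voters},
		{groupID: 3, leaderID: "n3", voters: voters},
		{groupID: 4, leaderID: "", voters: voters},
	}
	fmt.Println(leaderCounts(obs))
}
```

Without the zero-seed, `n2` would be absent from the map and the spread would be computed as 2 − 1 rather than 2 − 0.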


		A group is eligible for a transfer this cycle only if all of the following hold; otherwise it is skipped (and counted in a skip metric):

		- No conf-change in flight. Skip a group whose membership is changing — a transfer racing a conf-change can land on a member about to be removed. The cleanest in-tree signal (resolving OQ-5) is to surface the existing per-group pending-config state on the leader through the exported `Status`: the etcd backend already tracks `pendingConfigs` (`internal/raftengine/etcd/engine.go:390`, `:563`), but it is not on the public `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`). The recommendation is to add a `PendingConfChange bool` field to `Status` populated from `len(e.pendingConfigs) > 0` (mirroring how `LeadTransferee` was added to the struct), rather than leaking etcd internals via `rawNode.BasicStatus().Config.PendingConfIndex` past the engine boundary or adding a dedicated `Admin` method. This is only observable on the group leader (where conf-changes are proposed), which fits the §3.4 ownership rule — in case 2 the executing remote leader sees it, and the balancer's forwarded transfer will be rejected if the leader has a conf-change pending.
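The recommended `PendingConfChange` field could be used along these lines. This is a sketch under stated assumptions: `statusSketch` is a stand-in that models only the proposed field (the real `raftengine.Status` has other fields not shown here), and `newStatus`, `skipReason`, and the `"conf_change_in_flight"` label are hypothetical names.

```go
package main

import "fmt"

// statusSketch is a hypothetical stand-in for raftengine.Status that
// models only the proposed field.
type statusSketch struct {
	PendingConfChange bool
}

// newStatus mirrors the proposed population rule:
// PendingConfChange = len(pendingConfigs) > 0.
func newStatus(pendingConfigs int) statusSketch {
	return statusSketch{PendingConfChange: pendingConfigs > 0}
}

// skipReason returns a non-empty (hypothetical) skip label when the
// group must not be balanced this cycle, or "" when it is eligible on
// this criterion.
func skipReason(s statusSketch) string {
	if s.PendingConfChange {
		return "conf_change_in_flight"
	}
	return ""
}

func main() {
	fmt.Printf("%q\n", skipReason(newStatus(0)))
	fmt.Printf("%q\n", skipReason(newStatus(1)))
}
```

Keeping the check behind a plain boolean on the exported status means the balancer never needs to see etcd-level types, which is the boundary the bullet above argues for.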


gemini-code-assist Bot left review comments (Jun 11, 2026):

- Risk of Transfer Storms on Default-Group Leadership Changes
- Authentication Gap for Forwarded gRPC Requests
- Technical Gap: Followers Lack Progress and Liveness Information for Target Selection

