docs(design): propose leader balance scheduler#953
Conversation
|
@claude review |
|
Warning Review limit reached
More reviews will be available in 13 minutes and 42 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThis PR adds a design proposal for an in-process leader-balancing scheduler that distributes Raft-group leaderships across cluster nodes. It specifies observation mechanisms, deterministic balancing policy with hysteresis, safety exclusions, transfer execution via gRPC, operational controls, and a phased implementation plan with testing strategy. ChangesLeader Balance Scheduler Design Proposal
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Claude finished @bootjp's task in 4m 10s —— View job Review:
|
There was a problem hiding this comment.
Code Review
This pull request proposes a design document for a Leader Balance Scheduler to automatically distribute Raft-group leaderships across nodes in elastickv. The feedback highlights three key technical issues to address in the design: the risk of transfer storms or thrashing during default-group leadership changes (recommending a startup warm-up delay), an authentication gap for forwarded gRPC requests when --adminTokenFile is enabled, and a fundamental technical gap where follower nodes lack the replication progress and liveness information of peer followers needed to safely select a transfer target.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| - The M3 auto-split scheduler is placed there too (sibling doc §4.1, §7.6) — co-locating the leader balancer keeps "one scheduler seat per cluster" rather than electing yet another coordinator. | ||
| - There is exactly one default-group leader at a time, so the scheduler is singleton without extra leader election. | ||
|
|
||
| **State reset on election.** All balancer state (per-group cooldown deadlines, the global cooldown deadline, the last observed leader map) is **leader-local, in-memory, not Raft-replicated**. On a default-group leadership change the deposed leader's scheduler goroutine stops (via `RegisterLeaderLossCallback`, `kv/coordinator.go:131`) and the new leader starts with empty state. The worst case of a lost cooldown is a *too-soon* extra transfer (mildly wasteful, self-correcting next cycle), never an unsafe action — so non-replication is the right cost/safety trade (same rationale as M3 §7.6). We state this explicitly so it is not mistaken for an oversight. |
There was a problem hiding this comment.
Risk of Transfer Storms on Default-Group Leadership Changes
If the default-group leader flaps rapidly (e.g., due to transient network instability or CPU starvation), each newly elected leader will start with an empty cooldown state and might immediately trigger a transfer. This can lead to a "transfer storm" or thrashing of group leaderships across the cluster.
This risk is particularly acute when transferring the leadership of the default group itself. Since the balancer runs on the default-group leader, transferring the default group forces the balancer to restart on a new node with a completely empty cooldown state. Without an initial startup delay, the new balancer might immediately trigger subsequent transfers, leading to thrashing.
Recommendation:
Introduce an initial startup delay or "warm-up" cooldown for the scheduler upon election (e.g., waiting at least one leaderBalanceInterval or a global cooldown before initiating the first transfer). This ensures the cluster has stabilized after a default-group election before the balancer begins making changes.
There was a problem hiding this comment.
Already in §3.1 line 75 ("Startup grace period (anti-storm on balancer-host churn)."): on every acquisition of the balancer role the scheduler arms a startup-grace deadline of max(leaderBalanceGlobalCooldown, one leaderBalanceInterval) during which it observes-and-emits-metrics-but-issues-no-transfer, and the deadline is re-armed on every role acquisition so a flapping default leader cannot accumulate enough stable time to fire. The recursive default-group-self-transfer case the recommendation calls out is also covered there. Skipped during the grace window are counted under leaderbalance_skipped_total{reason=startup_grace}.
| The balancer host (the default-group leader) is a member of every group but is, in general, a **follower** of the group it wants to rebalance (indeed, the whole point is to move leadership *off* the over-loaded node, which may or may not be the balancer host). Two sub-cases: | ||
|
|
||
| 1. **The over-loaded source node is the balancer host itself** (it leads the target group). Then the balancer calls `engine.TransferLeadershipToServer(ctx, target.ID, target.Address)` directly on the **local** runtime's engine for that group — the local engine is the leader, the call is accepted, and the goroutine-offload + idempotency patterns from `main_sqs_leadership_refusal.go:88-95` apply verbatim. | ||
| 2. **The over-loaded source node is a *different* node.** The balancer host is a follower of that group and cannot initiate the transfer locally. It **forwards** the request to the source node's leader of that group, using the in-tree per-group `RaftAdmin` gRPC service: dial the source node's group listener (`rt.spec.address` for that group is in the group's `Configuration`, available via `engine.Configuration(ctx)`, `internal/raftengine/engine.go:213-215`) and call `RaftAdmin.TransferLeadership` with `TargetId`/`TargetAddress` set to the chosen target (the same RPC `cmd/raftadmin` uses, `internal/raftadmin/server.go:155-172`, `cmd/raftadmin/main.go:359-378`). The receiving node's engine is the leader of that group, so `handleTransferLeadership` accepts it. |
There was a problem hiding this comment.
Authentication Gap for Forwarded gRPC Requests
If the cluster is configured with --adminTokenFile (enabling bearer-token authentication on the RaftAdmin gRPC service), any forwarded TransferLeadership requests sent by the balancer host to other nodes will be rejected with Unauthenticated or PermissionDenied unless the balancer's gRPC client presents the correct token.
Recommendation:
The design should explicitly specify how the balancer host obtains and attaches this token (e.g., by sharing the loaded token or the token file path with the balancer's gRPC client/connection cache) to ensure seamless operation in secure environments.
| A group is **eligible** for a transfer this cycle only if **all** of the following hold; otherwise it is skipped (and counted in a skip metric): | ||
|
|
||
| - **No conf-change in flight.** Skip a group whose membership is changing. (Surface via the engine status / configuration read; never transfer mid-membership-change — a transfer racing a conf-change can land on a member about to be removed.) **OQ-5** asks for the cleanest in-tree signal for "conf-change pending" per group. | ||
| - **Target follower is caught up / healthy.** etcd/raft only *completes* a transfer once the transferee's log has caught up to the leader (`waitForLeadershipTransfer` will otherwise observe the transfer abort, `internal/raftengine/etcd/engine.go:1384-1405`, `:1431-1437`). Transferring to a lagging follower stalls writes on that group until the transferee catches up or raft aborts. The scheduler therefore **must not pick a lagging follower as target**: it filters targets to followers whose `Progress.Match` (or the available `Status` liveness/last-contact proxy) is close enough to the leader's commit index, and prefers a follower with recent contact. The targeted transfer also fails fast at submit time if raft drops it (`errLeadershipTransferRejected`, `:1765-1768`), so a bad target surfaces as a logged failure rather than a silent stall. |
There was a problem hiding this comment.
Technical Gap: Followers Lack Progress and Liveness Information for Target Selection
There is a fundamental technical gap in the proposed target selection mechanism: in etcd/raft, only the active leader of a Raft group maintains the Progress map (tracking Match and Next indices) and receives active heartbeat responses (MsgHeartbeatResp / MsgAppResp) from followers. Followers do not exchange heartbeats with each other and do not have access to the replication progress or liveness of peer followers.
Since the balancer host (the default-group leader) is a follower for most other Raft groups, its local engine for those groups will have an empty/unpopulated Progress map and no liveness/last-contact info for peer followers. Thus, the balancer host cannot locally filter or select a healthy/caught-up target follower for groups it does not lead.
If it attempts to make a targeted transfer (TransferLeadershipToServer) without this information, it risks selecting a lagging, partitioned, or dead follower, which will stall the group's write path (since etcd/raft drops proposals while LeadTransferee is set).
Recommendation:
Consider one of the following approaches to resolve this:
- Ordered Preference List: The balancer host passes an ordered list of preferred target nodes (from least-loaded to most-loaded) when forwarding the request to the group's leader. The group's leader then selects the first target in the list that is healthy and caught up according to its local
Progressmap. - Leader-Delegated Selection: The balancer host queries the group's leader for the list of healthy/caught-up voters before making the decision, or delegates the final target selection to the group's leader within a set of acceptable least-loaded candidates.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d2e84be826
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| ### 1.1 The problem | ||
|
|
||
| elastickv runs multiple Raft groups in one process (`--raftGroups id=addr,id=addr,…`, `shard_config.go:61-99`; default group is the lowest ID, `shard_config.go:386-397`). Every node is a member of **every** group: `buildShardGroups` iterates the parsed `groups` once per process and constructs a `raftGroupRuntime` per group (`main.go:786-889`), each with its own engine and its own gRPC listener at `rt.spec.address` (`main.go:1606-1620`). Leadership of each group is elected independently by etcd/raft, so there is **no mechanism today that spreads leaderships across nodes**. After a rolling restart, a partition heal, or simply unlucky election timing, one node can end up leading every group while its peers lead none. That node then carries all the leader-only work — write proposals, HLC ceiling renewal, lease reads, OCC timestamp issuance, route-catalog proposes — while the rest of the cluster sits idle. The user's explicit goal: "TiKVのようにリーダー以外のノードに別のRaftグループのリーダーをおいて負荷を均等化したい" — put other groups' leaders on the non-leader nodes so the load is even. |
There was a problem hiding this comment.
Add the missing multi-node multi-group topology milestone
This premise does not hold for the current server: resolveBootstrapServers rejects --raftBootstrapMembers whenever len(groups) != 1 (main.go:742-748), and the existing M5 script explicitly treats multi-group as a single-process/single-member topology. As written, the later scheduler PRs have no deployable/testable topology where every node is a voter in every group, so TransferLeadershipToServer cannot actually spread Raft-group leaders across nodes until the design first adds per-group peer/bootstrap membership support.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
docs/design/2026_06_11_proposed_leader_balance_scheduler.md (4)
82-87: 💤 Low valueAcknowledge TOCTOU race in forwarded transfer path.
When the balancer observes "node X leads group Y" and forwards a
TransferLeadershipRPC to X, leadership of Y could change between observation and RPC delivery. The design notes (line 81) that the call will fail witherrLeadershipTransferNotLeaderbut doesn't discuss whether the balancer retries, logs the failure, or simply waits for the next cycle.Consider adding a brief note to §3.4 that forwarded-path failures due to leadership changes are tolerated (counted in
leaderbalance_transfers_failed_total{reason=not_leader}, no retry, next cycle observes fresh state). This clarifies the expected behavior and prevents implementation uncertainty.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md` around lines 82 - 87, Add a short paragraph to §3.4 stating that forwarded TransferLeadership RPCs (sent via RaftAdmin.TransferLeadership after reading engine.Configuration) may race with actual leadership changes and can return errLeadershipTransferNotLeader; these failures must be treated as tolerated (increment leaderbalance_transfers_failed_total{reason="not_leader"}), do not trigger immediate retries, and are resolved by the balancer observing fresh state on the next scheduling cycle rather than retrying the RPC inline.
102-103: 💤 Low valueDocument HLC transfer latency assumption.
The safety argument for default-group balancing states: "the window (3 s) exceeds the renewal interval (1 s)" so a transfer preserves the ceiling invariant. This assumes the transfer completes quickly relative to the window. If a transfer takes multiple seconds (e.g., due to slow target catch-up or network delays), the new leader might start renewing closer to the old ceiling's expiration.
Add a sentence noting that
TransferLeadershipis designed to complete within one election timeout (typically < 1s when the target is healthy), which is well within the 3s window margin. If etcd/raft can delay transfers longer, document that the ceiling safety still holds because the new leader clampsNext()tomax(wall, ceiling)regardless of transfer latency—this clarifies the invariant doesn't depend on transfer speed.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md` around lines 102 - 103, Add a short clarifying sentence to the HLC transfer paragraph stating that TransferLeadership is intended to complete within one election timeout (typically <1s for a healthy target), which is safely inside hlcPhysicalWindowMs (3s) given hlcRenewalInterval (1s); also note that even if etcd/raft delays TransferLeadership longer, the HLC ceiling safety still holds because the new leader’s Next() clamps to max(wall, ceiling), and leader loss triggers lease.invalidate via RegisterLeaderLossCallback so the first read will take the LinearizableRead path until the lease warms.
122-122: ⚖️ Poor tradeoffConsider deferring forwarded transfer to PR3 to reduce PR2 scope.
PR2 includes both the local-leader transfer path (case 1, §3.4) and the forwarded-path gRPC client (case 2). The forwarded path adds gRPC dial/credentials handling, connection caching, remote error handling, and the TOCTOU failure mode—significant incremental complexity.
Splitting the implementation could make PR2 easier to review:
- PR2: Policy + loop + local transfer only (balancer acts only when it is the over-loaded source node).
- PR3: Forwarded path + SQS/pinning awareness.
This keeps PR2 focused on the core policy/loop correctness and defers the distributed-call complexity. The trade-off is that PR2 alone won't balance clusters where the default-group leader is under-loaded, but it's still independently useful and verifiable. Alternatively, keep the current plan if you prefer end-to-end balancing in PR2—this is a reviewability suggestion, not a correctness issue.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md` at line 122, The reviewer suggests reducing PR2 scope by removing the forwarded-transfer path and keeping only the policy function, scheduler loop, and local transfer execution behind the --leaderBalance switch; refactor so the policy function (policy* / makePolicy / decideTransfers), scheduler loop (schedulerLoop / runBalancer), and local transfer code (executeLocalTransfer / performTransfer) remain, while extracting or gating out forwardedTransfer / grpcClient / connectionCache / remoteTransfer logic and related tests into a follow-up PR3; update integration tests (demo.go) and unit tests to only exercise local-source transfers and add TODOs/placeholders for the forwarded-path, ensuring all forwarded-path gRPC dialing, credentials, remote error handling, and TOCTOU handling are removed from PR2.
133-133: ⚡ Quick winExpand integration test scenarios beyond 3-node.
The test strategy specifies a 3-node convergence test but doesn't cover other important cluster/group configurations that stress different policy code paths.
Suggested additional integration test scenarios
Add to the test plan (or list as future integration tests):
- 2-node cluster with odd number of groups (e.g., 3 groups): spread is at most 1, balancer should remain idle (threshold not met).
- 5-node cluster with 4 groups: one node will always have 0 leaders; verify balancer stabilizes at
⌈4/5⌉ = 1leader per node max.- All groups initially on the default-group leader: verify the balancer balances away from itself (self-evacuation case).
- Forwarded-path failure injection: simulate leadership change of target group between observation and RPC; verify the failure is logged and the next cycle recovers.
These cases exercise tie-break logic, self-transfer, and forward-path error handling that the 3-node test might not hit.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md` at line 133, Update the integration test plan to add the suggested scenarios beyond the 3-node demo: add a 2-node cluster test with an odd number of groups (e.g., 3) to assert the balancer remains idle, a 5-node cluster with 4 groups to assert max leaders per node is ⌈4/5⌉, a test where all groups start on the default-group leader to exercise self-evacuation, and a forwarded-path failure injection where a target group’s leadership changes between observation and RPC to assert proper logging and recovery; wire these into the same demo harness used by cmd/server/demo.go and the --leaderBalance toggle so they run under the same convergence/assertion logic and include assertions that the cluster stabilizes and does not ping-pong after convergence.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md`:
- Line 97: The design doc leaves "caught up / healthy" vague; update the
scheduler section to specify concrete numeric eligibility rules and proxies:
state that a follower is eligible iff Progress.Match >= leader.CommitIndex -
maxLag (define maxLag in index units, e.g., 10 entries) and/or if Progress isn't
available use Status.LastContact within healthThreshold (expressed in ms and
tied to electionTimeout, e.g., LastContact <= electionTimeout/2); explicitly
state learners are filtered out before submission (handleTransferLeadership will
also reject them) and note that submit-time rejections surface as
errLeadershipTransferRejected so the scheduler still logs failures.
---
Nitpick comments:
In `@docs/design/2026_06_11_proposed_leader_balance_scheduler.md`:
- Around line 82-87: Add a short paragraph to §3.4 stating that forwarded
TransferLeadership RPCs (sent via RaftAdmin.TransferLeadership after reading
engine.Configuration) may race with actual leadership changes and can return
errLeadershipTransferNotLeader; these failures must be treated as tolerated
(increment leaderbalance_transfers_failed_total{reason="not_leader"}), do not
trigger immediate retries, and are resolved by the balancer observing fresh
state on the next scheduling cycle rather than retrying the RPC inline.
- Around line 102-103: Add a short clarifying sentence to the HLC transfer
paragraph stating that TransferLeadership is intended to complete within one
election timeout (typically <1s for a healthy target), which is safely inside
hlcPhysicalWindowMs (3s) given hlcRenewalInterval (1s); also note that even if
etcd/raft delays TransferLeadership longer, the HLC ceiling safety still holds
because the new leader’s Next() clamps to max(wall, ceiling), and leader loss
triggers lease.invalidate via RegisterLeaderLossCallback so the first read will
take the LinearizableRead path until the lease warms.
- Line 122: The reviewer suggests reducing PR2 scope by removing the
forwarded-transfer path and keeping only the policy function, scheduler loop,
and local transfer execution behind the --leaderBalance switch; refactor so the
policy function (policy* / makePolicy / decideTransfers), scheduler loop
(schedulerLoop / runBalancer), and local transfer code (executeLocalTransfer /
performTransfer) remain, while extracting or gating out forwardedTransfer /
grpcClient / connectionCache / remoteTransfer logic and related tests into a
follow-up PR3; update integration tests (demo.go) and unit tests to only
exercise local-source transfers and add TODOs/placeholders for the
forwarded-path, ensuring all forwarded-path gRPC dialing, credentials, remote
error handling, and TOCTOU handling are removed from PR2.
- Line 133: Update the integration test plan to add the suggested scenarios
beyond the 3-node demo: add a 2-node cluster test with an odd number of groups
(e.g., 3) to assert the balancer remains idle, a 5-node cluster with 4 groups to
assert max leaders per node is ⌈4/5⌉, a test where all groups start on the
default-group leader to exercise self-evacuation, and a forwarded-path failure
injection where a target group’s leadership changes between observation and RPC
to assert proper logging and recovery; wire these into the same demo harness
used by cmd/server/demo.go and the --leaderBalance toggle so they run under the
same convergence/assertion logic and include assertions that the cluster
stabilizes and does not ping-pong after convergence.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: abd72a72-2b66-4689-9ff8-b39225e520c6
📒 Files selected for processing (1)
docs/design/2026_06_11_proposed_leader_balance_scheduler.md
- P1 (codex): add 1.1a topology prerequisite + PR0 milestone. The premise "every node is a member of every group" does not hold for any deployable multi-node topology: resolveBootstrapServers rejects multi-group bootstrap (main.go:742-748), groupSpec is single-address (shard_config.go:14-17), buildRuntimeForGroup gives every group nil peers (multiraft_runtime.go:246-254), and run-jepsen-m5-local.sh documents single-process single-member-per-group only. Scheduler PRs are blocked on PR0. OQ-9 records bootstrap-extension vs. AddVoter-composition. - gemini (1): add startup grace period (3.1) after acquiring the balancer role to break the transfer-storm loop when the default-group leader (the balancer host) flaps; --leaderBalanceStartupGrace flag + startup_grace skip metric. - gemini (2): state RaftAdmin security posture (3.4 + Security lens). Corrected mechanism: --adminTokenFile gates only /Admin/, not /RaftAdmin/ (admin_grpc.go:484,498) — so forwarded transfers are NOT token-rejected, but RaftAdmin is unauthenticated; require trusted-network boundary. OQ-11. - gemini (3): make target catch-up validation the executing group leader's job (3.5) — Progress is leader-only and absent from exported Status; balancer passes a preference, the leader (case 1 local / case 2 remote) gates health. OQ-10 for ordered preference list. - Also: LeadTransferee!=0 skip + OQ-5 PendingConfChange recommendation; updated milestones (PR0), test strategy, five-lens table, lifecycle. Doc-only; no code/tests.
|
@claude review |
|
Claude finished @bootjp's task in 3m 41s —— View job Review:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8b0e09411c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| ### 3.2 Observation (decision: local per-group `State()` / `Leader()`, no polling RPC) | ||
|
|
||
| Once the PR0 topology (§1.1a) is in place so each group has voters on more than one node, the default-group leader is itself a member (leader or follower) of every group it co-hosts and can read **the local engine's view of who leads each group** with zero network cost: iterate `runtimes`, and for each `rt.snapshotEngine()` read `State()` and `Leader()` (`internal/raftengine/engine.go:131-138`). (In the single-process multi-group demo this still works, but each group has exactly one voter, so the count map is trivially balanced and no transfer is ever issued — the loop is observe-only there.) This is exactly what `publishLeaderTerms` already does for term (`main.go:2126-2143`); the balancer adds a sibling reader for leader identity. From the per-group `Leader().ID` the scheduler builds the **leader-count map** `nodeID → number of groups this node currently leads`. |
There was a problem hiding this comment.
Seed zero-leader voters in the count map
When the starting state is the one this scheduler is meant to fix (for example A leads every group and B/C lead none), building the leader-count map only from observed Leader().ID values produces {A: N} and omits B/C, so max(count) - min(count) is 0 and the policy never fires. The observation/policy needs to seed all eligible voter node IDs from each group's Configuration() with count 0 before applying the threshold, otherwise the PR2 convergence test that forces all leaders onto one node cannot pass.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Already in §3.2 line 81 ("Seed the count map with every voter at zero before counting leaders."). The doc explicitly seeds the union of voter IDs from each co-hosted group's Configuration() with count 0 before incrementing from observed leaders, calls out the exact worst-case the comment describes ('node A leads every group, nodes B and C lead none ... max(count) - min(count) is N - N = 0 ... never fires'), and gates the PR2 convergence test (§5) on this seed so the policy fires on the very imbalance the scheduler exists to fix. Voters with Suffrage == "voter" are seeded; learners are excluded because handleTransferLeadership rejects them as targets.
|
|
||
| A group is **eligible** for a transfer this cycle only if **all** of the following hold; otherwise it is skipped (and counted in a skip metric): | ||
|
|
||
| - **No conf-change in flight.** Skip a group whose membership is changing — a transfer racing a conf-change can land on a member about to be removed. The cleanest in-tree signal (resolving **OQ-5**) is to **surface the existing per-group pending-config state on the leader through the exported `Status`**: the etcd backend already tracks `pendingConfigs` (`internal/raftengine/etcd/engine.go:390`, `:563`), but it is not on the public `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`). The recommendation is to add a `PendingConfChange bool` field to `Status` populated from `len(e.pendingConfigs) > 0` (mirroring how `LeadTransferee` was added to the struct), rather than leaking etcd internals via `rawNode.BasicStatus().Config.PendingConfIndex` past the engine boundary or adding a dedicated `Admin` method. This is only observable on the group leader (where conf-changes are proposed), which fits the §3.4 ownership rule — in case 2 the executing remote leader sees it, and the balancer's forwarded transfer will be rejected if the leader has a conf-change pending. |
There was a problem hiding this comment.
Reject forwarded transfers during pending config changes
This says the case-2 forwarded transfer will be rejected when the remote group leader has a pending conf-change, but the current RaftAdmin.TransferLeadership path just calls TransferLeadershipToServer, and handleTransferLeadership only checks leader state/raft acceptance, not pendingConfigs. Surfacing PendingConfChange in Status is not enough for remote-source groups unless the design also adds a server-side guard or an internal RPC check; otherwise the scheduler can still transfer during a pending Add/Remove, violating this safety exclusion.
Useful? React with 👍 / 👎.
|
Addressed review round 2 in c08e7c8 (doc-only). Per-finding dispositions: codex P1 — "Seed zero-leader voters in the count map" (ACCEPTED). §3.2 now mandates seeding the leader-count map with count 0 for every voter node ID across all co-hosted groups before counting observed leaders. Without the seed, the very state the scheduler exists to fix (A leads all groups, B/C lead none) builds codex P2 — "Reject forwarded transfers during pending config change" (ACCEPTED). Verified the claim: claude #1 — §1.1a self-contradicts on PR1 blocking (ACCEPTED). PR1 is observation-only and issues no transfers, so it is not blocked on PR0. §1.1a now reads "Only the transfer-issuing scheduler PRs (PR2–PR3) are blocked on PR0; PR1 ships even against today's single-voter topology." Consistent with the PR1 milestone row and §9 lifecycle. claude #2 — case-1 "may" → "should" (ACCEPTED). §3.5 case-1: in case 1 the balancer is the leader and holds the live Progress map, so it is the most accurate place to filter and should pre-filter (skipping it risks a write stall on a lagging target). claude #3 / coderabbit Critical — concrete catch-up threshold (ACCEPTED). §3.5 case-1 now gives implementable numbers: caught-up = claude #4 — OQ-3 recommendation (ACCEPTED). Adopted: balance the default group LAST — included in the eligible-source set only once every other group's source node is ≤ claude #5 — OQ-9 recommendation (ACCEPTED). Adopted option (a) — extend the bootstrap/flag surface (lift the claude #6 — homogeneous-membership constraint (ACCEPTED). New §2.2 Non-Goal: heterogeneous group membership across nodes is out of scope for v1; the scheduler assumes every default-group participant is a member of every other group. claude #7 — claude #8 — forwarded-path integration test (ACCEPTED). §5 adds a case-2 scenario: force the imbalance on a non-default-group-leader node, assert the balancer forwards over claude #9 — "PR-doc" label (ACCEPTED). Milestone row relabeled "Doc lifecycle (this PR → gemini line 55/86 (already addressed at prior HEAD, no change needed). Transfer-storm → §3.1 startup grace; forwarded-auth gap → §3.4 security posture + OQ-11. Status stays @claude review |
|
Claude finished @bootjp's task in 5m 28s —— View job Review:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c08e7c8f55
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **Recent contact (liveness):** `Status.LastContact < electionTimeout / 2`, using the already-exported `LastContact time.Duration` field on `raftengine.Status` (`internal/raftengine/engine.go:77`). Half an election timeout is the standard liveness margin: a follower the leader has heard from within that window is very unlikely to be partitioned or dead, so the transfer will not stall on an unreachable transferee. The bound is operator-tunable via `--leaderBalanceMaxTargetLastContact` (default `electionTimeout / 2`). | ||
| - **Learners excluded** before submission (and `handleTransferLeadership` rejects them as "not a voter" as a backstop, `internal/raftengine/etcd/engine.go:1765-1768`). | ||
| Both thresholds are constants available at scheduler construction time and are plumbed as the two flags above. If a future revision prefers to lean entirely on the submit-time rejection even in case 1, that must be stated explicitly; v1 applies the pre-filter in case 1. | ||
| - **Case 2 (forwarded to a remote leader):** the balancer cannot pre-filter target catch-up *or* see the remote leader's pending conf-change locally — both are enforced on the executing leader. It forwards its chosen target (and, in PR3, optionally an ordered preference list) to the source-group leader; the executing leader rejects (1) a transfer racing a membership change via the conf-change guard of the "No conf-change in flight" bullet above (`errLeadershipTransferConfChangePending`), and (2) an un-caught-up target at submit time (`errLeadershipTransferRejected` when raft drops the transfer because the target has no/insufficient progress, `:1765-1768`); `waitForLeadershipTransfer` fails closed if the transfer later aborts (`:1384-1405`, `:1431-1437`). The balancer maps these outcomes to a logged skip + `leaderbalance_skipped_total{reason=conf_change}` / `leaderbalance_transfers_failed_total{reason=rejected|aborted}` and tries again next cycle. |
There was a problem hiding this comment.
Add a real remote catch-up guard before transfer
In the forwarded case where the least-loaded target is a lagging voter, this relies on TransferLeadershipToServer returning errLeadershipTransferRejected for insufficient progress, but etcd/raft v3.6 accepts a voter with a Progress entry, sets leadTransferee, and only sends append traffic until the target catches up or times out. Because PR2 only plans Progress exposure for the case-1 local filter, following this design can stall writes on the remote-led group for an election timeout whenever the balancer picks a stale follower; the executing leader needs an explicit Progress/health check before calling TransferLeader.
Useful? React with 👍 / 👎.
| - **Concretely:** | ||
| - **Case 1 (balancer host leads the group):** the balancer *does* have the live Progress map for that group, so it **should** pre-filter the preferred target before calling `TransferLeadershipToServer` locally — in case 1 the balancer *is* the leader, so it is the most accurate place to filter, and skipping the filter would expose the group to a write stall if a lagging target were chosen and raft had to wait for it to catch up. The concrete eligibility predicate (no longer "close enough" hand-waving): | ||
| - **Caught up:** `target.Progress.Match >= leader.CommitIndex - maxLag`, where **`maxLag` is in log-entry units** with default **`--leaderBalanceMaxTargetLag = 1024` entries**. Rationale: this is roughly the amount a healthy follower can fall behind within one election timeout under normal append throughput, so a target at most one election-timeout's worth of log behind is still able to complete the transfer quickly once it becomes leader-transferee; tighter values risk never finding an eligible target under steady write load, looser values risk a longer write stall on the new leader's group while raft waits. `Progress.Match` is the leader-side per-peer matched index already maintained by etcd/raft (used in-tree by `promoteLearner`, `internal/raftengine/etcd/engine.go:1679`); exposing it to the case-1 filter requires either reading it inside the engine (case 1 runs on the local leader) or a small accessor — tracked under the same Progress-exposure note as case 2. | ||
| - **Recent contact (liveness):** `Status.LastContact < electionTimeout / 2`, using the already-exported `LastContact time.Duration` field on `raftengine.Status` (`internal/raftengine/engine.go:77`). Half an election timeout is the standard liveness margin: a follower the leader has heard from within that window is very unlikely to be partitioned or dead, so the transfer will not stall on an unreachable transferee. The bound is operator-tunable via `--leaderBalanceMaxTargetLastContact` (default `electionTimeout / 2`). |
There was a problem hiding this comment.
Do not use leader Status.LastContact for target liveness
For the local-leader case, this predicate cannot identify whether the target follower is recently reachable: Status.LastContact is a single field on the local engine, and lastContactFor returns 0 whenever that local engine is StateLeader. In a case-1 transfer the balancer is reading the leader's status, so this liveness check always passes even if the chosen target is partitioned; the design should instead expose per-peer leader-side activity/Progress or drop this as a claimed filter.
Useful? React with 👍 / 👎.
- codex P1: etcd/raft v3.6 accepts a transfer to any lagging voter (sets leadTransferee, pauses proposals) rather than rejecting it; add an engine-side catch-up guard (Progress.Match >= Commit - maxLag) inside handleTransferLeadership before rawNode.TransferLeader, exposed via a new TransferLeadershipToServerIfEligible method used by BOTH case 1 (local) and case 2 (remote leader). Progress exposure lands at the guard in PR2. - codex P2 / claude #1: Status.LastContact is 0 on a leader (lastContactFor returns 0 for StateLeader) and tracks only this node's contact with ITS leader — drop it as the liveness gate; use leader-side per-peer Progress.RecentActive. Remove --leaderBalanceMaxTargetLastContact. - claude #2: the conf-change guard must read len(e.pendingConfigs) under e.pending.Lock() (cancelPendingConfig mutates off the event loop); same for the OQ-5 PendingConfChange Status snapshot. - claude #3: resolve the Progress-exposure hand-wave → engine method, not exported Progress (new OQ-13). - claude #4: maxLag=1024 rationale corrected to throughput-dependent with a tuning formula. Updated §1.3, §3.4, §3.5, §3.6, §4 (PR2), §6, OQ-5/10/12 and added OQ-13 consistently. Status stays Proposed. Doc-only.
|
Round 3 addressed — pushed codex P1 — "Add a real remote catch-up guard before transfer" — ACCEPTED (correct)Verified: etcd/raft v3.6 Fix: the executing group leader applies the §3.5 catch-up predicate ( codex P2 — "Do not use leader Status.LastContact for target liveness" — ACCEPTED (correct)Verified: Fix: drop claude review round 3 (job finished, folded in)
No findings rebutted. @claude review |
|
Claude finished @bootjp's task in 7m 10s —— View job Review:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef5dbd9b9d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| - **No conf-change in flight — enforced on the executing leader, not only at the balancer's decision time.** Skip a group whose membership is changing — a transfer racing a conf-change can land on a member about to be removed. Two distinct enforcement points are required, because the balancer's decision-time read and the transfer's execution can happen on different nodes (case 2, §3.4): | ||
| - **Decision-time observability (resolves OQ-5).** Surface the existing per-group pending-config state through the exported `Status`: the etcd backend already tracks `pendingConfigs` (`map[uint64]adminRequest`, guarded by `e.pending`, `internal/raftengine/etcd/engine.go:390`, `:563`, populated in `storePendingConfig` `:3138`), but it is not on the public `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`). Add a `PendingConfChange bool` field to `Status` populated from `len(e.pendingConfigs) > 0` (mirroring how `LeadTransferee` was added to the struct), rather than leaking etcd internals via `rawNode.BasicStatus().Config.PendingConfIndex` past the engine boundary or adding a dedicated `Admin` method. This lets the balancer skip a group whose leader it *can* see has a pending conf-change. In **case 1** (the balancer host leads the group) this read is local and authoritative. | ||
| - **Execution-time guard on the group leader (required for case 2).** Decision-time `Status` is **not sufficient for the forwarded path**: in case 2 the balancer is a *follower* of the target group, so its local `Status().PendingConfChange` reflects only the local node and not the remote leader's `pendingConfigs`. And the current `RaftAdmin.TransferLeadership → TransferLeadershipToServer → handleTransferLeadership` path checks only leader state and raft's transfer acceptance (`internal/raftengine/etcd/engine.go:1743-1771`) — it does **not** consult `pendingConfigs`. Surfacing the field in `Status` alone therefore does not stop a forwarded transfer from racing an Add/Remove on the remote leader. PR2 must add a **server-side guard on the executing leader**: `handleTransferLeadership` rejects the transfer with a new `errLeadershipTransferConfChangePending` when `len(e.pendingConfigs) > 0`, before calling `rawNode.TransferLeader`. **The read must acquire `e.pending.Lock()` (claude round-3 #2):** running on the event loop is *not* by itself enough to make `e.pendingConfigs` consistent for a reader, because `cancelPendingConfig` deletes entries under `e.pending.Lock()` from a **non-event-loop goroutine** — the `ctx.Done()` arm of `submitAdminEx` (`internal/raftengine/etcd/engine.go:1366-1367`, `cancelPendingConfig` at `:3158-3165`). Reading `len(e.pendingConfigs)` without the lock is a data race under `-race`. The guard therefore snapshots `hasPending := len(e.pendingConfigs) > 0` under `e.pending.Lock()` and rejects when true. The same locking applies to the `PendingConfChange bool` field on `Status` (OQ-5): the `refreshStatus`/`Status()` snapshot path must read `len(e.pendingConfigs)` under `e.pending.Lock()` before writing the cached `e.status`. The balancer maps the rejection to `leaderbalance_skipped_total{reason=conf_change}` and retries next cycle. This makes the conf-change exclusion enforceable regardless of which node executes the transfer, which fits the §3.4 ownership rule (the executing leader owns the safety gate). **OQ-12** records the guard placement (in `handleTransferLeadership` vs. a check in the `RaftAdmin.TransferLeadership` server wrapper). |
There was a problem hiding this comment.
Check raft's pending config, not request-local state
When a conf-change is proposed and leadership changes before that entry is applied, the new leader will not have the old leader's pendingConfigs entry because this map is populated only by the local admin request and resolved by the encoded request id. etcd/raft still carries the pending conf-change across the new leader (pendingConfIndex is set conservatively on leadership), so this guard would allow a transfer during exactly the in-flight membership change it is meant to exclude. The execution-time check needs to inspect raft's pending/unapplied conf-change state inside the engine, not only len(e.pendingConfigs).
Useful? React with 👍 / 👎.
| |---|---|---|---| | ||
| | **PR0 (prerequisite, §1.1a)** | **Multi-node multi-group topology.** Either (a) extend the bootstrap/flag surface so a group's voter set can span more than one node at startup (per-group members in `--raftGroups` or a companion `--raftBootstrapMembers`-style flag that lifts the `len(groups)==1` guard, `main.go:742-748`), or (b) document + harness the runtime `AddVoter`/`PromoteLearner` composition path (`internal/raftengine/engine.go:225-233`) as the supported way to build multi-voter groups, with an integration topology (extend `scripts/run-jepsen-m5-local.sh`'s single-member-per-group model, `:5-20`) that stands up groups with voters on ≥2 nodes. **OQ-9** decides (a) vs (b). | Bootstrap/flag parsing tests (or AddVoter-composition harness); integration smoke that a group has voters on ≥2 distinct nodes and `TransferLeadershipToServer` succeeds between them. | Yes — unblocks the scheduler; useful on its own (true multi-node multi-group). | | ||
| | **PR1** | Leader-count **observation + metrics only**: a leader-identity reader sibling to `publishLeaderTerms` (`main.go:2126-2143`) that builds the per-node leader-count map on the default-group leader, and exports the `leaderbalance_leaders_per_node` gauge + `leaderbalance_enabled=0`. **No transfers.** Pure observability — shippable even before PR0 (it just observes a trivially-balanced single-voter topology). | Unit: leader-map construction from a fake set of per-group `State()`/`Leader()`; gauge registration. | Yes — observe-only, zero behavior change. | | ||
| | **PR2** (blocked on PR0) | The pure **policy function** (§3.3) + the **scheduler loop** + **transfer execution** behind `--leaderBalance` (default OFF) + runtime kill switch + startup grace (§3.1) + slog + the transfer-result metrics. Local-leader transfer path (case 1, §3.4) and forwarded path (case 2, single preferred target, leader-validated, §3.5). Eligibility: leader-count map **seeded with all voters at zero** (§3.2) + "no-leader skip" + per-group/global cooldowns + startup grace + conf-change skip + `LeadTransferee != 0` skip. The transfer **executes on the group leader**, which owns **all** target catch-up + liveness validation via a **new engine-side guard** — add `TransferLeadershipToServerIfEligible(ctx, id, addr, maxLag)` to the engine `Admin` interface; its `handleTransferLeadership` body checks `Progress.Match >= Commit - maxLag` **and** `Progress.RecentActive` against the leader-only `e.rawNode.Status().Progress[target]` *before* `rawNode.TransferLeader`, returning `errLeadershipTransferTargetNotCaughtUp` (mirrors `handlePromoteLearner`, §3.5). **Progress exposure lands here at the guard, applied identically to case 1 (local engine) and case 2 (remote leader) — there is no separate case-1-only `Status.Progress` exposure.** The same `handleTransferLeadership` also gets the execution-time conf-change guard (rejects when `len(e.pendingConfigs) > 0`, read **under `e.pending.Lock()`** — `cancelPendingConfig` mutates from a non-event-loop goroutine, §3.5 / OQ-12) plus the `PendingConfChange` field on `Status` (also read under `e.pending.Lock()`, OQ-5). The forwarded path must route `RaftAdmin.TransferLeadership` to the gated `…IfEligible` form (§3.5). | Unit (table-driven): policy decisions over crafted leader maps (imbalance threshold, source/target choice, tie-breaks, strict-spread-decrease guard, cooldown gating, startup-grace gating). Engine-level: `TransferLeadershipToServerIfEligible` rejects a lagging / not-`RecentActive` target *before* `rawNode.TransferLeader` (no proposals paused) and the conf-change guard rejects under a pending conf-change — both run under `-race`. Integration: **the PR0 multi-voter-group topology** — force all leaders onto one node, enable `--leaderBalance`, assert convergence to ≤ `⌈groups/nodes⌉` per node. Kill-switch + leader-change-reset + startup-grace tests. | After PR0 — completes count-based balancing. | |
There was a problem hiding this comment.
Move SQS refusal exclusion into first transfer PR
In deployments with partitioned FIFO groups, PR2 is marked as completing count-based balancing and can be enabled before PR3, but its eligibility list omits the SQS/refusal exclusion that this document says is required to avoid guaranteed ping-pong when a group is transferred to a node that immediately refuses it. Either include the partitioned-FIFO/pinning exclusion in the first transfer-issuing milestone or mark PR2 as not independently usable for clusters with those groups.
Useful? React with 👍 / 👎.
|
@claude review Round-4 dispositions (pushed Finding #1 —
Finding #2 — "change the existing handler" sub-option silently gates Additional consistency item I caught while resolving #1/#2 (not in the review, flagged for transparency): Positively-verified items from the review: left unchanged (lastContact/leader-zero No new codex/gemini inline findings at HEAD |
|
Claude finished @bootjp's task in 8m 36s —— View job Review:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1304cfb394
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **`Status.LastContact` is the WRONG signal for target liveness and is identically 0 on a leader (codex P2 / claude #1).** Round 2 specified a case-1 liveness gate of `Status.LastContact < electionTimeout/2`. That is ineffective: `lastContactFor` returns `0` unconditionally when the local engine is `StateLeader` (`internal/raftengine/etcd/engine.go:3481-3483`), and the field it derives from is a **single** value (`e.lastLeaderContactFrom` / `e.lastLeaderContactAt`, set in `recordLeaderContact`, `:3540-3546`) describing *this follower's contact with ITS leader* — there is no per-peer leader→follower tracking. In a case-1 transfer the balancer host **is** the leader, so its own `Status().LastContact` is always `0`, `0 < electionTimeout/2` is trivially true, and the gate would pass even for a partitioned/dead target. The correct per-peer leader-side liveness signal is **`Progress.RecentActive`** (`go.etcd.io/raft/v3 tracker/progress.go:87-91` — "true if the progress is recently active … reset to false after an election timeout"), which is in the leader-only `Progress` map alongside `Match`. | ||
| - **Ownership rule (unchanged, now exhaustive):** target catch-up *and* liveness validation is the **responsibility of the node that executes the transfer — which is always the group leader** (case 1: the balancer host itself; case 2: the remote source-group leader the request is forwarded to, §3.4). The group leader is the only node that holds the live `Progress` map. The balancer host does **not** assert "target X is caught up / live"; it expresses a **load preference** (the least-loaded eligible voter, or an ordered preference list least-loaded→most-loaded) and the executing leader applies catch-up + liveness as the final, authoritative gate. This holds for **both** sub-cases — there is no longer a separate "case-1 pre-filter on `Status`" path, because the only place the signal exists is inside the engine on the leader. | ||
| - **Concrete eligibility predicate, enforced by a new engine-side guard `handleTransferLeadership` runs *before* `rawNode.TransferLeader`** (mirroring how `handlePromoteLearner` already gates on `Progress.Match`, `internal/raftengine/etcd/engine.go:1674-1688`). For the resolved `target.NodeID`, read `e.rawNode.Status().Progress[target.NodeID]` on the event loop and require **all** of: | ||
| - **Caught up:** `progress.Match >= status.Commit - maxLag` (entry units; default **`--leaderBalanceMaxTargetLag = 1024` entries**). On the executing leader both `progress.Match` and the commit index are local reads. A transfer to a target more than `maxLag` behind is rejected before it can pause proposals. **`maxLag = 1024` is a conservative default that keeps the transfer window short at moderate append throughput; it is NOT a fixed "one election timeout" of log — at high write rates 1024 entries is a small fraction of an election timeout (and would reject most targets), at low rates it is many.** Operators running high-throughput workloads should raise it toward roughly `raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg` entries (the log a healthy follower can fall behind within one election timeout at that throughput); tighter values risk never finding an eligible target under steady write load, looser values risk a longer write stall on the new leader's group while raft catches the target up. |
There was a problem hiding this comment.
Saturate the commit-minus-lag check
With the proposed uint64 predicate, clusters whose commit index is below the default lag budget hit unsigned underflow (status.Commit - maxLag becomes a huge value), so even fully caught-up followers are rejected and the PR2 convergence test can never move leaders in a fresh/low-traffic cluster. Make the threshold saturating, e.g. compare against zero when Commit <= maxLag, before relying on this guard.
Useful? React with 👍 / 👎.
| 2. **The over-loaded source node is a *different* node.** The balancer host is a follower of that group and cannot initiate the transfer locally. It **forwards** the request to the source node's leader of that group, using the in-tree per-group `RaftAdmin` gRPC service: dial the source node's group listener (`rt.spec.address` for that group is in the group's `Configuration`, available via `engine.Configuration(ctx)`, `internal/raftengine/engine.go:213-215`) and call `RaftAdmin.TransferLeadership` with `TargetId`/`TargetAddress` set to the chosen target **and the new `max_lag` field set to `--leaderBalanceMaxTargetLag`** (see the proto extension below). The receiving node's engine is the leader of that group, so `handleTransferLeadership` accepts it — and because `max_lag > 0` the handler routes to the **gated** `TransferLeadershipToServerIfEligible` (§3.5), so the remote leader's engine-side catch-up/liveness guard runs there. | ||
|
|
||
| **Forward-path routing mechanism — decision: extend `RaftAdminTransferLeadershipRequest` with an optional `max_lag` field (resolves OQ-4).** Today `RaftAdminTransferLeadershipRequest` carries only `target_id = 1` and `target_address = 2` (`proto/service.proto:248-251`), and the handler maps unconditionally to the **un-gated** `TransferLeadershipToServer` (`internal/raftadmin/server.go:163`). There is no field on the wire that can carry the balancer's `maxLag` to the remote leader, so the case-2 forwarded path has **no way to reach the gated `…IfEligible` engine method** without a wire change. PR2 adds an optional `uint64 max_lag = 3` field (the next free field number) to `RaftAdminTransferLeadershipRequest`: | ||
| - **Backward compatible by proto3 construction.** An absent field decodes to the zero value, so any existing client — every `cmd/raftadmin` call, which builds the request with only `TargetId`/`TargetAddress` set (`parseTransferTarget`, `cmd/raftadmin/main.go:370-378`) — sends `max_lag = 0` implicitly. Old and new binaries interoperate in both directions; no `cmd/raftadmin` change is required. |
There was a problem hiding this comment.
Gate mixed-version forwards before claiming safety
This compatibility claim is unsafe for rolling upgrades: an old RaftAdmin server ignores the new unknown max_lag field and its current handler unconditionally calls the ungated TransferLeadershipToServer (internal/raftadmin/server.go:155-172), so a new balancer forwarding to an old group leader can still transfer to a lagging target and stall writes. The design should require all group leaders to advertise/support the gated handler before enabling the balancer, or otherwise fall back to no-op/retry instead of treating old/new interop as safe.
Useful? React with 👍 / 👎.
P1 (codex): fix the §3.5 catch-up predicate uint64 underflow. The naive 'Progress.Match >= Commit - maxLag' wraps when Commit < maxLag (young / fresh / low-traffic group), rejecting every target and stalling the balancer. Specify the underflow-safe 'Commit - Match <= maxLag' form (Match <= Commit holds for any non-self follower) and the equivalent saturating-threshold reformulation. Propagate the corrected form to all references (§3.4, §3.5, §4 PR2, §6, OQ-13) and add the young-group test case (Commit < maxLag, caught-up target accepted) to §5 + PR2 tests. P2 (codex): gate mixed-version forwards. An old RaftAdmin server drops the unknown max_lag proto3 field and executes the forwarded transfer ungated. Add (1) structural ordering — the gated handler and the only emitter (balancer) ship in the same PR2 binary, so single-binary skew is impossible; (2) an operator rollout rule (do not enable --leaderBalance until all nodes run the gated binary), mirroring the encryption mutator rolling-restart discipline; (3) an optional transfer_gate capability pre-check (OQ-14) reusing the in-tree GetCapability fan-out precedent, with a peer_ungated skip metric. Folded into §3.4, §3.6, §4 PR2, §6. Status stays Proposed; doc-only.
|
@claude review Round 5 — addressed both codex findings (pushed codex P1 "Saturate the commit-minus-lag check" — ACCEPTED (verified).
codex P2 "Gate mixed-version forwards" — ACCEPTED (verified).
Verification notes: proto3 unknown-field semantics confirmed against the in-tree The latest claude review job (round 5, 1304cfb) was still in progress at push time with no findings posted; nothing to fold in. Re-requesting review on |
|
Claude finished @bootjp's task in 6m 3s —— View job Review:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 16710a56a0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **`Status.LastContact` is the WRONG signal for target liveness and is identically 0 on a leader (codex P2 / claude #1).** Round 2 specified a case-1 liveness gate of `Status.LastContact < electionTimeout/2`. That is ineffective: `lastContactFor` returns `0` unconditionally when the local engine is `StateLeader` (`internal/raftengine/etcd/engine.go:3481-3483`), and the field it derives from is a **single** value (`e.lastLeaderContactFrom` / `e.lastLeaderContactAt`, set in `recordLeaderContact`, `:3540-3546`) describing *this follower's contact with ITS leader* — there is no per-peer leader→follower tracking. In a case-1 transfer the balancer host **is** the leader, so its own `Status().LastContact` is always `0`, `0 < electionTimeout/2` is trivially true, and the gate would pass even for a partitioned/dead target. The correct per-peer leader-side liveness signal is **`Progress.RecentActive`** (`go.etcd.io/raft/v3 tracker/progress.go:87-91` — "true if the progress is recently active … reset to false after an election timeout"), which is in the leader-only `Progress` map alongside `Match`. | ||
| - **Ownership rule (unchanged, now exhaustive):** target catch-up *and* liveness validation is the **responsibility of the node that executes the transfer — which is always the group leader** (case 1: the balancer host itself; case 2: the remote source-group leader the request is forwarded to, §3.4). The group leader is the only node that holds the live `Progress` map. The balancer host does **not** assert "target X is caught up / live"; it expresses a **load preference** (the least-loaded eligible voter, or an ordered preference list least-loaded→most-loaded) and the executing leader applies catch-up + liveness as the final, authoritative gate. This holds for **both** sub-cases — there is no longer a separate "case-1 pre-filter on `Status`" path, because the only place the signal exists is inside the engine on the leader. | ||
| - **Concrete eligibility predicate, enforced by a new engine-side guard `handleTransferLeadership` runs *before* `rawNode.TransferLeader`** (mirroring how `handlePromoteLearner` already gates on `Progress.Match`, `internal/raftengine/etcd/engine.go:1674-1688`). For the resolved `target.NodeID`, read `e.rawNode.Status().Progress[target.NodeID]` on the event loop and require **all** of: | ||
| - **Caught up — saturating / underflow-safe comparison (resolves codex round-5 P1):** the lag check must be written so it never underflows `uint64`. The naïve form `progress.Match >= status.Commit - maxLag` is **wrong**: `status.Commit` and `maxLag` are both `uint64`, so when `status.Commit < maxLag` — the steady state of any *young* group whose commit index has not yet reached the default `1024`-entry budget (a fresh cluster, a freshly-split group, a low-traffic group) — `status.Commit - maxLag` wraps to a value near `2^64`. The predicate then rejects **every** target, including a follower that is perfectly caught up (`Match == Commit`), so the balancer is permanently stuck and the PR2 convergence test can never move a leader on a fresh/low-traffic cluster. The guard must instead compare the **gap on the left**, which cannot underflow: `status.Commit - progress.Match <= maxLag`. This subtraction is safe because **`progress.Match <= status.Commit` always holds for any follower** — a leader only advances its own commit index after a quorum (including itself) has the entry, and a follower's `Match` is the highest index the leader has confirmed *replicated to that follower*, which is bounded above by what the leader itself holds and therefore by the commit index; etcd/raft never records a `Match` ahead of the leader's last index, let alone its commit. (The one value that can momentarily exceed the leader's commit is the leader's *own* `Match`, but the guard never runs against self — `resolveTransferTarget` rejects self, §3.5 — so the subtrahend is always a real follower's `Match ≤ Commit`.) Equivalently, an explicit saturating threshold `threshold := status.Commit; if status.Commit > maxLag { threshold = status.Commit - maxLag } else { threshold = 0 }` then `progress.Match >= threshold` yields the same result without relying on the `Match ≤ Commit` argument; **the `Commit - Match <= maxLag` form is preferred** because it is a single branch-free expression and self-documents the "gap ≤ budget" intent. Units are log entries; default **`--leaderBalanceMaxTargetLag = 1024` entries**. On the executing leader both `progress.Match` and the commit index are local reads. A transfer to a target more than `maxLag` behind is rejected before it can pause proposals. **`maxLag = 1024` is a conservative default that keeps the transfer window short at moderate append throughput; it is NOT a fixed "one election timeout" of log — at high write rates 1024 entries is a small fraction of an election timeout (and would reject most targets), at low rates it is many.** Operators running high-throughput workloads should raise it toward roughly `raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg` entries (the log a healthy follower can fall behind within one election timeout at that throughput); tighter values risk never finding an eligible target under steady write load, looser values risk a longer write stall on the new leader's group while raft catches the target up. Note the young-group case makes the gate **permissive** on a fresh cluster (`Commit - Match` is small, usually 0), which is correct: there is no large backlog to catch up on, so any live voter is a safe target — the danger the gate exists to prevent (a target tens of thousands of entries behind) only arises once the log is long. |
There was a problem hiding this comment.
Avoid underflow when Match exceeds Commit
In a leader with uncommitted entries, a follower's Progress.Match can be ahead of the leader's committed index (for example, the target has acknowledged entries that have not reached quorum or are from a prior term), so status.Commit - progress.Match can still underflow. In that scenario the proposed branch-free gate rejects a target that is at least caught up to the commit point, which can make balancing fail spuriously; use a saturating gap such as zero when Match >= Commit rather than relying on Match <= Commit.
Useful? React with 👍 / 👎.
|
|
||
| **Forward-path routing mechanism — decision: extend `RaftAdminTransferLeadershipRequest` with an optional `max_lag` field (resolves OQ-4).** Today `RaftAdminTransferLeadershipRequest` carries only `target_id = 1` and `target_address = 2` (`proto/service.proto:248-251`), and the handler maps unconditionally to the **un-gated** `TransferLeadershipToServer` (`internal/raftadmin/server.go:163`). There is no field on the wire that can carry the balancer's `maxLag` to the remote leader, so the case-2 forwarded path has **no way to reach the gated `…IfEligible` engine method** without a wire change. PR2 adds an optional `uint64 max_lag = 3` field (the next free field number) to `RaftAdminTransferLeadershipRequest`: | ||
| - **Backward compatible by proto3 construction.** An absent field decodes to the zero value, so any existing client — every `cmd/raftadmin` call, which builds the request with only `TargetId`/`TargetAddress` set (`parseTransferTarget`, `cmd/raftadmin/main.go:370-378`) — sends `max_lag = 0` implicitly. Old and new binaries interoperate in both directions; no `cmd/raftadmin` change is required. | ||
| - **Single routing rule in the server handler.** `RaftAdmin.TransferLeadership` (`internal/raftadmin/server.go:155-172`) routes on the field: when `req.MaxLag > 0` it calls the **gated** `TransferLeadershipToServerIfEligible(ctx, target.ID, target.Address, req.MaxLag)`; when `req.MaxLag == 0` (the existing default for every operator call) it keeps calling the **un-gated** `TransferLeadershipToServer` exactly as today. |
There was a problem hiding this comment.
Preserve gated routing when max lag is zero
Routing req.MaxLag == 0 to the ungated method makes --leaderBalanceMaxTargetLag=0 a dangerous footgun: an operator trying to require an exactly caught-up target would instead disable the balancer's catch-up/liveness guard and allow transfers to lagging followers that can stall writes. Since zero is a valid strict lag budget, distinguish automated-vs-operator requests by field presence or an explicit mode bit rather than by the numeric value alone.
Useful? React with 👍 / 👎.
Round-6 review found the round-5 P1 fix introduced an incorrect safety argument: the preferred `Commit - Match <= maxLag` form underflows uint64 when Match > Commit, which CAN stably occur for a real follower in any cluster with quorum > 2 (e.g. a 5-node group: a follower acks entry N and Progress.Match advances on the MsgAppResp, but CommitIndex lags until a quorum holds N). The doc also wrongly claimed the two subtraction forms are equivalent and that Match <= Commit is a universal invariant. Promote the explicit saturating form as THE recommended implementation target across every place the predicate appears (3.4/3.5/4 PR2/5/6/OQ-13): threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold Mark BOTH bare subtraction one-liners as incorrect, each with its underflow direction, and remove the equivalence/invariant claim. Extend the 5 test list to cover the Match > Commit counterexample (Commit=80, Match=100 -> accepted) and widen the rapid property invariant range from Match in [0, Commit] to Match in [0, Commit + maxLag] so the underflow case is actually generated. Doc-only. Status: Proposed.
|
@claude review Round-6 targeted correction (a19dc1f, doc-only, Status: Proposed). You were right: the round-5 P1 fix introduced an incorrect safety argument. The preferred |
|
Claude finished @bootjp's task in 5m 10s —— View job Review:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a19dc1f325
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **Source node** = the node with the **most** leaders (tie-break: lexicographically smallest node ID, so the decision is reproducible in tests). | ||
| - **Group to move** = among the groups led by the source node that are *eligible* (§3.5), pick deterministically (tie-break by group ID ascending). | ||
| - **Target node** = among nodes that are **voters of that group**, **healthy** (§3.5), and **not policy-excluded for that group** (§3.5), the one with the **fewest** leaders (tie-break: smallest node ID). The transfer is skipped if moving the leader to the target would not strictly reduce the spread (i.e. `targetCount + 1 > sourceCount - 1`), so a transfer never makes things worse or merely shuffles a tie. |
There was a problem hiding this comment.
Fall back to the next eligible source node
When the most-loaded node has no transferable groups because its leaders are pinned, in cooldown, mid-conf-change, or otherwise ineligible, this fixed source choice makes the scheduler skip even if the second-most-loaded node can still reduce the spread. For example, with counts A=5, B=4, C=0 and all A-led groups pinned, moving one eligible B-led group to C reduces the spread from 5 to 4, but this policy never considers B after selecting A. The decision function should rank candidate source nodes by count and choose the first eligible move that strictly reduces spread, rather than stopping at the absolute max node.
Useful? React with 👍 / 👎.
|
@claude review Round 8 (e789c20) — doc-only, addresses the two remaining codex P2 items and the claude clarity note. Status stays Chosen resolution for codex P2 #3393836081 ("Preserve gated routing when max lag is zero"): Option (b) — a separate
codex P2 #3393909075 ("Fall back to the next eligible source node"): fixed in §3.3. The policy no longer stops at the single most-loaded node — it iterates source candidates in descending leader-count order (tie-break smallest node ID) and takes the first with an eligible claude clarity note (property-test reference notation): clarified §5 line — the No newer top-level review comments after 2026-06-11T07:00:00Z. Doc-only; pushed |
|
Claude encountered an error after 6m 48s —— View job Review:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e789c2046b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **No conf-change in flight — enforced on the executing leader, not only at the balancer's decision time.** Skip a group whose membership is changing — a transfer racing a conf-change can land on a member about to be removed. Two distinct enforcement points are required, because the balancer's decision-time read and the transfer's execution can happen on different nodes (case 2, §3.4): | ||
| - **Decision-time observability (resolves OQ-5).** Surface the existing per-group pending-config state through the exported `Status`: the etcd backend already tracks `pendingConfigs` (`map[uint64]adminRequest`, guarded by `e.pending`, `internal/raftengine/etcd/engine.go:390`, `:563`, populated in `storePendingConfig` `:3138`), but it is not on the public `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`). Add a `PendingConfChange bool` field to `Status` populated from `len(e.pendingConfigs) > 0` (mirroring how `LeadTransferee` was added to the struct), rather than leaking etcd internals via `rawNode.BasicStatus().Config.PendingConfIndex` past the engine boundary or adding a dedicated `Admin` method. This lets the balancer skip a group whose leader it *can* see has a pending conf-change. In **case 1** (the balancer host leads the group) this read is local and authoritative. | ||
| - **Execution-time guard on the group leader (required for case 2).** Decision-time `Status` is **not sufficient for the forwarded path**: in case 2 the balancer is a *follower* of the target group, so its local `Status().PendingConfChange` reflects only the local node and not the remote leader's `pendingConfigs`. And the current `RaftAdmin.TransferLeadership → TransferLeadershipToServer → handleTransferLeadership` path checks only leader state and raft's transfer acceptance (`internal/raftengine/etcd/engine.go:1743-1771`) — it does **not** consult `pendingConfigs`. Surfacing the field in `Status` alone therefore does not stop a forwarded transfer from racing an Add/Remove on the remote leader. PR2 must add a **server-side guard on the executing leader**: `handleTransferLeadership` rejects the transfer with a new `errLeadershipTransferConfChangePending` when `len(e.pendingConfigs) > 0`, before calling `rawNode.TransferLeader`. **The read must acquire `e.pending.Lock()` (claude round-3 #2):** running on the event loop is *not* by itself enough to make `e.pendingConfigs` consistent for a reader, because `cancelPendingConfig` deletes entries under `e.pending.Lock()` from a **non-event-loop goroutine** — the `ctx.Done()` arm of `submitAdminEx` (`internal/raftengine/etcd/engine.go:1366-1367`, `cancelPendingConfig` at `:3158-3165`). Reading `len(e.pendingConfigs)` without the lock is a data race under `-race`. The guard therefore snapshots `hasPending := len(e.pendingConfigs) > 0` under `e.pending.Lock()` and rejects when true. The same locking applies to the `PendingConfChange bool` field on `Status` (OQ-5): the `refreshStatus`/`Status()` snapshot path must read `len(e.pendingConfigs)` under `e.pending.Lock()` before writing the cached `e.status`. The balancer maps the rejection to `leaderbalance_skipped_total{reason=conf_change}` and retries next cycle. This makes the conf-change exclusion enforceable regardless of which node executes the transfer, which fits the §3.4 ownership rule (the executing leader owns the safety gate). **Unlike the catch-up/liveness gate (which is selected by the gated engine method — i.e. by `gated = true` on the wire — so it touches only the balancer path), the conf-change guard is *unconditional* — it applies to every transfer regardless of method or `maxLag`, including the operator `cmd/raftadmin` path.** That asymmetry is deliberate and is *not* a re-run of the catch-up-gate contradiction: a transfer racing a membership change can land leadership on a member about to be removed, which is a correctness hazard for *any* caller — there is no legitimate operator reason to force a transfer into an in-flight conf-change, whereas there *is* a legitimate operator reason to force a transfer onto a lagging follower (disaster recovery). So the conf-change guard correctly has no opt-out, while the catch-up gate correctly does (via `gated = false` / the un-gated method — **not** via `maxLag = 0`, which under the gated method is the strictest budget). **OQ-12** records the guard placement (in `handleTransferLeadership` vs. a check in the `RaftAdmin.TransferLeadership` server wrapper). | ||
| - **No transfer already in flight for the group.** Skip a group whose leader already reports `Status().LeadTransferee != 0` (`internal/raftengine/engine.go:80-82`, populated on the leader, `etcd/engine.go:1431`). Issuing a second `TransferLeadershipToServer` while the first is still settling in `waitForLeadershipTransfer` would hit `errLeadershipTransferRejected` (`etcd/engine.go:1765-1768`) and log a spurious failure. In case 1 the balancer reads this locally; in case 2 the remote leader enforces it at submit time. Counted under `leaderbalance_skipped_total{reason=transfer_in_flight}`. |
There was a problem hiding this comment.
Add an execution-time in-flight-transfer guard
In the forwarded case where the balancer is only a follower of this group, its local Status().LeadTransferee cannot prove the remote leader is idle, and etcd/raft does not reject a second transfer to a different target: in v3.6 MsgTransferLeader aborts the previous transfer and starts the new one (raft.go:1637-1645). Unless PR2 also adds an explicit BasicStatus().LeadTransferee != 0 rejection inside handleTransferLeadership before rawNode.TransferLeader, this “remote leader enforces it at submit time” statement is false and the scheduler can cancel an existing/manual transfer and extend the write-stall window.
Useful? React with 👍 / 👎.
|
@claude review |
|
Claude finished @bootjp's task in 6m 47s —— View job Review:
|
|
@claude review Disposition: must-fix resolved — codex round-9 P2 (execution-time in-flight-transfer guard). Doc-only, Root cause confirmed against the tree. Fix (§3.5 'No transfer already in flight' bullet, rewritten): PR2 adds an execution-time guard in Scope decision (stated + justified): unconditional — applies to BOTH the gated balancer path and the ungated Consistent edits: §3.5 bullet (false claims corrected + decision-time/execution-time split + scope justification), §3.5 case-2 sub-case, skip-metric reason Five-lens (doc-only): data-loss — none (no FSM/write change). Concurrency — the change closes a silent-abort race; guard is event-loop-local, no new lock. Performance — one extra |
|
Claude finished @bootjp's task in 3m 5s —— View job Review:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 14222f2744
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **Default group is balanceable, with a called-out blip.** The default group is *not* excluded — leaving it pinned to one node defeats half the point (that node also carries HLC/catalog work). But moving the default group's leadership has two transient effects the operator must understand: | ||
| - **HLC ceiling renewal** restarts on the new leader. The ceiling is proposed every `hlcRenewalInterval` (1 s) with a `hlcPhysicalWindowMs` (3 s) window (`kv/coordinator.go:37-46`, `:644-669`); because the window (3 s) exceeds the renewal interval (1 s) and a newly elected leader clamps `Next()` to `max(wall, ceiling)`, a clean transfer does **not** let the new leader issue a timestamp inside the old leader's window — the safety invariant holds across a transfer exactly as it does across a natural election. | ||
| - **Lease reads:** a transfer invalidates the lease on the old leader (`RegisterLeaderLossCallback → lease.invalidate`, `kv/coordinator.go:131`, `kv/sharded_coordinator.go:584`), so the new leader's *first* read takes the slow `LinearizableRead` path (one ReadIndex round-trip) before its lease warms again — a single read-latency blip, not a correctness issue. This is identical to the blip on any natural election. The cooldowns keep these blips rare. | ||
| - **Recommendation (resolving OQ-3): balance the default group LAST.** The default group is only added to the eligible-source set for a cycle when **every other group's source node is already at or below `⌈groups / nodes⌉`** — i.e. the cheap, side-effect-free moves are exhausted first and the default group is touched only if it is still the thing keeping the cluster imbalanced. This defers the two transient effects above (HLC-renewal restart + one lease-read blip, plus the recursive balancer-host churn risk of §3.1, since the balancer rides the default-group leader) until they are the *only* remaining way to reduce the spread. An operator who wants the default group pinned entirely uses `--leaderBalancePinGroups` (§3.6) to exclude it. This is strictly safer than "balance first" and strictly more useful than "pin by default" (which would leave the HLC/catalog node permanently over-loaded). See §8 OQ-3. |
There was a problem hiding this comment.
Allow default group after ineligible groups block progress
With this rule, a cluster can remain permanently imbalanced when the non-default groups on the overloaded node are pinned/refused/in cooldown/mid-conf-change but the default group is the only eligible move. For example, if A leads the default group plus several pinned non-default groups and B/C lead none, “every other group's source node” is still above ceil, so the default group is never admitted even though moving it would strictly reduce the spread. The “balance last” policy should trigger after there is no eligible non-default reducing move, not based only on the other groups’ current source counts.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 67aeb05. §3.5 line 174 is now: pass 1 considers non-default groups only; pass 2 admits the default group only if pass 1 found no eligible reducing move. So the worked example (A={default, P1..P4 all pinned}, B=∅, C=∅, ceil=2) now issues a default-group transfer instead of deadlocking — eligibility-based admission, not a raw-count gate that pinned groups can never satisfy. §3.3 cross-reference updated to point at the two-pass shape; the inner deterministic-iteration semantics, the strict-spread-reducing guard, and "one transfer per cycle" are unchanged.
…w counts Codex round-7 P2 on PR #953 (review 4475297729, line 174): the round-6 admission rule for the default group — "every other group's source node is at or below ceil" — is computed on raw leader counts, which include groups that are categorically ineligible to move this cycle (pinned via --leaderBalancePinGroups, refusing leadership via the SQS htfifo exclusion, in per-group cooldown, in a conf-change, or with a transfer already in flight). When the over-loaded node leads a mix of "default + pinned non-default", the raw counts never drop and the default group is never admitted even though it is the only remaining eligible reducing move — the cluster deadlocks under heavy pinning. Replace the raw-count gate with a two-pass iteration of §3.3: pass 1: non-default candidates only pass 2: default group admitted, but only if pass 1 found nothing This preserves "balance last" (default is touched only after the non-default candidate set is exhausted this cycle) while making the gate progress-preserving under arbitrary pinning patterns. The worked example A={default,P1..P4 all pinned}, B=∅, C=∅, ceil=2 now correctly issues a default-group transfer instead of skipping. §3.3 cross-reference updated to point at the new two-pass shape; the inner deterministic-iteration semantics, the strict-spread-reducing guard, and "one transfer per cycle" are unchanged. No code change required (doc-only, this is a proposal).
|
@claude review — addressed PR #953 round 7 review (3 findings, all from HEAD-commit 14222f2):
|
|
Claude finished @bootjp's task in 2m 52s —— View job Review:
|
…-3 rule update Claude round-10 review on PR #953 (commit 67aeb05) — two doc consistency issues introduced by the two-pass default-group admission fix: 1. §3.6 leaderbalance_skipped_total{reason} enum was missing no_eligible_move — the label §3.5 line 176 emits when both passes exhaust the eligible source set without finding a strict- spread-reducing move (imbalance >= threshold but every candidate is pinned / in cooldown / mid-conf-change / unhealthy / refused). Distinct from below_threshold (spread < imbalanceThreshold). Adding to the enum so the metric registration test catches it and operators can configure dashboards against the complete enumerated list. 2. §8 OQ-3 still described the superseded raw-count gate; updated to the eligibility-based two-pass rule that 67aeb05 introduced, with a backpointer to the round-7 P2 motivation (raw counts can be permanently above ceil under pinning, deadlocking default- group admission).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5e4e779d97
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - **Imbalance trigger (hysteresis):** act only when `max(count) - min(count) >= imbalanceThreshold` (default **2**). A spread of 1 is the unavoidable remainder when `groups` is not divisible by `nodes` and must never trigger a transfer — otherwise the scheduler ping-pongs forever against the arithmetic. A threshold of 2 is the smallest value that is provably stable at the optimum. | ||
| - **Source / group / target choice (deterministic, with fall-through to the next eligible source — resolves codex round-6 P2 "fall back to the next eligible source node"):** the policy does **not** stop at the single most-loaded node. It **iterates candidate source nodes in descending leader-count order** (tie-break: lexicographically smallest node ID, so the order is reproducible in tests) and, for each, looks for an eligible `(group, target)` pair that **strictly reduces the spread**; it takes the **first** such pair and stops. This is the fix for the case where the absolute-max node has *no* transferable group — all its leaderships are pinned (§3.5), in per-group cooldown (§3.3), or mid-conf-change (§3.5) — yet the **second**-most-loaded node has an eligible move that still reduces the spread. The earlier "pick the single max node and skip the cycle if it has no eligible group" rule wrongly left the cluster imbalanced in exactly that case (e.g. `A=5` all-pinned, `B=4`, `C=0` → moving a `B`-led group to `C` drops the spread from 5 to 4, but the max-only rule never considers `B`). Concretely, for each candidate source `s` taken in descending-count order: | ||
| - **Group to move** = among the groups led by `s` that are *eligible* (§3.5: not pinned, not in per-group cooldown, no conf-change in flight, no transfer already in flight), pick deterministically (tie-break by group ID ascending). | ||
| - **Target node** = among nodes that are **voters of that group**, **healthy** (§3.5), and **not policy-excluded for that group** (§3.5), the one with the **fewest** leaders (tie-break: smallest node ID). The candidate move is accepted only if it **strictly reduces the spread** (`targetCount + 1 > sourceCount - 1` ⇒ reject), so a transfer never makes things worse or merely shuffles a tie. |
There was a problem hiding this comment.
Compute the post-move global spread
When there are multiple least-loaded nodes, this source/target-only predicate can accept a move that does not strictly reduce the cluster spread. For example, with counts A=5 (no eligible moves), B=4, C=0, D=0, moving one B group to C passes this check (1 > 3 is false) but leaves D at 0, so the global spread remains 5 instead of decreasing. Since the policy and tests rely on the “strictly reduces the spread” invariant, the decision function should compare the full leader-count map before/after the candidate move, not only sourceCount and targetCount.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in commit 99d8ea5. §3.3 now requires strict reduction of the GLOBAL spread, computed from the full post-move count map (m' = m with [source]--, [target]++), not just source/target counts. The worked example A=5 (pinned), B=4, C=0, D=0 is now correctly rejected — the post-move map {A=5, B=3, C=1, D=0} still has spread=5 (A-D). The §5 convergence test gains this exact regression case.
|
@claude review — round 10 (2 corrections + minor note):
|
|
Claude finished @bootjp's task in 4m 21s —— View job Review:
|
Codex round-8 P2 on PR #953 (commit 5e4e779, line 93) — the source/target-only spread check 'targetCount + 1 > sourceCount - 1' misses cases where a third node holds the min or max of the leader-count map. Example: A=5 (pinned), B=4, C=0, D=0. Moving a B-led group to C is admitted by the source/target-only rule (1 > 3 is false), but the post-move map {A=5, B=3, C=1, D=0} still has spread = 5 (A - D), unchanged — the policy claims to strictly reduce the spread but doesn't. Fix: compute the full post-move count map and reject when spread(m') >= spread(m). The implementation cost is trivial (one slice copy plus two integer mutations on the same leaderCountMap the policy already builds in §3.2), but it correctly catches the class of cases where a non-source / non-target node holds the spread's min or max — exactly the regression the §5 convergence test now exercises as A=5 pinned, B=4, C=D=0.
|
Claude finished @bootjp's task in 6m 10s —— View job Review:
|
| Candidate | Pre-spread | Post-move map | Post-spread | Accepted? |
|---|---|---|---|---|
| B → C | 5 (A=5 − D=0) | {A=5, B=3, C=1, D=0} | 5 (A=5 − D=0) | Rejected |
| B → D | 5 (A=5 − D=0) | {A=5, B=3, C=0, D=1} | 5 (A=5 − C=0) | Also rejected |
Both candidates fail the full-map check. The phrase "must instead pick D as the target" incorrectly implies D is the correct, chosen answer. An implementer following this literally would return B→D — a transfer the full-map rule itself forbids, because C remains at 0 and keeps the global minimum pinned at 0.
Correct behavior: the algorithm iterates targets in ascending node-ID order, tries C (rejected — D at 0 keeps global min), then tries D (also rejected — C at 0 keeps global min), and falls through to no_eligible_move for source B. No transfer is issued this cycle.
Replace the parenthetical with one that states the expected outcome unambiguously:
"(Both candidates fail: B→D gives {A=5, B=3, C=0, D=1}, spread = 5 — C stays at 0, keeping the global minimum unchanged. B has no eligible target in this scenario; the iteration falls through to
no_eligible_move. The two-tied-minimum shape is the case the full-map check exists for: with a pinned node holding the global max and two nodes at 0, no single transfer can reduce the global spread — the scheduler correctly defers until conditions change or the default-group pass (§3.5) offers an alternative.)"
Also add an explicit unit test entry to §5 for this shape:
"4-node case — both tied-min targets fail (codex round-8 P2): A=5 (all A-led groups pinned), B=4, C=0, D=0, an eligible B-led group whose voter set includes C and D → try B→C (spread stays 5, rejected), try B→D (spread stays 5, rejected), no eligible B target → policy emits
no_eligible_move. Contrast with the 3-node shape A=5-pinned, B=4, C=0 (same test at §5 line 209): B→C gives {A=5, B=3, C=1}, spread = 4, which IS accepted — the 3-node case has no D at 0 to keep the global min pinned."
The 3-node A=5, B=4, C=0 → B→C: spread 5→4 case remains correct and tests a different (valid-transfer) shape. The 4-node case tests the no_eligible_move branch and is the critical regression introduced by the round-8 change. Fix this →
What's well done in this revision
Full-map check is correctly specified. reject when spread(m') >= spread(m) where spread = max(m) - min(m), computed over the full post-move count map m', is exactly the right gate. Any transfer that passes has spread(m') < spread(m) — which is strictly less than the pre-move spread by construction. The guard is O(nodes) per candidate (one slice copy + two integer mutations on the map built in §3.2), no new data structure required.
§5 property test "a transfer always strictly decreases the spread" is now correct. The round-12 review recommended changing this to "never increases" because the old source/target-only guard could admit spread-unchanged moves. With the full-map guard, that concern is resolved: the policy ONLY issues a transfer when spread(m') < spread(m), so any issued transfer strictly decreases global spread. The stronger "strictly decreases" claim holds, and the property test need not be weakened.
The 3-node source-fallthrough case in §5 (line 209) is still correct. A=5(pinned), B=4, C=0 (3 nodes) → B→C: post-move {A=5, B=3, C=1} → spread = 5 − 1 = 4. Strict decrease. Admitted. ✓ This is the scenario where the source-fallthrough (round-6) + full-map check (round-8) interact correctly: A has no eligible groups (pinned), B is the next source, and B→C passes the full-map check because there is no 4th node holding the global minimum at 0.
Companion case (c) in §5 ("A=5 all-pinned, B=4 eligible but its only candidate target D=4 would not strictly reduce the spread → fall through past B") correctly exercises the case where the single target is at the same count as the source. This is distinct from the 4-node tied-minimum case (two nodes at 0, both failing the gate) — together they cover the two ways B can have no eligible target.
The no_eligible_move metric label (§3.6, line 188) is correctly distinguished from below_threshold — the former fires when imbalance ≥ threshold but no strict-reducing move exists (e.g. the 4-node pinned-max case); the latter fires when the spread itself is below the threshold and no transfer is even considered. The label round-11 added is the right observability tool here.
All prior-round findings remain intact: saturating-threshold predicate (both underflow directions), voter-map zero-seed, conf-change guard under e.pending.Lock(), errLeadershipTransferInFlight unconditional guard, bool gated = 3/uint64 max_lag = 4 proto extension, mixed-version rollout rule, Progress.RecentActive replacing LastContact, TransferLeadershipToServerIfEligible engine method, SQS exclusion in PR2, eligibility-based two-pass default-group admission.
| Branch
…-12) Claude round-12 review on PR #953 (commit 99d8ea5): the §3.3 parenthetical claimed 'the policy must instead pick D as the target,' but D also fails the full-map spread check (post-move {A=5, B=3, C=0, D=1} has spread = 5 because C stays at 0). The correct outcome for the 4-node A=5-pinned, B=4, C=D=0 shape is no_eligible_move — neither C nor D can break the global spread when the other tied-min stays at 0. Updated the parenthetical to state the no_eligible_move outcome unambiguously and added the 3-node A=5-pinned, B=4, C=0 contrast (where B→C IS accepted, spread 5→4) so the test plan covers both the eligible-transfer and the no_eligible_move branches of the full-map gate.
|
@claude review — round 12 (1 doc correction on commit 99d8ea5):
|
|
Claude finished @bootjp's task in 10m 37s —— View job Review:
|
| Candidate | Pre-spread | Post-move map | Post-spread | Accepted? |
|---|---|---|---|---|
| B → C | 5 (A=5 − D=0) |
{A=5, B=3, C=1, D=0} |
5 (A=5 − D=0) |
Rejected ✓ |
| B → D | 5 (A=5 − C=0) |
{A=5, B=3, C=0, D=1} |
5 (A=5 − C=0) |
Rejected ✓ |
Both candidates fail the full-map check (spread(m') >= spread(m)). The iteration exhausts B's targets and emits no_eligible_move. The explanation in §3.3 — "no single transfer can reduce the global spread because both bottom nodes need to be lifted but only one transfer per cycle is allowed" — is correct.
3-node contrast verified (§5 line 209): A=5 pinned, B=4, C=0 → B→C: post-move {A=5, B=3, C=1}, spread = 5 − 1 = 4 < 5 → accepted ✓. No D at 0 to keep the global minimum pinned; the full-map check passes.
Minor note: §3.3 forward reference to §5 is not yet true
§3.3 line 93 states: "The §5 convergence test now exercises this exact A=5 pinned, B=4, C=D=0 shape as the no_eligible_move regression case AND the 3-node A=5-pinned, B=4, C=0 contrast where B→C IS accepted."
Verified against §5 (lines 208–218):
- 3-node contrast (B→C accepted): IS in §5 at line 209 as the main case of the "Source fall-through (codex round-6 P2)" bullet. ✓
- 4-node tied-minimum case (B→C AND B→D rejected →
no_eligible_move): is NOT in §5. A search forC=D=0,D=0, andno_eligible_move.*regressreturns only line 93 (§3.3), not any line in the §5 unit or integration test sections. The companion cases listed at §5 line 209 are (a), (b), (c) — there is no (d) for the 4-node shape.
The §5 "Source fall-through" companion cases do not include this shape. The integration "convergence test" (§5 line 216) is a 3-node end-to-end test (⌈N/3⌉ convergence) and cannot exercise a 4-node policy scenario as a named unit test case.
Two options to fix:
-
Add companion case (d) to the "Source fall-through" bullet in §5 (preferred — locks down the invariant for PR2 implementers reading §5 as their test spec):
"(d) 4-node tied-minimum (codex round-8 P2 + claude round-12 correction):
A=5(all-pinned),B=4,C=0,D=0, an eligibleB-led group with votersCandD→ tryB→C:{A=5,B=3,C=1,D=0}spread=5 (rejected —Dstays at 0); tryB→D:{A=5,B=3,C=0,D=1}spread=5 (rejected —Cstays at 0); no eligibleBtarget →no_eligible_move. Contrast: the 3-node main case (above) has no second bottom node to pin the global minimum, soB→Creduces spread 5→4 and is accepted." -
Update §3.3 line 93 to say "the PR2 unit tests will cover this exact shape" rather than claiming §5 currently covers it.
This is non-blocking for the proposal merge (tests are for PR2, and the §3.3 parenthetical itself is correct). But the forward reference should be corrected before PR2 starts so the test spec is accurate.
What's well done in this revision
§3.3 full-map check text is now complete and correct. The updated parenthetical states both failure cases (B→C leaves D=0 as global min; B→D leaves C=0 as global min), explains why neither reduces the global spread (two bottom nodes, one transfer per cycle), and gives the correct outcome (no_eligible_move). The "precisely the case the full-map check exists for" characterization is exactly right — the full-map check over the source/target-only guard is motivated by exactly this two-tied-minimum-with-pinned-max shape.
All prior-round findings remain intact. The saturating-threshold predicate (both underflow directions, both subtraction forms correctly rejected), voter-map zero-seed, conf-change guard under e.pending.Lock(), errLeadershipTransferInFlight unconditional guard, bool gated = 3/uint64 max_lag = 4 proto extension, mixed-version rollout rule, Progress.RecentActive replacing LastContact, TransferLeadershipToServerIfEligible engine method, SQS exclusion in PR2, eligibility-based two-pass default-group admission, and leaderbalance_unobservable_groups gauge all remain correctly stated.
| Branch
…minimum Claude round-13 review on PR #953 (commit 23c5a07) — minor forward-reference correction: §3.3 line 93 claimed §5 covers the 4-node tied-minimum case (A=5 pinned, B=4, C=D=0 → no_eligible_move), but §5's 'Source fall-through' companion cases only listed (a), (b), (c). Implementer reading §5 as the PR2 test spec would have missed this regression case. Added companion case (d) to the Source fall-through bullet: 4-node tied-minimum (A=5 all-pinned, B=4, C=0, D=0) → try B→C ({A=5,B=3,C=1,D=0} spread=5 rejected), try B→D ({A=5,B=3,C=0,D=1} spread=5 rejected), no eligible B target → no_eligible_move. Explicit contrast with the 3-node main case (no second bottom node, B→C reduces 5→4 accepted) so the test spec exercises both halves of the full-map check's two-tied-minimum-with-pinned-max regression.
|
@claude review — round 14 (1 non-blocking minor on commit 23c5a07):
All prior findings remain resolved per round-13 verification. |
|
Claude finished @bootjp's task in 5m 4s —— View job Review:
|
| Candidate | Pre-spread | Post-move map | Post-spread | Full-map check | Result |
|---|---|---|---|---|---|
| B → C | 5 (A=5 − D=0) |
{A=5, B=3, C=1, D=0} |
5 (A=5 − D=0) |
5 >= 5 → reject |
Rejected ✓ |
| B → D | 5 (A=5 − C=0) |
{A=5, B=3, C=0, D=1} |
5 (A=5 − C=0) |
5 >= 5 → reject |
Rejected ✓ |
Both targets exhaust the search on source B → no_eligible_move ✓
3-node contrast (main case in §5 line 209, not modified):
A=5 (pinned), B=4, C=0 → B→C: {A=5, B=3, C=1}, spread = 5 − 1 = 4 < 5 → accepted ✓. No D at 0 to keep the global minimum pinned; the 4-node shape is precisely the counterexample where the difference matters.
§3.3 forward reference now matches §5. The clause "The §5 convergence test now exercises this exact A=5 pinned, B=4, C=D=0 shape as the no_eligible_move regression case AND the 3-node A=5-pinned, B=4, C=0 contrast where B→C IS accepted" (line 93) is now satisfied: case (d) covers the former (via the unit test table) and the main Source fall-through case covers the latter. ✓
Attribution in case (d) — "(codex round-8 P2 + claude round-12/13 correction)": codex round-8 identified the need for a full post-move spread check; claude round-12 corrected the §3.3 parenthetical that wrongly said "the policy must instead pick D as the target" (D also fails); claude round-13 identified that the case was still missing from §5. All three attributions are accurate. ✓
Minor observation (no doc change required)
§3.3 labels the case as "§5 convergence test" when it is in the unit test section.
Line 93: "The §5 convergence test now exercises this exact A=5 pinned, B=4, C=D=0 shape..."
The 4-node tied-minimum case is in the unit test bullet (§5 first bullet, "Unit (table-driven, co-located *_test.go)"), not in the integration/convergence test (§5 third bullet, "Integration (the PR0 multi-voter-group topology, §1.1a)"). This label was already present before round 14 and is not a factual correctness issue (the test is in §5 and will exercise the shape). An implementer reading §5 will find the case regardless of which sub-section the cross-reference implies. Non-blocking; noting for completeness.
What's well done in this revision
Case (d) description is complete and precise. It states both rejection reasons explicitly ("D stays at 0, full-map check fails" for B→C; "C stays at 0" for B→D), names the correct outcome (no_eligible_move), and explicitly contrasts with the 3-node main case (which is accepted because there is no second bottom node). The closing sentence — "the 4-node shape exists precisely to exercise the full-map check's two-tied-minimum-with-pinned-max regression that the source/target-only guard would have admitted" — correctly states what the test guards against.
Property test claim "a transfer always strictly decreases the spread" (§5 line 210) is confirmed correct. The full-map check (reject when spread(m') >= spread(m)) guarantees that any issued transfer has spread(m') < spread(m) by definition. The stronger "strictly decreases" claim holds, the property test asserts new_spread < old_spread, and case (d) is consistent with this: the transfer is not issued (no_eligible_move), so the "strict decrease" invariant is never violated. The round-12 correction (promoting "never increases" to "strictly decreases" once the full-map check was added) remains accurate. ✓
All prior-round findings remain intact. Verified key elements:
- Saturating threshold (§3.5):
threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold— not either bare subtraction form. ✓ - Both underflow directions documented (young-group
Commit < maxLag; quorum->2Match > Commit). ✓ - Voter-map zero-seed before counting (§3.2). ✓
- conf-change guard under
e.pending.Lock()(§3.5, OQ-12). ✓ errLeadershipTransferInFlightunconditional guard beforerawNode.TransferLeader(§3.5, OQ-15). ✓bool gated = 3/uint64 max_lag = 4proto extension withgated-based routing (§3.4, OQ-4). ✓--leaderBalanceMaxTargetLag = 0is valid strictest gate, not a disable (§3.4, §3.6). ✓- Mixed-version rollout rule (§3.4, §3.6). ✓
Progress.RecentActivereplacingStatus.LastContact(§3.5). ✓TransferLeadershipToServerIfEligibleengine method keeping Progress internal (§3.5, OQ-13). ✓- Two-pass default-group eligibility (pass 1 non-default; pass 2 default only if pass 1 found no eligible move, §3.5 / OQ-3). ✓
leaderbalance_unobservable_groupscompanion gauge (§3.6). ✓no_eligible_movelabel in §3.6 metric enum, semantically distinct frombelow_threshold. ✓- OQ-3 updated to eligibility-based two-pass rule (§8). ✓
- Source fall-through policy in §3.3 (iterate in descending-count order, first eligible strict-reducing pair wins). ✓
- Forwarded-path case-2 integration test scenario (§5). ✓
The document is ready for implementation. All 15 open questions are resolved with concrete recommendations or explicitly deferred with justification.
| Branch
Issue 1: correct stale line anchors in kv/sharded_coordinator.go - keyviz sampler observeMutation: 1795-1824 -> 1841-1846 - RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961 Issue 2: soften companion-doc references to in-flight PR #955 form, matching the #945/#951/#953 branch-reference style (3 sites + map line). Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/ resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition. Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader, caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7 for the single-node->multi-node live cutover / rolling-upgrade strategy.

Summary
Doc-only PR. Proposes a TiKV-style leader balance scheduler that automatically spreads Raft-group leaderships across nodes so one node does not end up leading every group (
docs/design/2026_06_11_proposed_leader_balance_scheduler.md). Per the repo's design-doc-first workflow, no implementation code is included.Goal (user's words): "TiKVのようにリーダー以外のノードに別のRaftグループのリーダーをおいて負荷を均等化したい" — put other groups' leaders on the non-leader nodes to even out the load. elastickv has no PD, so the equivalent of PD's
balance-leaderscheduler is embedded in the default-group leader.Key design points
RegisterLeaderLossCallback.State()/Leader()reads (every node runs every group), sibling to the existingpublishLeaderTermsloop — no peer-polling RPC. No-leader groups are skipped.>= 2(the spread-of-1 remainder must not trigger); deterministic source (most) / target (fewest) with node/group-ID tie-breaks; strict-spread-decrease guard; one transfer per cycle; per-group (30s) + global (10s) cooldowns on a monotonic clock; load-weighted balancing is explicit future work.TransferLeadershipToServer. A follower cannot initiate a transfer (handleTransferLeadershiprejects witherrLeadershipTransferNotLeader), so if the over-loaded source node is not the balancer host, the request is forwarded over the existing per-groupRaftAdmingRPC service to that group's leader.--leaderBalancePinGroups; default group is balanceable but the doc calls out the HLC-renewal restart + one lease-read blip; rely on cooldowns rather than a hard SplitJob interlock (M3 is not a dependency).--leaderBalance+ runtime kill switch (atomic), bounded-cardinality metrics (leaderbalance_*, fixed-enum/per-node labels only), structured slog (group_id,from,to, ...), and the existingraftadminCLI as the manual escape hatch.Open questions
htfifocapability and balance among capable nodes.RaftAdmingRPC vs. a purpose-built internal RPC.Test plan
Doc-only — no code, no tests. The doc itself specifies the future test strategy (pure-policy table tests,
rapidproperty invariants, 3-node convergence / anti-ping-pong integration, kill-switch + leader-change reset, and a Jepsen "no new anomalies with balancer on" run).Summary by CodeRabbit