diff --git a/docs/design/2026_06_11_proposed_leader_balance_scheduler.md b/docs/design/2026_06_11_proposed_leader_balance_scheduler.md new file mode 100644 index 00000000..7e6c3e41 --- /dev/null +++ b/docs/design/2026_06_11_proposed_leader_balance_scheduler.md @@ -0,0 +1,262 @@ +# Leader Balance Scheduler — spreading Raft-group leaderships across nodes + +Status: Proposed +Author: bootjp +Date: 2026-06-11 + +Sibling: [2026_06_11_proposed_hotspot_split_milestone3_automation.md](2026_06_11_proposed_hotspot_split_milestone3_automation.md) — shares the "scheduler lives on the default-group leader, default OFF behind a flag, runtime kill switch, bounded-cardinality metrics, leader-local state reset on election" conventions. This doc reuses those conventions but does **not** depend on M3 landing. + +## 1. Background + +### 1.1 The problem + +elastickv runs multiple Raft groups in one process (`--raftGroups id=addr,id=addr,…`, `shard_config.go:61-99`; default group is the lowest ID, `shard_config.go:386-397`). `buildShardGroups` iterates the parsed `groups` once per process and constructs a `raftGroupRuntime` per group (`main.go:786-889`), each with its own engine and its own gRPC listener at `rt.spec.address` (`main.go:1606-1620`). Leadership of each group is elected independently by etcd/raft, so there is **no mechanism today that spreads leaderships across nodes**. After a rolling restart, a partition heal, or simply unlucky election timing, one node can end up leading every group while its peers lead none. That node then carries all the leader-only work — write proposals, HLC ceiling renewal, lease reads, OCC timestamp issuance, route-catalog proposes — while the rest of the cluster sits idle. The user's explicit goal: "TiKVのようにリーダー以外のノードに別のRaftグループのリーダーをおいて負荷を均等化したい" — put other groups' leaders on the non-leader nodes so the load is even. + +### 1.1a Topology prerequisite — the multi-node multi-group voter set does not exist yet (PR0) + +The scheduler's whole objective — moving a group's leadership from an over-loaded node to a different node — **requires a group whose voter set spans more than one node**. That topology is **not deployable today**, and this is a hard prerequisite, not an implementation detail: + +- `--raftGroups` declares only `groupID=host:port` per group, where `host:port` is **this node's own listener** for that group; `groupSpec` carries a single `address`, not a per-group member list (`shard_config.go:14-17`, `:61-99`). +- `resolveBootstrapServers` **rejects** `--raftBootstrapMembers` whenever `len(groups) != 1` (`ErrBootstrapMembersRequireSingleGroup`, `main.go:742-748`). So a multi-node initial membership can only be declared for a *single-group* deployment. +- `buildRuntimeForGroup` passes the **same** `bootstrapServers` (which is `nil` in any multi-group config) to **every** group's `factory.Create` via `Peers`, with `LocalAddress: group.address` (`multiraft_runtime.go:246-254`). Each group in a multi-group process therefore bootstraps as a **single-member** cluster (just the local node). +- The M5 Jepsen runner records the same limitation in prose: today's `validateShardRanges` / `buildShardGroups` "only support a 'single process hosts all groups' model — separate processes per group fail validation or race on Raft listeners"; it launches "ONE process hosting BOTH **single-member** groups," and states "True distributed multi-group is M6+ work" (`scripts/run-jepsen-m5-local.sh:5-20`). + +**Consequence:** the premise "every node is a member of every group" holds only for the single-process multi-group demo, where every group has exactly one voter (the local node) — so `TransferLeadershipToServer` has **no other voter to transfer to**. There is no currently-deployable/testable topology in which the scheduler can move a leader between nodes. + +The in-tree primitives to *compose* such a topology at runtime already exist — `AddVoter` / `AddLearner` / `PromoteLearner` on the engine and the `RaftAdmin` service (`internal/raftengine/engine.go:225-233`, `internal/raftadmin/server.go:85-120`, `cmd/raftadmin/main.go:202-285`) — so a multi-node group *can* be built by adding voters to each group after bootstrap. But there is no startup wiring (`--raftGroups` peer lists / per-group `--raftBootstrapMembers`) that produces this topology directly, and no integration harness that stands up multi-voter groups across processes. + +This design therefore inserts an explicit topology milestone **PR0** ahead of the *transfer-issuing* scheduler PRs (see §4). **Recommendation (resolving OQ-9): option (a) — extend the bootstrap/flag surface** so `--raftGroups` (or a companion per-group members flag) accepts a multi-node voter set per group, lifting the `len(groups)==1` guard in `resolveBootstrapServers` (`main.go:742-748`). This makes the multi-node multi-group topology first-class at startup and avoids a manual `AddVoter` dance in every test harness; the `AddVoter`/`PromoteLearner` composition path (option (b)) remains valid for *live* topology expansion (membership changes after bootstrap) but is **not** the gating prerequisite. See §8 OQ-9 for the full rationale. + +Only the **transfer-issuing** scheduler PRs (PR2–PR3) are **blocked on PR0** — until a group can have voters on more than one node, there is nothing to transfer. **PR1 (observation + metrics only) is *not* blocked on PR0**: it issues no transfers, so it ships even against today's single-voter-per-group topology (where it observes a trivially-balanced map). **OQ-9** records the bootstrap-extension-vs-AddVoter-composition decision. + +### 1.2 TiKV / PD comparison + +In TiKV the Placement Driver (PD) runs a **balance-leader scheduler**: it periodically inspects how many region leaders each store holds and issues `TransferLeader` operators to move leaderships off the most-loaded store onto the least-loaded one, with operator limits and store-score hysteresis to avoid thrashing. The v1 objective is count-based (leader count per store), with later load/size weighting layered on. elastickv has no PD — there is no external control plane process. We therefore **embed the equivalent of PD's balance-leader scheduler inside the default-group leader** (the same seat that already hosts HLC ceiling renewal and the M3 auto-split scheduler), and drive transfers through the same `TransferLeadership` machinery PD uses on TiKV. + +### 1.3 What is already in-tree (the primitives we build on) + +- **Engine transfer API.** The `Admin` interface exposes `TransferLeadership(ctx)` and `TransferLeadershipToServer(ctx, id, address)` with a documented non-blocking / goroutine-offload contract (`internal/raftengine/engine.go:247-248`, and `RegisterLeaderAcquiredCallback` at `:249-271`). +- **etcd backend implementation.** `TransferLeadership` / `TransferLeadershipToServer` submit an admin request onto the single-threaded event loop (`internal/raftengine/etcd/engine.go:1317-1331`); `handleTransferLeadership` (`:1743-1771`) resolves the target, **rejects the call when the local node is not leader** (`errLeadershipTransferNotLeader`, `:1754-1757`), calls `rawNode.TransferLeader`, and surfaces an immediate `errLeadershipTransferRejected` **only if raft *silently dropped* the request** — i.e. the target is a learner, equals self, or has **no Progress entry at all** (`:1765-1768`). **Crucially it does NOT reject a voter that is merely behind:** etcd/raft v3.6 accepts a transfer to any non-learner, non-self voter with a Progress entry, sets `leadTransferee`, and `sendAppend`s until the target catches up — pausing proposals meanwhile (`go.etcd.io/raft/v3 raft.go:1631-1662`, `:1296-1299`; see §3.5). So this path has **no catch-up gate today** and a lagging target stalls writes for up to the transfer timeout — the gap §3.5 closes with a new engine-side `Progress.Match`/`RecentActive` guard. `waitForLeadershipTransfer` / `checkLeadershipTransfer` (`:1377-1439`) poll `Status.LeadTransferee` and fail closed if raft aborts the transfer (e.g. election timeout elapses before the target catches up). Default target selection picks the first non-self voter (`defaultTransferTargetLocked`, `:3852-3860`); named target selection rejects self (`namedTransferTargetLocked`, `:3862-3879`). +- **An automated-transfer precedent already in-tree.** `main_sqs_leadership_refusal.go` registers a per-group leader-acquired observer that *refuses* leadership of any group hosting a partitioned FIFO queue when the binary lacks the `htfifo` capability, by calling `TransferLeadership` on acquire. It demonstrates the exact patterns this design needs: idempotent refusal (a second call is a no-op once a transfer is in flight, `:104-109`), goroutine-offload of the blocking admin call from the non-blocking callback (`:88-95`), and the TOCTOU-safe install (read `State()`, fire if already leader, register, re-check, `:97-120`). It is wired in `main.go:439-442` via `installSQSLeadershipRefusalAcrossGroups` (`main_sqs_leadership_refusal.go:178-217`), which iterates every `raftGroupRuntime` and installs the hook per group hosting a partitioned queue. +- **Per-group remote transfer over gRPC.** Each group's engine is exposed on its own listener via `internalraftadmin.RegisterOperationalServicesWithInterceptor(ctx, gs, rt.engine, …)` (`main.go:1610`; the server is engine-generic, one engine per `RaftAdmin` service, `internal/raftadmin/server.go:13-47`, `internal/raftadmin/health.go:33-52`). `cmd/raftadmin/main.go` is the operator client (`leadership_transfer`, `leadership_transfer_to_server`, `:210-217`, `:359-378`). This is the existing in-tree mechanism for invoking a transfer on a group whose leader is a *different* node. +- **Per-group leader observation precedent.** `startKeyVizLeaderTermPublisher` / `publishLeaderTerms` (`main.go:2088-2143`) already iterate `runtimes` on a ticker and read each group engine's `Status()` (term) through the `snapshotEngine()` accessor (`multiraft_runtime.go:41-48`). The same iteration over `rt.engine.State()` / `rt.engine.Leader()` (`internal/raftengine/engine.go:131-138`) gives the balancer its observation input for free. +- **Leader-local reset hook.** Coordinator subsystems already reset on leadership change via `LeaseProvider.RegisterLeaderLossCallback` (`kv/coordinator.go:131`, `kv/sharded_coordinator.go:584`), the same mechanism the lease invalidation uses. + +## 2. Goals and Non-Goals + +### 2.1 Goals + +1. Automatically spread Raft-group leaderships across nodes so no single node leads a disproportionate share of groups (TiKV count-based balance-leader, v1). +2. **Default OFF** behind `--leaderBalance`, with a runtime kill switch and bounded-cardinality metrics. +3. Deterministic, unit-testable policy: given a leader-count map, the decision is a pure function. +4. Never fight existing leader-pinning policy (SQS leadership refusal); never transfer to an unhealthy / lagging follower; never act mid-conf-change. +5. Reuse the in-tree transfer mechanism (`TransferLeadership(ToServer)` and the per-group `RaftAdmin` gRPC service) — no new wire surface for v1. + +### 2.2 Non-Goals + +1. **Load- or size-weighted balancing.** v1 balances *leader count* only. Weighting by keyviz per-group load (the same signal M3 consumes, `keyviz/sampler.go`) is explicit future work (§7, future PR), not v1. +2. **Raft group creation / membership orchestration *by the balancer*.** The balancer only moves leadership among the voters a group already has; it never adds/removes members (that is `raftadmin add_voter` / `remove_server`). Note: standing up the multi-node voter set the balancer needs *is* a prerequisite (PR0, §1.1a) — but that membership work is a separate, balancer-independent milestone, not something the running scheduler performs. +3. **Replica (peer) rebalancing / data movement.** Moving where a group's *replicas* live is out of scope; only *leadership* moves. (Region/replica balance is TiKV's separate balance-region scheduler; not in scope here.) +4. **Cross-process coordination beyond Raft.** The balancer runs in-process on the default-group leader; there is no external PD. +5. **Heterogeneous group membership across nodes.** The scheduler assumes every node that participates in the default group is also a member (voter or learner) of every other Raft group — i.e. **homogeneous group membership across nodes**. The observation loop (§3.2, the balancer reads each co-hosted group's local `Leader()`/`Configuration()`) and the case-2 forward path (§3.4, the balancer dials the source group's leader) both depend on this: if some groups lived on a subset of nodes that excludes the default-group leader (e.g. a 5-node cluster where groups G3–G5 have a 3-node voter set not including the balancer host), the balancer would have no local view of those groups and no member seat to observe or forward from. Deployments with heterogeneous group membership are **out of scope for v1**; the homogeneous topology is exactly what PR0 (§1.1a) stands up. + +## 3. Design questions answered + +### 3.1 Who balances (decision: the default-group leader) + +The scheduler runs **only on the current default-group leader**, for three reasons that all already hold for other control-plane work: + +- It is the natural control-plane seat: HLC ceiling renewal already runs there (`ShardedCoordinator.RunHLCLeaseRenewal`, `kv/sharded_coordinator.go:1914-1948`; the proposer that issues the ceiling is leader-gated), and route-catalog proposes go through the default group (`distribution/`, CLAUDE.md "Control plane"). +- The M3 auto-split scheduler is placed there too (sibling doc §4.1, §7.6) — co-locating the leader balancer keeps "one scheduler seat per cluster" rather than electing yet another coordinator. +- There is exactly one default-group leader at a time, so the scheduler is singleton without extra leader election. + +**State reset on election.** All balancer state (per-group cooldown deadlines, the global cooldown deadline, the last observed leader map) is **leader-local, in-memory, not Raft-replicated**. On a default-group leadership change the deposed leader's scheduler goroutine stops (via `RegisterLeaderLossCallback`, `kv/coordinator.go:131`) and the new leader starts with empty state. The worst case of a lost cooldown is a *too-soon* extra transfer (mildly wasteful, self-correcting next cycle), never an unsafe action — so non-replication is the right cost/safety trade (same rationale as M3 §7.6). We state this explicitly so it is not mistaken for an oversight. + +**Startup grace period (anti-storm on balancer-host churn).** The state reset above has a sharp edge: because the balancer runs *on* the default-group leader, any churn of the default group's own leadership relocates the balancer to a fresh host with **empty cooldown state**, which would otherwise let the new balancer fire a transfer on its very first cycle. If the default-group leader is flapping (transient network instability, CPU starvation, or — recursively — a balancer-initiated transfer of the default group itself, §3.5), each newly elected leader would re-evaluate from scratch and could trigger another transfer, producing a **transfer storm** that keeps the cluster from stabilizing. To break this loop, the scheduler arms a **startup grace deadline** the moment it acquires the balancer role (default `--leaderBalanceStartupGrace`, default = `max(leaderBalanceGlobalCooldown, one leaderBalanceInterval)` ≈ a few election timeouts): until that deadline passes the loop **observes and emits metrics but issues no transfer**. The grace deadline uses the same `internal/monoclock` source as the cooldowns (§3.3). This guarantees the cluster has settled after a default-group election before the fresh balancer makes its first move, regardless of why leadership moved. The grace is re-armed on every acquisition of the balancer role, so a flapping default leader can never accumulate enough stable time to fire. A skip during the grace window is counted under `leaderbalance_skipped_total{reason=startup_grace}`. + +### 3.2 Observation (decision: local per-group `State()` / `Leader()`, no polling RPC) + +Once the PR0 topology (§1.1a) is in place so each group has voters on more than one node, the default-group leader is itself a member (leader or follower) of every group it co-hosts and can read **the local engine's view of who leads each group** with zero network cost: iterate `runtimes`, and for each `rt.snapshotEngine()` read `State()` and `Leader()` (`internal/raftengine/engine.go:131-138`). (In the single-process multi-group demo this still works, but each group has exactly one voter, so the count map is trivially balanced and no transfer is ever issued — the loop is observe-only there.) This is exactly what `publishLeaderTerms` already does for term (`main.go:2126-2143`); the balancer adds a sibling reader for leader identity. From the per-group `Leader().ID` the scheduler builds the **leader-count map** `nodeID → number of groups this node currently leads`. + +- **Seed the count map with every voter at zero before counting leaders.** This is the crux that makes the policy fire on the very imbalance the scheduler exists to fix. If the map were built *only* from observed `Leader().ID` values, then the worst-case starting state — node A leads every group, nodes B and C lead none (after a rolling restart or unlucky election timing, §1.1) — produces `{A: N}` and omits B and C entirely. `max(count) - min(count)` is then `N - N = 0` (B/C are invisible), so the `>= imbalanceThreshold` trigger (§3.3) never fires and the policy can never pick B or C as the under-loaded target. The scheduler must therefore **first seed the map with count 0 for every voter node ID across all co-hosted groups**, then increment from observed leaders. The voter set is read locally with zero RPC: for each co-hosted group call `rt.snapshotEngine().Configuration(ctx)` (`internal/raftengine/etcd/engine.go:1242-1250`, returns the local cached `ConfState`-derived `Configuration{Servers []Server}` on **every** node, leader or follower) and take each `Server` whose `Suffrage == "voter"` (`SuffrageVoter`, `internal/raftengine/etcd/peers.go:27`). The union of those voter IDs across all groups is the seed key set (learners are excluded — they cannot become leaders and `handleTransferLeadership` rejects them as targets, §3.5). Without this seed the PR2 convergence test (force all leaders onto one node, §5) cannot pass, because the under-loaded nodes would never appear as candidate targets. +- **Unknown / no-leader groups are skipped (but their voters still seed the map).** A group whose local engine reports `Leader().ID == ""` (election in flight, or this node has no fresh contact) is *excluded from leader counting* for this cycle — the scheduler never acts on a group it cannot see a stable leader for — yet its voters are still part of the zero-seed above (so a node that currently leads nothing because all its groups are mid-election is still a visible target). This is the cheapest correct source; it relies only on the local Raft state the node already maintains, and a group mid-election is exactly the group we must not perturb. +- **Why not poll peers** (a `GetClusterOverview`-style fan-out, `adapter/admin_grpc.go:138-149`): unnecessary, because local Raft state already names each group's leader on every member, and a fan-out would add latency, a configured-endpoint dependency, and a fresh source of staleness. We stay local. (A future load-weighted policy that needs per-group load from *non-leader* nodes could reuse the keyviz cluster fan-out behind the same policy interface — noted in §7, mirrors M3 OQ-5.) + +### 3.3 Policy (decision: count-based, one transfer per cycle, hysteresis + cooldowns) + +The policy is a **pure function** `(leaderCountMap, eligibility, config, now) → at most one TransferDecision{groupID, fromNode, toNode}`. Pure-function shape makes the table-driven tests trivial (§8) and the leader-local reset a matter of dropping state. + +- **Objective:** minimize the spread of leader counts across nodes. The ideal per-node count is `⌈groups / nodes⌉` (ceil) at most. +- **Imbalance trigger (hysteresis):** act only when `max(count) - min(count) >= imbalanceThreshold` (default **2**). A spread of 1 is the unavoidable remainder when `groups` is not divisible by `nodes` and must never trigger a transfer — otherwise the scheduler ping-pongs forever against the arithmetic. A threshold of 2 is the smallest value that is provably stable at the optimum. +- **Source / group / target choice (deterministic, with fall-through to the next eligible source — resolves codex round-6 P2 "fall back to the next eligible source node"):** the policy does **not** stop at the single most-loaded node. It **iterates candidate source nodes in descending leader-count order** (tie-break: lexicographically smallest node ID, so the order is reproducible in tests) and, for each, looks for an eligible `(group, target)` pair that **strictly reduces the spread**; it takes the **first** such pair and stops. This is the fix for the case where the absolute-max node has *no* transferable group — all its leaderships are pinned (§3.5), in per-group cooldown (§3.3), or mid-conf-change (§3.5) — yet the **second**-most-loaded node has an eligible move that still reduces the spread. The earlier "pick the single max node and skip the cycle if it has no eligible group" rule wrongly left the cluster imbalanced in exactly that case (e.g. `A=5` all-pinned, `B=4`, `C=0` → moving a `B`-led group to `C` drops the spread from 5 to 4, but the max-only rule never considers `B`). Concretely, for each candidate source `s` taken in descending-count order: + - **Group to move** = among the groups led by `s` that are *eligible* (§3.5: not pinned, not in per-group cooldown, no conf-change in flight, no transfer already in flight), pick deterministically (tie-break by group ID ascending). + - **Target node** = among nodes that are **voters of that group**, **healthy** (§3.5), and **not policy-excluded for that group** (§3.5), the one with the **fewest** leaders (tie-break: smallest node ID). The candidate move is accepted only if it **strictly reduces the global spread** (codex round-8 P2 — earlier source/target-only rule `targetCount + 1 > sourceCount - 1` missed cases where a third node holds min or max): compute the **full post-move count map** `m'` by applying the candidate (`m'[source] = m[source] - 1`, `m'[target] = m[target] + 1`, all others unchanged) and reject when `spread(m') >= spread(m)` where `spread = max(m) - min(m)`. Example: `A=5` (all pinned), `B=4`, `C=0`, `D=0` → moving a B-led group to C is admitted by the source/target-only rule (`1 > 3` is false), but the post-move map `{A=5, B=3, C=1, D=0}` still has `spread = 5` (A - D), unchanged. The full-map rule rejects this move because it does **not** strictly reduce the spread. **Both candidate targets fail (claude round-12 correction)**: B→D gives `{A=5, B=3, C=0, D=1}`, spread = 5 (A - C); C stays at 0, keeping the global minimum pinned at 0. So neither C nor D is a valid target this cycle — the iteration exhausts B's targets and falls through to `no_eligible_move`. The two-tied-minimum shape with a pinned global max is precisely the case the full-map check exists for: no single transfer can reduce the global spread because both bottom nodes need to be lifted but only one transfer per cycle is allowed. The scheduler correctly defers until either A's pin lifts, the default-group pass (§3.5) offers an alternative, or natural cluster activity changes the count map. Implementation: the post-move map is the same `leaderCountMap` the policy already builds in §3.2; `m'` is one slice copy plus two integer mutations, no new state. The §5 convergence test now exercises this exact `A=5 pinned, B=4, C=D=0` shape as the `no_eligible_move` regression case AND the 3-node `A=5-pinned, B=4, C=0` contrast where B→C IS accepted (spread 5→4, no second bottom node to keep the global minimum pinned). + - If no `(group, target)` for `s` yields a strict-spread-reducing move, advance to the next source candidate. The first source with such a pair wins; if **no** source has one, the cycle issues no transfer (counted in the appropriate skip metric, §3.6). This preserves **one transfer per cycle** and full determinism — only the *order in which sources are tried* is added; tie-breaks at every level are unchanged. (The default-group "balance last" rule of §3.5 still applies, but now as a **two-pass** wrapper around this iteration: pass 1 considers only non-default groups; pass 2 admits the default group **only if pass 1 found no eligible reducing move**. This eligibility-based admission — not a raw-count threshold — is what makes the rule progress-preserving when non-default groups are pinned / refusing / in cooldown / mid-conf-change. See §3.5 for the full rule.) +- **One transfer per cycle.** At most one `TransferLeadershipToServer` is issued per evaluation cycle. A cluster-wide imbalance is corrected over several cycles, never in a burst (mirrors TiKV's operator limit and M3's `maxSplitsPerCycle`). +- **Per-group cooldown** (default **30 s**, ≥ a few election timeouts): after a transfer is issued for a group, that group is ineligible until the cooldown expires, so a single group is not bounced repeatedly while the new leader settles. +- **Global cooldown** (default **10 s**) after *any* transfer: gives the cluster time to re-stabilize and the next observation to reflect the completed move before another is scheduled. +- **Anti-ping-pong with natural elections:** cooldowns + the spread-strictly-decreases guard + the `>=2` threshold together ensure the scheduler does not chase a natural re-election (which momentarily changes counts) into an oscillation. Cooldown deadlines use a **monotonic clock** (`internal/monoclock`) so a wall-clock step cannot shorten/extend them (CLAUDE.md: no ordering-sensitive wall-clock use). + +### 3.4 Mechanism (decision: the balancer host issues the transfer; follower-initiated transfer is NOT possible, so requests to groups the host does not lead are FORWARDED over the per-group `RaftAdmin` gRPC service) + +The transfer must be initiated **on the current leader of the target group**: `handleTransferLeadership` rejects with `errLeadershipTransferNotLeader` whenever the local engine is not in `StateLeader` (`internal/raftengine/etcd/engine.go:1754-1757`). So a follower **cannot** initiate a transfer — this is a hard constraint, not a preference. + +The balancer host (the default-group leader) is a member of every group but is, in general, a **follower** of the group it wants to rebalance (indeed, the whole point is to move leadership *off* the over-loaded node, which may or may not be the balancer host). Two sub-cases: + +1. **The over-loaded source node is the balancer host itself** (it leads the target group). Then the balancer calls the gated `engine.TransferLeadershipToServerIfEligible(ctx, target.ID, target.Address, maxLag)` (§3.5) directly on the **local** runtime's engine for that group — the local engine is the leader, so its event-loop guard applies the `Progress.Match`/`RecentActive` catch-up + liveness check before transferring, and the goroutine-offload + idempotency patterns from `main_sqs_leadership_refusal.go:88-95` apply verbatim. +2. **The over-loaded source node is a *different* node.** The balancer host is a follower of that group and cannot initiate the transfer locally. It **forwards** the request to the source node's leader of that group, using the in-tree per-group `RaftAdmin` gRPC service: dial the source node's group listener (`rt.spec.address` for that group is in the group's `Configuration`, available via `engine.Configuration(ctx)`, `internal/raftengine/engine.go:213-215`) and call `RaftAdmin.TransferLeadership` with `TargetId`/`TargetAddress` set to the chosen target **and the new `gated = true` flag set with `max_lag = --leaderBalanceMaxTargetLag`** (see the proto extension below). The receiving node's engine is the leader of that group, so `handleTransferLeadership` accepts it — and because `gated = true` the handler routes to the **gated** `TransferLeadershipToServerIfEligible` (§3.5), so the remote leader's engine-side catch-up/liveness guard runs there (with `max_lag = 0` meaning "require the target fully caught up," the strictest gate). + +**Forward-path routing mechanism — decision: extend `RaftAdminTransferLeadershipRequest` with an explicit `bool gated` flag plus a `uint64 max_lag` field (resolves OQ-4 + codex round-6 P2 "preserve gated routing when max lag is zero").** Today `RaftAdminTransferLeadershipRequest` carries only `target_id = 1` and `target_address = 2` (`proto/service.proto:248-251`, `syntax = "proto3"` at `proto/service.proto:1`), and the handler maps unconditionally to the **un-gated** `TransferLeadershipToServer` (`internal/raftadmin/server.go:163`). There is no field on the wire that can carry the balancer's `maxLag` to the remote leader, so the case-2 forwarded path has **no way to reach the gated `…IfEligible` engine method** without a wire change. PR2 adds **two** new fields (the next free field numbers) to `RaftAdminTransferLeadershipRequest`: `bool gated = 3` and `uint64 max_lag = 4`. + + **Why a separate `gated` bit rather than overloading `max_lag == 0` (codex round-6 P2).** The earlier draft routed `max_lag == 0` to the un-gated method and `max_lag > 0` to the gated one. That makes `--leaderBalanceMaxTargetLag = 0` a **footgun**: the flag is named "max target lag," so an operator (or future maintainer) reading it expects `--leaderBalanceMaxTargetLag = 0` to mean *"require the target fully caught up — zero lag allowed,"* the **strictest** gate. Overloading the proto3 zero-value as the "ungated" sentinel inverts that: `0` would have **disabled** the gate entirely and silently allowed a transfer to an arbitrarily lagging target. Zero is a legitimate strict lag budget, so it must not double as the "no gate" signal. The fix separates the two concerns onto two fields: + - `gated = false` (the proto3 default, also what every existing client sends) ⇒ the **un-gated** operator path; `max_lag` is ignored. + - `gated = true` ⇒ the **gated** path with whatever budget `max_lag` carries — and now `max_lag = 0` correctly means the **strictest** gate (`threshold = Commit`, i.e. require `Match >= Commit`, the target must be fully caught up), exactly matching the flag name. The saturating-threshold predicate of §3.5 still applies unchanged: `threshold = Commit > maxLag ? Commit - maxLag : 0`, which at `maxLag = 0` reduces to `threshold = Commit`. + + This is the chosen resolution. The proto3-explicit-presence alternative (`optional uint64 max_lag = 3`, using `Has_max_lag` to distinguish unset/operator-legacy from set-to-0/strict-gate) is supported by the pinned toolchain (`libprotoc 29.3` / `protoc-gen-go v1.36.11`, proto3 explicit presence is stable since protobuf 3.15 / protoc-gen-go 1.27) but is **not** how the repo expresses optionality today — no `.proto` in-tree uses an `optional` field, so the generated `Has_*` accessor / pointer-deref ergonomics would be a first. The explicit `bool gated` carries the intent directly and reads unambiguously at the handler and in tests, so PR2 uses it; this is purely a wire-encoding choice and does not affect the engine-side guard. + + - **Backward compatible by proto3 construction.** Both new fields are absent on any existing client — every `cmd/raftadmin` call builds the request with only `TargetId`/`TargetAddress` set (`parseTransferTarget`, `cmd/raftadmin/main.go:370-378`) — so they decode to `gated = false` / `max_lag = 0` and route to the un-gated path. Old and new binaries interoperate in both directions; no `cmd/raftadmin` change is required. (A pre-`gated` server treats *both* new fields as unknown proto3 fields and drops them; see the mixed-version note below — that path executes ungated, which is the conservative outcome.) + - **Single routing rule in the server handler.** `RaftAdmin.TransferLeadership` (`internal/raftadmin/server.go:155-172`) routes on `req.Gated`: when `req.Gated` it calls the **gated** `TransferLeadershipToServerIfEligible(ctx, target.ID, target.Address, req.MaxLag)` (where `req.MaxLag == 0` is the legitimate strictest budget); when `!req.Gated` (the default for every operator call) it keeps calling the **un-gated** `TransferLeadershipToServer` exactly as today. + - **Operator path stays ungated — stated explicitly (§3.4 behavioral impact).** A `cmd/raftadmin leadership_transfer_to_server
` invocation leaves both fields unset (`gated = false`), so it preserves the legacy un-gated behavior: an operator performing disaster recovery can still **force** leadership onto a lagging follower (e.g. the only surviving node, which may be far behind). The catch-up gate applies **only** to the balancer's automated transfers, which always set `gated = true` and `max_lag = --leaderBalanceMaxTargetLag`. **Field absence — not `max_lag == 0` — is the only ungated signal**, so an operator who set `--leaderBalanceMaxTargetLag = 0` gets the strictest gate (the balancer still sets `gated = true`), never an accidentally-disabled one. This is the property that makes the gate safe to add without regressing the manual operator escape hatch. + +**Mixed-version forward safety during a rolling upgrade (resolves codex round-5 P2).** The proto3 "absent field ⇒ default" rule cuts the *unsafe* way too: an **old** `RaftAdmin` server that predates the `gated = 3` / `max_lag = 4` fields treats them as **unknown fields** and silently discards them (proto3 keeps them only as opaque unknown-field bytes; the generated Go handler never reads them). Its handler (`internal/raftadmin/server.go:155-172`) still routes unconditionally to the **un-gated** `TransferLeadershipToServer`. So if a new balancer forwards a `gated = true` request to a group leader still running the *old* binary, the old leader executes the transfer **ungated** — to a possibly far-behind target — while the balancer believes it requested a gated transfer that the executing leader would refuse if the target were lagging. The "old and new binaries interoperate" claim above is therefore true only for the *wire decode*, not for the *gating semantics*; the forwarded path must not rely on it. The fix has two layers, the first structural and the second defensive: + + - **Structural ordering (primary, no new machinery).** The gated handler ships in **PR2** (the engine method + the `gated`/`max_lag` proto fields + the server-handler route), and the **only** code that ever sets `gated = true` is the balancer, which is also introduced in PR2 behind `--leaderBalance` (default OFF). There is no separate "balancer enabled later than the engine" window: a binary that can *emit* `gated = true` is by construction the same binary that *honors* it on receive. The forwarded transfer only fires when `--leaderBalance` is true on the sending node, so the operator turning it on is, by the same act, attesting they have rolled the gated binary out. This makes the safe-vs-unsafe distinction a **deployment-ordering** property rather than a runtime negotiation — the same posture the encryption stage uses for its mutator RPCs (rolling-restart discipline: "don't enable any mutator RPC until every member of every Raft group reports capable", `main_encryption_admin.go:42-49`; `docs/design/2026_04_29_partial_data_at_rest_encryption.md` 6A caveat). + - **Explicit rollout rule (operator-facing).** Stated as a hard precondition in the milestone plan and ops section: **do not set `--leaderBalance=true` on any node until every node in the cluster runs a binary that includes the PR2 gated `RaftAdmin.TransferLeadership` handler.** Enabling the balancer mid-rollout (some nodes old, some new) risks a forwarded transfer landing ungated on an old leader. This mirrors the encryption rolling-upgrade caveat exactly and needs no new wire surface. + - **Defensive capability pre-check (belt-and-suspenders, reusing the in-tree fan-out precedent).** Because operator discipline can be violated, the balancer additionally performs a **capability check before forwarding** to a remote leader the first time in a balancer lifetime it would target that node, modeled on the encryption `GetCapability` Voters∪Learners fan-out (`adapter/encryption_admin.go:304-359`, `main_encryption_fanout.go`) and the SQS `htfifo` poller (`adapter/sqs_capability_poller.go:120`): advertise a `transfer_gate` capability bit (or a build-SHA / feature-flag entry) on the per-node capability report and have the balancer **skip the forward (counted `leaderbalance_skipped_total{reason=peer_ungated}`) when the source-group leader does not advertise it**, falling back to no-op + retry next cycle rather than issuing a transfer that would execute ungated. The exact capability surface (extend `CapabilityReport`, or add a lightweight `RaftAdmin.GetCapability`) is the subject of **OQ-14**; v1 may ship the structural ordering + rollout rule alone (which is sufficient when the rule is followed) and add the pre-check in PR3 if reviewers want defense-in-depth. The capability check, if added, also subsumes the per-node `htfifo`-publication question OQ-2 raises for SQS-refused groups — the same per-node capability report can carry both bits. + +This keeps the `cmd/raftadmin` operator surface and the automated balancer surface routing through one RPC and one handler, with the gate selected by data (the field value) rather than by a new method or a new service — avoiding both a new wire surface and the implicit assumption that all cluster nodes share the same `--leaderBalanceMaxTargetLag` value (the balancer's local value travels on the request). The proto regeneration is version-pinned (`protoc 29.3` / `protoc-gen-go v1.36.11` / `protoc-gen-go-grpc 1.6.1`, `proto/Makefile:1-3`); `make gen` must run under that toolchain. The forward path is otherwise a thin gRPC client (mirror the dial/credentials handling in `cmd/raftadmin/main.go:72-105`; reuse the admin connection cache that `startAdminFromFlags` already threads, `main.go:1238`). + +**Security posture of the forward path (current state + requirement).** The forwarded transfer reaches a peer over its per-group `RaftAdmin` gRPC service. The relevant in-tree fact is that **`RaftAdmin` is unauthenticated today**: the `--adminTokenFile` bearer-token interceptor (`adapter.AdminTokenAuth`) gates **only** methods under the `/Admin/` prefix and explicitly passes through every other service — including `/RaftAdmin/` — without a token check (`adapter/admin_grpc.go:446-449`, `:484`, `:498`; the test at `adapter/admin_grpc_test.go:570` asserts a non-`/Admin/` method bypasses the gate). The only interceptor `RaftAdmin` carries is the optional `MembershipChangeInterceptor`, which is the encryption writer-registry **pre-register hook on AddVoter/AddLearner**, not authentication (`internal/raftadmin/interceptor.go:5-39`, `main.go:1610`). Two consequences: + - Contrary to a naïve reading, enabling `--adminTokenFile` does **not** break the balancer's forwarded `TransferLeadership` calls — that token does not apply to `/RaftAdmin/`, so no credential needs to be threaded into the forward client for v1 to function. + - But that is because `RaftAdmin` (including `TransferLeadership`, `AddVoter`, `RemoveServer`) is reachable **without any authentication**. The balancer does not change this posture, but it does make automated, programmatic use of it the steady-state path, so the design states the requirement explicitly: **`RaftAdmin` listeners must be confined to a trusted network boundary** (the same boundary that already protects the inter-node Raft transport), since any client that can reach a group listener can already move that group's leadership or membership. If `RaftAdmin` is later put behind authentication (its own bearer token or mTLS — tracked under **OQ-4**, the operator-vs-automated surface split), the forward client must then present that credential exactly as `cmd/raftadmin` would; the dial helper this design reuses (`cmd/raftadmin/main.go:72-105`) is the place that credential plumbing lands. **OQ-11** records the RaftAdmin authentication gap as a precondition to harden before the scheduler runs on an untrusted network. + +> Note on `TransferLeadershipToServer` vs. `TransferLeadership`: the design always uses the **targeted** form so the destination is the balancer's chosen least-loaded node, not raft's default "first non-self voter" (`defaultTransferTargetLocked`, `internal/raftengine/etcd/engine.go:3852-3860`), which would not respect the balance objective. Specifically the balancer uses the **catch-up-gated** variant `TransferLeadershipToServerIfEligible` introduced in §3.5 (not the bare `TransferLeadershipToServer`, which has no catch-up gate and would stall writes on a lagging target). The bare un-gated form is not removed: it remains the engine method that the `RaftAdmin.TransferLeadership` handler selects when `gated` is false (the manual operator path, `cmd/raftadmin`, which leaves the field unset). So both methods coexist behind one RPC, with the request's `gated` flag selecting between them — and `max_lag = 0` under `gated = true` is the strictest gate, not the ungated route (§3.4 proto extension). + +### 3.5 Safety / exclusions + +A group is **eligible** for a transfer this cycle only if **all** of the following hold; otherwise it is skipped (and counted in a skip metric): + +- **No conf-change in flight — enforced on the executing leader, not only at the balancer's decision time.** Skip a group whose membership is changing — a transfer racing a conf-change can land on a member about to be removed. Two distinct enforcement points are required, because the balancer's decision-time read and the transfer's execution can happen on different nodes (case 2, §3.4): + - **Decision-time observability (resolves OQ-5).** Surface the existing per-group pending-config state through the exported `Status`: the etcd backend already tracks `pendingConfigs` (`map[uint64]adminRequest`, guarded by `e.pending`, `internal/raftengine/etcd/engine.go:390`, `:563`, populated in `storePendingConfig` `:3138`), but it is not on the public `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`). Add a `PendingConfChange bool` field to `Status` populated from `len(e.pendingConfigs) > 0` (mirroring how `LeadTransferee` was added to the struct), rather than leaking etcd internals via `rawNode.BasicStatus().Config.PendingConfIndex` past the engine boundary or adding a dedicated `Admin` method. This lets the balancer skip a group whose leader it *can* see has a pending conf-change. In **case 1** (the balancer host leads the group) this read is local and authoritative. + - **Execution-time guard on the group leader (required for case 2).** Decision-time `Status` is **not sufficient for the forwarded path**: in case 2 the balancer is a *follower* of the target group, so its local `Status().PendingConfChange` reflects only the local node and not the remote leader's `pendingConfigs`. And the current `RaftAdmin.TransferLeadership → TransferLeadershipToServer → handleTransferLeadership` path checks only leader state and raft's transfer acceptance (`internal/raftengine/etcd/engine.go:1743-1771`) — it does **not** consult `pendingConfigs`. Surfacing the field in `Status` alone therefore does not stop a forwarded transfer from racing an Add/Remove on the remote leader. PR2 must add a **server-side guard on the executing leader**: `handleTransferLeadership` rejects the transfer with a new `errLeadershipTransferConfChangePending` when `len(e.pendingConfigs) > 0`, before calling `rawNode.TransferLeader`. **The read must acquire `e.pending.Lock()` (claude round-3 #2):** running on the event loop is *not* by itself enough to make `e.pendingConfigs` consistent for a reader, because `cancelPendingConfig` deletes entries under `e.pending.Lock()` from a **non-event-loop goroutine** — the `ctx.Done()` arm of `submitAdminEx` (`internal/raftengine/etcd/engine.go:1366-1367`, `cancelPendingConfig` at `:3158-3165`). Reading `len(e.pendingConfigs)` without the lock is a data race under `-race`. The guard therefore snapshots `hasPending := len(e.pendingConfigs) > 0` under `e.pending.Lock()` and rejects when true. The same locking applies to the `PendingConfChange bool` field on `Status` (OQ-5): the `refreshStatus`/`Status()` snapshot path must read `len(e.pendingConfigs)` under `e.pending.Lock()` before writing the cached `e.status`. The balancer maps the rejection to `leaderbalance_skipped_total{reason=conf_change}` and retries next cycle. This makes the conf-change exclusion enforceable regardless of which node executes the transfer, which fits the §3.4 ownership rule (the executing leader owns the safety gate). **Unlike the catch-up/liveness gate (which is selected by the gated engine method — i.e. by `gated = true` on the wire — so it touches only the balancer path), the conf-change guard is *unconditional* — it applies to every transfer regardless of method or `maxLag`, including the operator `cmd/raftadmin` path.** That asymmetry is deliberate and is *not* a re-run of the catch-up-gate contradiction: a transfer racing a membership change can land leadership on a member about to be removed, which is a correctness hazard for *any* caller — there is no legitimate operator reason to force a transfer into an in-flight conf-change, whereas there *is* a legitimate operator reason to force a transfer onto a lagging follower (disaster recovery). So the conf-change guard correctly has no opt-out, while the catch-up gate correctly does (via `gated = false` / the un-gated method — **not** via `maxLag = 0`, which under the gated method is the strictest budget). **OQ-12** records the guard placement (in `handleTransferLeadership` vs. a check in the `RaftAdmin.TransferLeadership` server wrapper). +- **No transfer already in flight for the group — enforced on the executing leader, like the conf-change guard.** A group that already has a leadership transfer settling must not be perturbed by a second transfer to a *different* target. Two enforcement points are required, for the same reason as the conf-change guard (decision-time read and execution can happen on different nodes, case 2 §3.4): + - **Decision-time observability.** Skip a group whose leader reports `Status().LeadTransferee != 0` (`internal/raftengine/engine.go:78-82`, populated on the leader, `etcd/engine.go:1431`). **This is authoritative only in case 1** (the balancer host leads the group and reads its own non-zero `LeadTransferee` locally). In **case 2 it is a no-op**: `Status().LeadTransferee` is populated from `rawNode.BasicStatus().LeadTransferee` (`etcd/engine.go:2908`), which is the leader's `r.leadTransferee` and is **identically `0` on a follower** (`internal/raftengine/engine.go:78-82`: "non-zero on the current leader … zero otherwise, including on followers"). The balancer is a *follower* of the case-2 target group, so its local `Status().LeadTransferee` is always `0` regardless of the remote leader's state, and this decision-time skip cannot see a transfer in flight on the remote leader. + - **Execution-time guard on the group leader (required for case 2).** The corrected mechanism — round-9 (codex P2) found the round-8 wording was wrong in two ways: + - **etcd/raft does NOT reject a second transfer to a different target — it *aborts the in-flight one and starts the new one*.** Verified against the in-tree `go.etcd.io/raft/v3 v3.6.0`: when `leadTransferee != None` and the new `MsgTransferLeader` names a *different* target, `(*raft).Step` logs "abort previous transferring" and proceeds to set `r.leadTransferee` to the new target (`raft.go:1637-1645` / `:1655`) — it returns early only when the new target equals the *current* transferee (`:1639-1642`). So a second `TransferLeadershipToServer` to a different valid voter silently cancels the first transfer. **The round-8 claim that this "would hit `errLeadershipTransferRejected` (`etcd/engine.go:1765-1768`)" is false:** after raft restarts the transfer to target `B`, `rawNode.BasicStatus().LeadTransferee == B == target.NodeID`, so the post-call check at `:1765` (`LeadTransferee != target.NodeID`) is **false** and **no error is returned** — that error fires only when raft *silently drops* the request outright (target is a learner, self, or has no `Progress` entry), not for an abort-and-restart. The first caller's `waitForLeadershipTransfer` then sees `LeadTransferee` change off its target and returns `errLeadershipTransferAborted` (`etcd/engine.go:1424`/`:1435-1436`) — the first transfer has been canceled. **The round-8 claim that "in case 2 the remote leader enforces it at submit time" is therefore also false:** `handleTransferLeadership` has no in-flight-transfer guard today (`internal/raftengine/etcd/engine.go:1743-1771` checks only leader-state and raft's accept), so a forwarded `gated = true` transfer can silently abort an operator's `cmd/raftadmin leadership_transfer_to_server` or another system-initiated transfer and extend the write-stall window. + - **The guard PR2 must add.** Before `rawNode.TransferLeader` (`etcd/engine.go:1758`), `handleTransferLeadership` reads `e.rawNode.BasicStatus().LeadTransferee` and, when it is non-zero, rejects with a new `errLeadershipTransferInFlight` (a retryable error the balancer maps to a logged skip + retry next cycle) instead of issuing a transfer that would abort the in-flight one. `BasicStatus()` is **already** called one line earlier on the event loop for the leader-state check (`:1754`), and the same `BasicStatus().LeadTransferee != 0` read is already used race-free on the loop in the proposal/read drop paths (`etcd/engine.go:1567`, `:1593`) — so this read needs **no new lock** (unlike the conf-change guard, whose `pendingConfigs` is mutated off the loop by `cancelPendingConfig`). + - **Scope: unconditional, applied to both the gated and ungated paths** (same asymmetry as the conf-change guard — *not* keyed on the `gated` method marker, unlike the catch-up/liveness gate). Rationale: silently canceling another caller's in-flight transfer is almost never intentional and the cost of rejecting is low (wait for the current transfer to settle or time out, then retry); making it unconditional also closes the case-1 TOCTOU window between the decision-time `LeadTransferee` read and execution. **Justification for not exempting the operator path:** an operator who *intends* to override an in-flight transfer is already served — they disable the balancer (kill switch, §3.6) and the in-flight transfer is the balancer's own; an operator racing a *system* transfer they did not initiate is exactly the silent-cancel hazard this guard prevents, and the retryable rejection (settle-then-retry) is the correct behavior even for `cmd/raftadmin`. The rejected alternative — gate only the `gated = true` balancer path and keep `cmd/raftadmin` able to abort an in-flight transfer — would leave the balancer's own automated transfers cancelable by a racing operator force-transfer *and* still let `cmd/raftadmin` silently cancel a balancer transfer; the unconditional form is strictly safer and the operator's legitimate override is the kill-switch path, not abort-by-force. (Contrast the catch-up/liveness gate, which *is* gated-only because there is a legitimate operator reason to force-transfer onto a lagging follower — disaster recovery — whereas there is no legitimate reason to *silently cancel* a peer's in-flight transfer.) + - The balancer maps the rejection to `leaderbalance_skipped_total{reason=transfer_in_flight}` and retries next cycle. **OQ-15** records the guard placement and the unconditional-vs-gated scope decision. +- **Target follower is caught up / healthy — validated *only* on the group leader, by an engine-side guard, never by raft's own transfer-accept logic and never by the balancer's local `Status`.** This is the single most important safety gate, and round-3 review (codex P1/P2, claude #1) found the round-2 wording got the mechanism wrong in two ways. The corrected analysis: + - **etcd/raft does NOT reject a lagging target — it accepts the transfer and *stalls writes*.** Verified against the in-tree `go.etcd.io/raft/v3 v3.6.0`: `(*raft).Step` on `MsgTransferLeader` (`raft.go:1631-1662`) returns early **only** when the target is a learner (`:1632-1634`), is the local node (`:1647-1649`), or already has a transfer in progress to the same node (`:1639-1642`). For **any other voter — including a far-behind one — it sets `r.leadTransferee = leadTransferee` (`:1655`) and `sendAppend`s to catch the target up (`:1660`)**, completing only once `pr.Match == lastIndex` (`:1656-1658`). While `leadTransferee != None`, **the leader drops every new proposal with `ErrProposalDropped` (`raft.go:1296-1299`)** — i.e. writes on that group stall until the target catches up or one election timeout elapses and raft aborts the transfer. So the round-2 assumption that `TransferLeadershipToServer` would return `errLeadershipTransferRejected` for an insufficiently-caught-up target is **false**: the in-tree `handleTransferLeadership` rejection check (`internal/raftengine/etcd/engine.go:1765-1768`) fires only when raft *silently dropped* the transfer (learner / self / no Progress entry at all), **not** for a voter that is merely behind — for a behind voter `BasicStatus().LeadTransferee == target.NodeID` is true, the check passes, and the leader has already paused proposals. A lagging remote target therefore stalls writes for up to the transfer timeout. The catch-up gate must run **before** `rawNode.TransferLeader`, on the executing leader, and it does not exist today. + - **Only the active leader holds the per-peer `Progress` map; followers cannot see it.** In etcd/raft the `Progress` map (`Match`/`Next`/`RecentActive`) lives only on the active leader (`go.etcd.io/raft/v3 status.go:25` — "The Progress is only populated on the leader") and is reached in-tree via `e.rawNode.Status().Progress` (used by `handlePromoteLearner`, `internal/raftengine/etcd/engine.go:1674-1688`). The exported `raftengine.Status` struct (`internal/raftengine/engine.go:67-83`) **does not expose `Progress` at all**. So the balancer host cannot locally pre-filter targets by `Progress.Match` for any group it does not lead (the case-2 majority of transfers, §3.4) — its local engine for that group is a follower with no Progress. + - **`Status.LastContact` is the WRONG signal for target liveness and is identically 0 on a leader (codex P2 / claude #1).** Round 2 specified a case-1 liveness gate of `Status.LastContact < electionTimeout/2`. That is ineffective: `lastContactFor` returns `0` unconditionally when the local engine is `StateLeader` (`internal/raftengine/etcd/engine.go:3481-3483`), and the field it derives from is a **single** value (`e.lastLeaderContactFrom` / `e.lastLeaderContactAt`, set in `recordLeaderContact`, `:3540-3546`) describing *this follower's contact with ITS leader* — there is no per-peer leader→follower tracking. In a case-1 transfer the balancer host **is** the leader, so its own `Status().LastContact` is always `0`, `0 < electionTimeout/2` is trivially true, and the gate would pass even for a partitioned/dead target. The correct per-peer leader-side liveness signal is **`Progress.RecentActive`** (`go.etcd.io/raft/v3 tracker/progress.go:87-91` — "true if the progress is recently active … reset to false after an election timeout"), which is in the leader-only `Progress` map alongside `Match`. + - **Ownership rule (unchanged, now exhaustive):** target catch-up *and* liveness validation is the **responsibility of the node that executes the transfer — which is always the group leader** (case 1: the balancer host itself; case 2: the remote source-group leader the request is forwarded to, §3.4). The group leader is the only node that holds the live `Progress` map. The balancer host does **not** assert "target X is caught up / live"; it expresses a **load preference** (the least-loaded eligible voter, or an ordered preference list least-loaded→most-loaded) and the executing leader applies catch-up + liveness as the final, authoritative gate. This holds for **both** sub-cases — there is no longer a separate "case-1 pre-filter on `Status`" path, because the only place the signal exists is inside the engine on the leader. + - **Concrete eligibility predicate, enforced by a new engine-side guard `handleTransferLeadership` runs *before* `rawNode.TransferLeader`** (mirroring how `handlePromoteLearner` already gates on `Progress.Match`, `internal/raftengine/etcd/engine.go:1674-1688`). For the resolved `target.NodeID`, read `e.rawNode.Status().Progress[target.NodeID]` on the event loop and require **all** of: + - **Caught up — explicit saturating threshold is the recommended form (resolves codex round-5 P1 + round-6 follow-up):** the lag check must be written so it never underflows `uint64` **in either direction**. The eligibility predicate is: **eligible iff `progress.Match >= threshold`, where `threshold = status.Commit - maxLag` if `status.Commit > maxLag`, else `threshold = 0`.** Implement it with the explicit saturating branch — `var threshold uint64; if status.Commit > maxLag { threshold = status.Commit - maxLag }; eligible = progress.Match >= threshold` — and **do not** write either subtraction form directly. Both naïve one-liners underflow `uint64` in a different regime: + - `progress.Match >= status.Commit - maxLag` underflows when `status.Commit < maxLag` — the steady state of any *young* group whose commit index has not yet reached the default `1024`-entry budget (a fresh cluster, a freshly-split group, a low-traffic group). `status.Commit - maxLag` wraps to a value near `2^64`, so the predicate rejects **every** target, including one perfectly caught up (`Match == Commit`), and the balancer is permanently stuck on a fresh/low-traffic cluster. + - `status.Commit - progress.Match <= maxLag` underflows in the **opposite** direction, when `progress.Match > status.Commit` — which **can stably occur** for a real follower in any cluster with quorum > 2 (a 5-node group, quorum = 3). In etcd/raft v3.6 the leader updates `Progress[i].Match` the moment it processes a follower's `MsgAppResp` for index N (`stepLeader` calls `pr.MaybeUpdate(m.Index)` then `r.maybeCommit()`, `go.etcd.io/raft/v3 raft.go`), but `CommitIndex` advances only once a quorum holds N. So if the leader and follower A have both replicated entry N=100 while followers B, C, D have not yet acknowledged it, then `Progress["A"].Match = 100` while `Commit` stays at e.g. 80 — a **stable** intermediate state (it persists until two more followers catch up), not a transient one. There `status.Commit - progress.Match` is `uint64(80) - uint64(100) ≈ 2^64 - 20`, which is `> maxLag`, so the subtraction form **incorrectly rejects A** — a follower that is actually *more* caught up than the committed state and the best possible transfer target. In a 5-node cluster under sustained writes, if every candidate happens to have `Match > Commit` at once, that form rejects **all** of them and the balancer skips the group for a whole cycle. (In a 3-node cluster, quorum = 2, so `Commit` advances in the same event-loop step as any one follower's `MaybeUpdate` and the problematic state does not arise — but PR0's topology and the §5 integration tests are generic across cluster sizes, so the guard spec must be correct for quorum > 2.) + + The two subtraction one-liners are therefore **not equivalent** to each other or to the saturating form, and `Match <= Commit` is **not** a universal invariant — it is false exactly in the quorum > 2 case above. The saturating-threshold form is the only one correct for all cluster sizes: it never subtracts when it would wrap (`threshold = 0` when `Commit <= maxLag`), and it compares `Match >= threshold` so a follower with `Match > Commit` is correctly accepted (it trivially clears any threshold `<= Commit`). The guard never runs against self (`resolveTransferTarget` rejects self, §3.5), so the only `Match` values it sees are followers'. Units are log entries; default **`--leaderBalanceMaxTargetLag = 1024` entries**. On the executing leader both `progress.Match` and the commit index are local reads. A transfer to a target more than `maxLag` behind the committed index is rejected before it can pause proposals. **`maxLag = 1024` is a conservative default that keeps the transfer window short at moderate append throughput; it is NOT a fixed "one election timeout" of log — at high write rates 1024 entries is a small fraction of an election timeout (and would reject most targets), at low rates it is many.** Operators running high-throughput workloads should raise it toward roughly `raftElectionTimeoutTicks × tickInterval × raftMaxEntriesPerMsg` entries (the log a healthy follower can fall behind within one election timeout at that throughput); tighter values risk never finding an eligible target under steady write load, looser values risk a longer write stall on the new leader's group while raft catches the target up. Note the young-group case makes the gate **permissive** on a fresh cluster (`threshold = 0`, so any live voter clears it), which is correct: there is no large backlog to catch up on, so any live voter is a safe target — the danger the gate exists to prevent (a target tens of thousands of entries behind) only arises once the log is long. + - **Recently active (liveness):** `progress.RecentActive == true`. This is the correct per-peer leader-side liveness signal (reset to false after an election timeout of silence), replacing the round-2 `Status.LastContact` gate, which is unusable on a leader (above). It needs no separate flag. + - **Learner / self / unknown rejected** by the existing path: `resolveTransferTarget` rejects self (`namedTransferTargetLocked`, `:3862-3879`) and unknown peers, raft drops learners/no-Progress targets (`raft.go:1632-1634`), surfaced as `errLeadershipTransferRejected` (`:1765-1768`). + On failure the guard rejects **before** `rawNode.TransferLeader` with a new `errLeadershipTransferTargetNotCaughtUp` (so no proposals are ever paused). This guard runs on the executing leader in **both** sub-cases — for case 1 the balancer's *own* engine applies it; for case 2 the *remote* source-group leader applies it. **The catch-up/liveness gate is selected by *which engine method* is called, not by the numeric `maxLag` value — so it never gates the operator path, and `maxLag = 0` under the gated method is the strictest gate, not a disable signal.** Both `TransferLeadershipToServer` (un-gated) and `TransferLeadershipToServerIfEligible` (gated) submit the same `adminActionTransferLeadership` request onto the single event-loop handler `handleTransferLeadership` (`internal/raftengine/etcd/engine.go:1325-1331`, `:1743-1771`). The `IfEligible` variant threads `maxLag` **and an "apply gate" marker** onto the `adminRequest`, and `handleTransferLeadership` runs the `Progress.Match`/`RecentActive` check **whenever that marker is set** (i.e. for every `…IfEligible` call), using the saturating threshold so `maxLag = 0` ⇒ `threshold = Commit` ⇒ require `Match >= Commit` (fully caught up). The bare `TransferLeadershipToServer` carries **no** gate marker, so the catch-up gate is skipped and the operator force-transfer path (`cmd/raftadmin`) keeps its current behavior — exactly the property the §3.4 `gated = false` proto route guarantees end-to-end. (Keying on the method/marker rather than on `maxLag > 0` is what lets `maxLag = 0` mean "strictest gate" instead of "no gate" — the codex round-6 P2 fix carried through to the engine boundary.) `--leaderBalanceMaxTargetLag` is the only tunable; the round-2 `--leaderBalanceMaxTargetLastContact` flag is **dropped** (its underlying `Status.LastContact` signal does not work on a leader). + - **How the guard reads `Progress` — decision (resolves the round-2 "read it inside the engine or a small accessor" hand-wave, claude #3): keep `Progress` internal; add a thin engine method, do not export the map.** Add `TransferLeadershipToServerIfEligible(ctx, targetID, address string, maxLag uint64) error` to the engine `Admin` interface (`internal/raftengine/engine.go`); its etcd backend runs the underflow-safe **saturating-threshold** lag check (`threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold` — never either bare subtraction form, both of which wrap on `uint64`: `Match >= Commit - maxLag` on a young group, `Commit - Match <= maxLag` when `Match > Commit` in a quorum > 2 cluster — see the "Caught up" bullet above) + `RecentActive` check on the single-threaded event loop inside `handleTransferLeadership` and then transfers, returning `errLeadershipTransferTargetNotCaughtUp` on failure. This keeps the leader-only `Progress` map and the "is `RecentActive` a good-enough liveness proxy" question behind the engine boundary, prevents any caller from constructing a stale eligibility decision from an exported map, and means **case 1 and case 2 share one engine method** (case 1 calls `TransferLeadershipToServerIfEligible` on the local engine; case 2 forwards over `RaftAdmin.TransferLeadership` with `gated = true`, which the server handler routes to the *same* engine method on the remote leader — §3.4). Exporting `Progress` on `Status` is the rejected alternative (a wider, harder-to-evolve public contract). The pre-existing `TransferLeadershipToServer` (no catch-up gate) stays as the engine method behind the `gated = false` route, i.e. the manual operator path (`cmd/raftadmin`, which leaves both fields unset) keeps its ungated force-transfer behavior; the balancer always sets `gated = true` and so always lands on the `IfEligible` form (with `maxLag = --leaderBalanceMaxTargetLag`, where `0` is the strictest budget, not a disable). **This Progress exposure therefore lands in PR2 at the guard, not as a case-1-only local pre-filter** — there is no separate case-1 exposure to plan. + - **Concretely per sub-case:** + - **Case 1 (balancer host leads the group):** the balancer calls `engine.TransferLeadershipToServerIfEligible(...)` on the **local** engine; the engine-side guard (above) is the catch-up/liveness gate. No reliance on `Status.LastContact`. + - **Case 2 (forwarded to a remote leader):** the balancer cannot see the remote leader's `Progress`, pending conf-change, *or* in-flight transfer locally — all three are enforced on the executing leader (the balancer's follower-local `Status().LeadTransferee` / `Status().PendingConfChange` are identically the local node's, not the remote leader's). It forwards its chosen target (and, in PR3, optionally an ordered preference list) to the source-group leader, which applies the engine-side catch-up/liveness guard, the conf-change guard, and the in-flight-transfer guard. The executing leader rejects (1) a transfer racing a membership change via the conf-change guard of the "No conf-change in flight" bullet above (`errLeadershipTransferConfChangePending`), (2) a transfer racing an already-in-flight transfer via the in-flight guard (`errLeadershipTransferInFlight`, so the forwarded request can never silently abort an operator's or another system's in-flight transfer — §3.5 "No transfer already in flight"), and (3) a too-far-behind / not-recently-active target via the engine-side guard (`errLeadershipTransferTargetNotCaughtUp`) *before* `rawNode.TransferLeader`, plus the existing `errLeadershipTransferRejected` backstop for a target raft drops outright (`:1765-1768`); `waitForLeadershipTransfer` fails closed if the transfer later aborts (`:1384-1405`, `:1431-1437`). The balancer maps these outcomes to a logged skip + `leaderbalance_skipped_total{reason=conf_change|transfer_in_flight|no_healthy_target}` / `leaderbalance_transfers_failed_total{reason=rejected|aborted}` and tries again next cycle. + - The forward path therefore needs an `IfEligible`-equivalent on the remote leader. The current `RaftAdmin.TransferLeadership` RPC accepts a single `TargetId`/`TargetAddress` (`internal/raftadmin/server.go:155-172`) and maps to the **un-gated** `TransferLeadershipToServer` — there is no field on the wire to carry the gating intent or `maxLag` (`proto/service.proto:248-251`). PR2 resolves this by **extending the proto with a `bool gated = 3` flag and a `uint64 max_lag = 4` field** (decision in §3.4, resolving OQ-4 + codex round-6 P2): the server handler routes to the gated `TransferLeadershipToServerIfEligible(ctx, id, addr, req.MaxLag)` when `req.Gated` and keeps the un-gated `TransferLeadershipToServer` when `!req.Gated`. The balancer's case-2 forward client always sets `gated = true` with `max_lag = --leaderBalanceMaxTargetLag` (so the remote leader gates; `max_lag = 0` is the strictest budget, not a disable); the manual operator path (`cmd/raftadmin`) leaves both fields unset (`gated = false`), so it stays **ungated** and a force-transfer to a lagging follower remains possible. There is exactly **one** routing path — the `gated` flag selects the gated vs. ungated engine method; the design does **not** carry a second "change the existing handler unconditionally" sub-option (which would have silently gated `cmd/raftadmin`), nor a separate internal RPC. v1 forwards exactly one preferred target and relies on the remote leader's engine-side guard + submit-time rejection for safety. The **ordered preference list** (so the leader can fall through to the next-least-loaded healthy voter in one round trip) is a future refinement that would add a repeated preference field to the same request; it is recorded under **OQ-10**, with the single-target + guard + reject-and-retry behavior as the correct, shippable v1. +- **Not policy-pinned.** A group **must never** be balanced onto a node whose policy *refuses* leadership of it: + - **SQS leadership refusal.** A node lacking the `htfifo` capability refuses leadership of any group hosting a partitioned FIFO queue (`main_sqs_leadership_refusal.go:69-121`, wired from `--sqsFifoPartitionMap` via `partitionedGroupSet`, `main_sqs_leadership_refusal.go:136-156`, `shard_config.go:174-196`). If the balancer transferred such a group onto a refusing node, that node would immediately transfer it away again — a guaranteed ping-pong. The balancer must therefore **know which (group, node) pairs are refusing** and exclude them as transfer targets, and exclude refusing groups from being *moved onto* such nodes. **How it knows:** the same configuration that drives the refusal hook — the partitioned-group set (`partitionedGroupSet`) plus each node's advertised `htfifo` capability — is the source of truth. v1 takes the conservative correct stance: for any group in the partitioned-FIFO set, the balancer **excludes that group from balancing entirely** unless it can positively confirm every candidate target advertises `htfifo`. Since capability is per-binary and not currently published per-node in a balancer-readable form, v1 simply **excludes partitioned-FIFO groups from leader balancing** (they are already governed by the refusal hook). **OQ-2** asks whether to instead publish per-node `htfifo` capability (e.g. via the existing capability fanout, `main.go` `encryptionCapabilityFanout` / `CapabilityReport`) so those groups *can* be balanced among the htfifo-capable subset. + - **Future pinning flag.** Reserve a `--leaderBalancePinGroups` (and/or per-group pin) exclusion list so an operator can pin a group's leadership (e.g. to a node with special hardware) and keep the balancer off it. (Design the config now; ship the flag with the scheduler PR.) +- **Default group is balanceable, with a called-out blip.** The default group is *not* excluded — leaving it pinned to one node defeats half the point (that node also carries HLC/catalog work). But moving the default group's leadership has two transient effects the operator must understand: + - **HLC ceiling renewal** restarts on the new leader. The ceiling is proposed every `hlcRenewalInterval` (1 s) with a `hlcPhysicalWindowMs` (3 s) window (`kv/coordinator.go:37-46`, `:644-669`); because the window (3 s) exceeds the renewal interval (1 s) and a newly elected leader clamps `Next()` to `max(wall, ceiling)`, a clean transfer does **not** let the new leader issue a timestamp inside the old leader's window — the safety invariant holds across a transfer exactly as it does across a natural election. + - **Lease reads:** a transfer invalidates the lease on the old leader (`RegisterLeaderLossCallback → lease.invalidate`, `kv/coordinator.go:131`, `kv/sharded_coordinator.go:584`), so the new leader's *first* read takes the slow `LinearizableRead` path (one ReadIndex round-trip) before its lease warms again — a single read-latency blip, not a correctness issue. This is identical to the blip on any natural election. The cooldowns keep these blips rare. + - **Recommendation (resolving OQ-3): balance the default group LAST — gated by "no eligible non-default reducing move exists this cycle", not by other groups' source counts (resolves codex round-7 P2).** The default group is only admitted to the eligible-source set on a **second pass** of the §3.3 candidate iteration, executed only when the first pass — restricted to non-default groups — found no `(source, group, target)` tuple that strictly reduces the spread. Concretely each cycle runs: + 1. **First pass — non-default only.** Run the §3.3 descending-source-count iteration over candidate source nodes, considering only groups whose `group_id != defaultGroupID`. If any source yields an eligible (group, target) pair (§3.5 conf-change / in-flight / catch-up / liveness / not policy-pinned, all unchanged), issue exactly that transfer and stop (single transfer per cycle is preserved). + 2. **Second pass — default group admitted.** Only if the first pass returned **no eligible reducing move**, repeat the descending-source-count iteration with `group_id == defaultGroupID` *also* eligible. If this pass finds a default-group move that strictly reduces the spread, issue it. Otherwise issue no transfer (counted under `leaderbalance_skipped_total{reason=no_eligible_move}`). + - **Why this replaces the round-6 wording.** The earlier rule — "every other group's source node is already at or below `⌈groups / nodes⌉`" — gated default-group admission on a property of *raw source counts*, but those counts include groups that are *categorically ineligible to move this cycle* (pinned via `--leaderBalancePinGroups`, refusing leadership via the SQS `htfifo` exclusion, in per-group cooldown, in a conf-change, or with a transfer already in flight). A node leading the default group plus several pinned non-default groups stays above `⌈groups / nodes⌉` indefinitely because none of those non-default groups can ever move; the default group then is **never** admitted even though moving it would strictly reduce the spread. Example: `A = {default, P1, P2, P3, P4}` all five pinned-or-default, `B = ∅`, `C = ∅`, `ceil = 2`. First-pass non-default iteration finds nothing eligible (P1..P4 all pinned). Second-pass default admission then moves `A → default → B` (spread `5 → 4`), then on subsequent cycles either A's count drops further or A stabilizes at `1` because nothing else can move — the optimum given the pins. Eligibility (not raw counts) is the correct gate, because eligibility is what determines whether progress is *possible* this cycle. + - **Why this is still "balance last."** The default group is still touched only after the non-default candidate set is exhausted *for this cycle*; in the steady state where non-default moves are available, default-group cycles never run and the two transient effects above (HLC-renewal restart + one lease-read blip, plus the §3.1 recursive balancer-host churn risk) are still deferred. The only difference vs. round-6 is that "non-default candidate set is exhausted" is now defined by eligibility — which is what the §3.3 inner loop is already computing — rather than by an indirect count proxy that pins falsify. An operator who wants the default group pinned entirely keeps `--leaderBalancePinGroups` (§3.6) as the unambiguous opt-out. This is strictly safer than "balance first" and strictly more useful than "pin by default" (which leaves the HLC/catalog node permanently over-loaded) and now also strictly more progress-preserving than "raw-count gate" (which deadlocks under heavy pinning). See §8 OQ-3. +- **In-flight hotspot split (M2/M3) interaction — decision: rely on cooldowns, do not add an explicit interlock in v1.** A leadership transfer during an in-flight `SplitJob` (M2) does not corrupt the job — the migrator runs on the default-group leader and is itself reset/resumed on leadership change (M3 §7.6 / M2 resumability). Adding a hard "pause balancing during any active SplitJob" interlock would couple this scheduler to M2/M3's job state, which the doc-first plan keeps decoupled (M3 is not a dependency). v1 therefore relies on the per-group and global cooldowns to keep transfer churn low enough not to disturb a split; if production shows interference, a follow-on can add the interlock behind the shared scheduler interface. **OQ-6** records this for reviewers. + +### 3.6 Ops + +- **Default OFF behind `--leaderBalance`** (bool, default `false`). When false, no scheduler goroutine starts; behavior is byte-identical to today. When true, the default-group leader runs the observe→decide→transfer loop on `--leaderBalanceInterval` (default e.g. 30 s). +- **Rolling-upgrade precondition for `--leaderBalance` (codex round-5 P2).** **Do not set `--leaderBalance=true` on any node until every node in the cluster runs a binary that includes the PR2 gated `RaftAdmin.TransferLeadership` handler** (the one that honors the `gated = 3` / `max_lag = 4` fields). Rationale: a forwarded transfer (§3.4 case 2) carrying `gated = true` to a group leader still on an *old* binary is silently executed **ungated** — the old handler treats both new fields as unknown proto3 fields and drops them, then routes to the un-gated `TransferLeadershipToServer`, so the catch-up gate the balancer expects never runs on the executing leader. This is the same rolling-restart discipline the encryption mutator RPCs require ("don't enable any mutator RPC until every member of every Raft group reports capable", `main_encryption_admin.go:42-49`). The structural ordering (the balancer and the gated handler ship in the same PR2 binary, §3.4) makes single-binary skew impossible; this rule covers the *cross-node* heterogeneous-cluster window. An optional `transfer_gate` capability pre-check (OQ-14) can enforce it automatically — the balancer skips the forward (`leaderbalance_skipped_total{reason=peer_ungated}`) to any leader that does not advertise the gated handler. +- **Tuning flags:** `--leaderBalanceInterval`, `--leaderBalanceImbalanceThreshold` (default 2), `--leaderBalancePerGroupCooldown` (30 s), `--leaderBalanceGlobalCooldown` (10 s), `--leaderBalanceStartupGrace` (default `max(globalCooldown, interval)`; the post-acquisition warm-up of §3.1), `--leaderBalanceMaxTargetLag` (default 1024 log entries; the executing-leader catch-up gate enforced inside the engine, §3.5 — tune up for high-throughput deployments per §3.5; **`= 0` is a valid, strict setting — it requires the target be fully caught up, `Match >= Commit`, since the balancer always sends `gated = true` and the gate is keyed on that flag, not on `max_lag == 0` (codex round-6 P2, §3.4)**), `--leaderBalancePinGroups` (comma-separated group IDs to exclude; also the operator opt-out for the default group, §3.5 OQ-3). (The round-2 `--leaderBalanceMaxTargetLastContact` flag is **removed**: target liveness is now the leader-side `Progress.RecentActive` signal, not `Status.LastContact`, which is identically 0 on a leader, §3.5.) +- **Runtime kill switch:** the loop checks an `atomic.Bool` each cycle; an admin RPC / endpoint (reuse the existing admin surface that hosts keyviz/distribution operator calls, `adapter/admin_grpc.go`) toggles it without a restart. Flipping it off stops new transfers at the next cycle boundary; an already-issued transfer completes (it is a single raft operation). Mirrors M3 §7.2. +- **Metrics (bounded cardinality, per-node, fixed-enum labels only — no per-group/per-node-pair labels):** + - Counters: `leaderbalance_transfers_attempted_total`, `leaderbalance_transfers_succeeded_total`, `leaderbalance_transfers_failed_total{reason}` where `reason ∈ {not_leader, rejected, aborted, rpc_error, timeout}`; `leaderbalance_skipped_total{reason}` where `reason ∈ {below_threshold, cooldown, conf_change, transfer_in_flight, no_healthy_target, pinned, sqs_refused, in_cooldown_global, startup_grace, peer_ungated, no_eligible_move}` (`peer_ungated` only when the optional capability pre-check of OQ-14 is enabled; `no_eligible_move` is the §3.5 two-pass-exhausted outcome — imbalance ≥ `imbalanceThreshold` but every candidate `(source, group, target)` combination is pinned / in cooldown / mid-conf-change / unhealthy / refused, so no strict-spread-reducing transfer exists this cycle. Semantically distinct from `below_threshold` which means the spread itself is `< imbalanceThreshold` so no transfer is even considered). + - Gauges: `leaderbalance_enabled` (0/1); `leaderbalance_leaders_per_node{node}` — **bounded** because node count is small and operator-fixed (unlike route count); `leaderbalance_unobservable_groups` (count of groups skipped from leader counting this cycle because their local engine reported no stable leader, §3.2) — a companion to `leaders_per_node` so that a "balanced" gauge sum can be distinguished from a "partially observed" one (without it, several groups mid-election simultaneously make `leaders_per_node` silently sum to fewer than the actual group count, making the two states indistinguishable); `leaderbalance_eval_duration_seconds` (last cycle wall time, diagnostic). (If even per-node cardinality is a concern in very large clusters, expose `leaderbalance_leader_count_spread` — the single max-min number — instead; **OQ-7**.) +- **Structured slog** with stable keys per CLAUDE.md: `group_id`, `from`, `to`, `from_count`, `to_count`, `spread_before`, `spread_after`, `decision ∈ {transfer, skip, cooldown}`, `reason`. Exactly one line at the decision point and one at the transfer result. +- **Manual escape hatch already exists and stays un-gated.** `cmd/raftadmin leadership_transfer_to_server
` lets an operator move leadership by hand at any time (`cmd/raftadmin/main.go:212-217`, `:359-378`); the balancer does not remove that path, and an operator can disable the balancer (kill switch) and drive transfers manually. Because that CLI leaves the new `gated`/`max_lag` fields unset (`gated = false`, §3.4), the operator transfer keeps its **un-gated** behavior — a disaster-recovery force-transfer onto a lagging follower (e.g. the only surviving node) is **not** blocked by the catch-up gate, which applies only to the balancer's `gated = true` automated calls. (The conf-change guard, by contrast, is unconditional and does apply to the operator path — there is no safe reason to force a transfer into an in-flight membership change, §3.5.) + +## 4. Milestones / PR breakdown + +| PR | Scope | Tests | Independently shippable? | +|---|---|---|---| +| **PR0 (prerequisite, §1.1a)** | **Multi-node multi-group topology.** Either (a) extend the bootstrap/flag surface so a group's voter set can span more than one node at startup (per-group members in `--raftGroups` or a companion `--raftBootstrapMembers`-style flag that lifts the `len(groups)==1` guard, `main.go:742-748`), or (b) document + harness the runtime `AddVoter`/`PromoteLearner` composition path (`internal/raftengine/engine.go:225-233`) as the supported way to build multi-voter groups, with an integration topology (extend `scripts/run-jepsen-m5-local.sh`'s single-member-per-group model, `:5-20`) that stands up groups with voters on ≥2 nodes. **OQ-9** decides (a) vs (b). | Bootstrap/flag parsing tests (or AddVoter-composition harness); integration smoke that a group has voters on ≥2 distinct nodes and `TransferLeadershipToServer` succeeds between them. | Yes — unblocks the scheduler; useful on its own (true multi-node multi-group). | +| **PR1** | Leader-count **observation + metrics only**: a leader-identity reader sibling to `publishLeaderTerms` (`main.go:2126-2143`) that builds the per-node leader-count map on the default-group leader, and exports the `leaderbalance_leaders_per_node` gauge + `leaderbalance_enabled=0`. **No transfers.** Pure observability — shippable even before PR0 (it just observes a trivially-balanced single-voter topology). | Unit: leader-map construction from a fake set of per-group `State()`/`Leader()`; gauge registration. | Yes — observe-only, zero behavior change. | +| **PR2** (blocked on PR0) | The pure **policy function** (§3.3) + the **scheduler loop** + **transfer execution** behind `--leaderBalance` (default OFF) + runtime kill switch + startup grace (§3.1) + slog + the transfer-result metrics. Local-leader transfer path (case 1, §3.4) and forwarded path (case 2, single preferred target, leader-validated, §3.5). Eligibility: leader-count map **seeded with all voters at zero** (§3.2) + "no-leader skip" + per-group/global cooldowns + startup grace + conf-change skip + `LeadTransferee != 0` skip. The transfer **executes on the group leader**, which owns **all** target catch-up + liveness validation via a **new engine-side guard** — add `TransferLeadershipToServerIfEligible(ctx, id, addr, maxLag)` to the engine `Admin` interface; its `handleTransferLeadership` body checks the underflow-safe **saturating-threshold** lag gate (`threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold` — NOT either bare subtraction form: `Match >= Commit - maxLag` underflows on a young group, `Commit - Match <= maxLag` underflows when `Match > Commit` in a quorum > 2 cluster — §3.5 P1 fix) **and** `Progress.RecentActive` against the leader-only `e.rawNode.Status().Progress[target]` *before* `rawNode.TransferLeader`, returning `errLeadershipTransferTargetNotCaughtUp` (mirrors `handlePromoteLearner`, §3.5). **Progress exposure lands here at the guard, applied identically to case 1 (local engine) and case 2 (remote leader) — there is no separate case-1-only `Status.Progress` exposure.** The same `handleTransferLeadership` also gets the execution-time conf-change guard (rejects when `len(e.pendingConfigs) > 0`, read **under `e.pending.Lock()`** — `cancelPendingConfig` mutates from a non-event-loop goroutine, §3.5 / OQ-12; *unconditional*, applies to the operator path too, unlike the `gated`-method-keyed catch-up gate) plus the `PendingConfChange` field on `Status` (also read under `e.pending.Lock()`, OQ-5), **and** the execution-time in-flight-transfer guard (rejects with `errLeadershipTransferInFlight` when `e.rawNode.BasicStatus().LeadTransferee != 0` — read on the event loop, **no new lock**, mirroring the existing drop-path reads at `etcd/engine.go:1567`/`:1593`; *unconditional* like the conf-change guard, since etcd/raft v3.6 silently aborts the in-flight transfer and starts the new one rather than rejecting — `raft.go:1637-1645` — so without this guard a forwarded `gated = true` transfer would silently cancel an operator's or another system's in-flight transfer, §3.5 / OQ-15). The forwarded path routes to the gated `…IfEligible` form via **new `bool gated = 3` + `uint64 max_lag = 4` fields on `RaftAdminTransferLeadershipRequest`** (`proto/service.proto:248-251`, `syntax = "proto3"`, today has only `target_id`/`target_address`; regen is version-pinned, `proto/Makefile:1-3`): the server handler (`internal/raftadmin/server.go:155-172`) calls `…IfEligible(ctx, id, addr, req.MaxLag)` when `req.Gated` (where `req.MaxLag == 0` is the legitimate strictest budget — require `Match >= Commit`) and the un-gated `TransferLeadershipToServer` when `!req.Gated` (the default for every `cmd/raftadmin` call, `parseTransferTarget`, `cmd/raftadmin/main.go:370-378`), so the operator force-transfer path stays ungated and backward-compatible (§3.4 / OQ-4 + codex round-6 P2: a separate `gated` bit, not `max_lag == 0`, is the ungated sentinel, so `--leaderBalanceMaxTargetLag = 0` is the strictest gate, not a disabled one). **Mixed-version forward safety (codex round-5 P2):** the gated handler + the `gated`/`max_lag` fields + the only emitter (the balancer) all ship in **this same PR2 binary**, so a node that can *send* `gated = true` necessarily *honors* it on receive — there is no version skew within a single binary. The cross-node rollout rule (**do not enable `--leaderBalance` until every node runs the PR2 gated binary**, §3.4 / §3.6) covers the heterogeneous-cluster case; the optional `transfer_gate` capability pre-check (OQ-14) is the belt-and-suspenders fallback if operator discipline is violated. | Unit (table-driven): policy decisions over crafted leader maps (imbalance threshold, source/target choice, tie-breaks, strict-spread-decrease guard, cooldown gating, startup-grace gating). Engine-level: `TransferLeadershipToServerIfEligible` rejects a lagging / not-`RecentActive` target *before* `rawNode.TransferLeader` (no proposals paused) and the conf-change guard rejects under a pending conf-change — both run under `-race`; **`TransferLeadershipToServerIfEligible` with `maxLag = 0` rejects a target with `Match < Commit`** (asserts `maxLag = 0` is the strictest gate, not a disable); **the bare `TransferLeadershipToServer` (un-gated method) still transfers to a lagging target** (asserts the catch-up gate does not regress the operator path), while the conf-change guard rejects *both* methods (it is unconditional). **In-flight-transfer guard (codex round-9 P2):** with a transfer to target `A` already in flight (`BasicStatus().LeadTransferee == A`), a second `handleTransferLeadership` to a *different* valid voter `B` is **rejected with `errLeadershipTransferInFlight` before `rawNode.TransferLeader`** — assert the first transfer is *not* aborted (`LeadTransferee` stays `A`, no spurious `errLeadershipTransferAborted` on the first caller) and that the guard fires for **both** the gated and ungated methods (it is unconditional, like the conf-change guard); a companion case asserts that *without* the guard etcd/raft v3.6 would have aborted the in-flight transfer and restarted to `B` (documenting why `errLeadershipTransferRejected` does *not* fire here — `B == target.NodeID` after the abort, `raft.go:1637-1645`). **Underflow cases — both directions (codex round-5 P1 + round-6):** with `Commit < maxLag` (fresh group, e.g. `Commit = 5`, `maxLag = 1024`) a caught-up target (`Match = Commit`) is **accepted** (`threshold = 0`); and with `Match > Commit` (a 5-node group where a follower has appended past the commit index, e.g. `Commit = 80`, `Match = 100`, `maxLag = 1024`) the target is **accepted** (`Match = 100 >= threshold = 0` / `Commit - maxLag` saturated to 0) — the gate uses the saturating threshold `Match >= max(0, Commit - maxLag)`, not either wrapping subtraction form (§5). Server-handler routing: `RaftAdmin.TransferLeadership` with `gated = true` lands on `…IfEligible` (including `gated = true, max_lag = 0` → strictest gate), with `gated` unset/false lands on the un-gated method (backward-compat). Integration: **the PR0 multi-voter-group topology** — force all leaders onto one node, enable `--leaderBalance`, assert convergence to ≤ `⌈groups/nodes⌉` per node. Kill-switch + leader-change-reset + startup-grace tests. | After PR0 — completes count-based balancing. | +| **PR3** | **SQS-refusal / pinning awareness** (§3.5: exclude partitioned-FIFO groups and pinned groups) + the **split-job interlock decision** (rely on cooldowns; add explicit interlock only if a reviewer requires it) + `--leaderBalancePinGroups` + (optional) ordered target preference list on the forward path (OQ-10). | Unit: eligibility excludes refused/pinned groups; property test (`rapid`) — never target a refusing node, never exceed one transfer/cycle, never transfer below threshold. | Yes — hardens the policy; safe to ship after PR2. | +| **Future** | **Load-weighted policy**: consume keyviz per-group load (the M3 signal) behind the same policy interface so the objective becomes "balance leader *load*", not just count. Optional cross-node load fan-out (OQ-1). | Unit: weighted scoring; integration with skewed load. | Depends on keyviz signal availability; policy interface unchanged. | +| **Doc lifecycle** (this PR → `*_partial_*` after PR1; → `*_implemented_*` after PR3) | Lifecycle renames begin with *this* PR (no separate PR is needed for the first rename): `*_proposed_*` → `*_partial_*` after PR1; → `*_implemented_*` after PR3 (count-based complete), with the load-weighted PR tracked as a follow-on. `git mv`, propose date fixed. | — | Doc-only. | + +Each PR carries the five-lens self-review and is gated by its tests + `make lint`. + +## 5. Test strategy + +- **Unit (table-driven, co-located `*_test.go`):** the policy function is pure (§3.3), so tests feed a `nodeID → leaderCount` map (plus eligibility set, cooldown state, config, a fixed `now`) and assert the single `TransferDecision` (or "no decision"): imbalance threshold boundary (spread 1 → none, spread 2 → transfer), source = max / target = min, tie-breaks by node/group ID, strict-spread-decrease guard, per-group and global cooldown gating, pinned/refused exclusion. + - **Source fall-through (codex round-6 P2):** with `A=5` (every `A`-led group pinned / in cooldown / mid-conf-change, so `A` has no eligible group), `B=4`, `C=0`, and an eligible `B`-led group whose voter set includes `C`, the policy must **skip `A` and select the `B → C` transfer** (it drops the spread from 5 to 4), not return "no decision." Companion cases: (a) `A=5` with an eligible `A`-led group whose voters include `C` → the move stays on `A` (`A → C`), confirming fall-through only triggers when the max node has *no* eligible strict-reducing move; (b) `A=5` all-pinned, `B=4` all-pinned, `C=0`, no eligible move anywhere → "no decision" (skip the cycle); (c) `A=5` all-pinned, `B=4` eligible but its only candidate target `D=4` would not strictly reduce the spread → fall through past `B` as well (no strict-reducing pair); (d) **4-node tied-minimum (codex round-8 P2 + claude round-12/13 correction):** `A=5` (all-pinned), `B=4`, `C=0`, `D=0`, an eligible `B`-led group with voters `C` and `D` → try `B→C`: `{A=5, B=3, C=1, D=0}` spread = 5 (rejected — `D` stays at 0, full-map check fails); try `B→D`: `{A=5, B=3, C=0, D=1}` spread = 5 (rejected — `C` stays at 0); no eligible `B` target → policy emits `no_eligible_move`. Contrast with the 3-node main case above: that one has no second bottom node to keep the global minimum at 0, so `B→C` reduces spread 5→4 and IS accepted; the 4-node shape exists precisely to exercise the full-map check's two-tied-minimum-with-pinned-max regression that the source/target-only guard would have admitted. +- **Property tests (`pgregory.net/rapid`):** randomized leader maps + cooldown states; assert invariants — at most one transfer per cycle; a transfer always **strictly decreases** the spread (the §3.3 strict-spread guard means it never merely keeps or increases it); never targets a node not in the group's voter set; never targets a pinned/refusing node; never acts below threshold; **and (codex round-6 P2) if *any* eligible strict-spread-reducing `(source, group, target)` exists, the policy returns a transfer — it never returns "no decision" merely because the single most-loaded node has no eligible group** (the decision's source need not be the absolute-max node; it is the first node in descending-count order with such a pair). The decision is fully deterministic for a fixed input (re-running yields the identical `TransferDecision`). +- **Engine catch-up gate — both underflow directions (codex round-5 P1 + round-6).** A dedicated engine-level test (under `internal/raftengine/etcd/`) for the **saturating-threshold** gate (`threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold`) must cover both wraparound regimes: + - **`Commit < maxLag` (young group):** a freshly-bootstrapped group with `Commit = 5`, `maxLag = 1024`, target follower at `Match = 5` (fully caught up) → **accepted** (`threshold = 0`). This is the case the `Match >= Commit - maxLag` form would have *rejected* (the subtraction wraps to ≈`2^64`, so `5 >= huge` is false). Also `Commit = 5`, `Match = 0` → **accepted** (`threshold = 0`). + - **`Match > Commit` (quorum > 2, follower appended past commit):** a 5-node group where a follower has acknowledged entries the leader has not yet committed — e.g. `Commit = 80`, `Match = 100`, `maxLag = 1024` → **accepted** (`threshold = max(0, 80 - 1024) = 0`, `100 >= 0`). This is the case the `Commit - Match <= maxLag` form would have *rejected* (`uint64(80) - uint64(100)` wraps to ≈`2^64 - 20`, which is `> maxLag`), wrongly skipping the best transfer target. Also cover the same `Match > Commit` shape with a tighter budget that still must accept: `Commit = 100000`, `Match = 100050`, `maxLag = 1024` → **accepted** (`threshold = 98976`, `100050 >= 98976`). + - **Long-log rejection (the case the gate exists for):** `Commit = 100000`, `Match = 90000`, `maxLag = 1024` → **rejected** (`threshold = 98976`, `90000 < 98976`), with `Match = 99500` → **accepted** (`99500 >= 98976`). + Property variant (`rapid`): for any `Commit ≥ 0`, any `maxLag ≥ 0`, and any `Match ∈ [0, Commit + maxLag]` (the range **deliberately includes `Match > Commit`** — that is exactly where the `Commit - Match` form underflows), the gate decision equals `Match ≥ saturating_sub(Commit, maxLag)` and never panics / never depends on `uint64` wraparound. **Notation caveat (claude round-7 clarity note):** the `max(0, …)` / `saturating_sub` here is *mathematical* — the property test's reference oracle must **not** transcribe it as Go `max(uint64(0), Commit - maxLag)`, because Go evaluates `Commit - maxLag` first and wraps it to a huge `uint64` (then `max(0, huge) = huge`), reintroducing the very bug under test. The reference oracle uses the same explicit saturating branch the engine guard uses (`var threshold uint64; if Commit > maxLag { threshold = Commit - maxLag }; expected := Match >= threshold`, §3.5) — never any subtraction-then-`max` one-liner. +- **Integration (the PR0 multi-voter-group topology, §1.1a):** the convergence test requires a topology where each of N groups has voters on ≥2 of 3 nodes — which only exists after PR0; the single-process multi-group demo has one voter per group and cannot exercise inter-node transfer. Set up the deterministic starting state by **driving explicit `TransferLeadershipToServer` calls to force all leaderships onto one node** (a clean forced imbalance — chosen over "restart the others" so the test does not also absorb snapshot/recovery timing, keeping CI reproducible); enable `--leaderBalance`; assert the cluster converges to no node leading more than `⌈N/3⌉` groups within a bounded number of cycles, and that it then **stays** converged (no ping-pong) for several further cycles. Separately: flip the kill switch mid-convergence and assert no further transfers; kill the default-group leader and assert the new leader resumes balancing only **after** its startup grace elapses (§3.1), with reset state. + - **Forwarded-path (case 2) coverage.** The "all leaders on one node" starting state above puts every leader on the default-group leader, so every transfer it issues is **case 1** (the balancer hosts leads the groups) and the `RaftAdmin` forward path is never exercised. Add a second integration scenario where the forced imbalance starts on a **non-default-group-leader** node (drive `TransferLeadershipToServer` so a follower of the default group holds the excess leaders), enable `--leaderBalance`, and assert the balancer **forwards** the transfer over the per-group `RaftAdmin` gRPC service to that node and the cluster still converges. Also assert the forward-path TOCTOU/rejection mappings (§3.4, §3.5): a stale dial target (leader moved between observation and dial) maps to a logged skip, not a fatal error; a `errLeadershipTransferRejected` / conf-change-pending rejection on the remote leader maps to `leaderbalance_skipped_total` / `leaderbalance_transfers_failed_total` and a retry next cycle. +- **Jepsen note:** existing Jepsen suites (Redis / DynamoDB workloads under `jepsen/`, `scripts/run-jepsen-local.sh`) already churn leadership via nemeses; the balancer adds *controlled* churn on top. The acceptance bar is **no new anomalies with `--leaderBalance` on** — run the existing suites once with the balancer enabled and confirm linearizability/consistency results are unchanged versus the baseline run. The balancer does not introduce a new workload; it is validated as "safe under the existing safety net." + +## 6. Five-lens self-review (to be filled per PR) + +| Lens | Leader-balance-specific risk to check | +|---|---| +| **Data loss** | A leadership transfer is a standard raft operation that loses no committed entry; etcd/raft completes it only after the transferee has caught up (`waitForLeadershipTransfer`, `etcd/engine.go:1377-1405`). The scheduler issues no writes and touches no FSM/Pebble state. Confirm PR1 is observe-only (no transfers). Confirm a transfer that races a conf-change is *skipped*, not forced (§3.5). | +| **Concurrency / distributed** | Scheduler goroutine vs. default-group leadership flip (leader-local state reset + startup grace, §3.1) — verify the fresh balancer cannot fire inside the grace window, breaking the balancer-host transfer-storm loop; transfer racing a natural election (cooldowns + strict-spread guard, §3.3); follower-initiated transfer is impossible and must be forwarded (§3.4). **Forward-path TOCTOU:** the balancer reads `Leader().ID` at observation (step 1) then dials that node at transfer (step 2); if the leader moved between them the dial target is stale and the remote `handleTransferLeadership` rejects with `errLeadershipTransferNotLeader` (`etcd/engine.go:1754-1757`) — the balancer must map this to a logged skip, never a fatal error. **Target catch-up AND liveness are validated by the executing group leader, never by the balancer or by raft's own accept logic** (the balancer is a follower of the group and has no `Progress`; etcd/raft v3.6 *accepts* a lagging voter and stalls writes rather than rejecting, §3.5). Verify both sub-cases go through the gated `TransferLeadershipToServerIfEligible`, which checks the underflow-safe **saturating-threshold** lag gate (`Match >= max(0, Commit - maxLag)` — never either bare subtraction form, both of which wrap `uint64`: `Match >= Commit - maxLag` on a young group, `Commit - Match <= maxLag` when `Match > Commit` in a quorum > 2 cluster, §3.5 P1 fix) + `Progress.RecentActive` *before* `rawNode.TransferLeader` (so a lagging/partitioned target never pauses proposals) and returns `errLeadershipTransferTargetNotCaughtUp`; the case-2 path passes only a preference and treats that rejection / `errLeadershipTransferRejected` / abort as a retryable skip. **`Status.LastContact` must NOT be used for target liveness** — it is identically 0 on a leader (§3.5); `Progress.RecentActive` is the per-peer signal. **Conf-change race is enforced on the executing leader** (`handleTransferLeadership` rejects when `len(e.pendingConfigs) > 0`, §3.5 / OQ-12) — verify the `pendingConfigs` read takes `e.pending.Lock()` (mutated off the event loop by `cancelPendingConfig`, §3.5) so it is race-free under `-race`, and that decision-time `Status.PendingConfChange` is *not* relied on alone for the forwarded path, since the balancer's follower-local `Status` cannot see the remote leader's pending conf-change. **In-flight-transfer race is also enforced on the executing leader** (§3.5 / OQ-15): verify the execution-time in-flight-transfer guard (`e.rawNode.BasicStatus().LeadTransferee != 0` before `rawNode.TransferLeader`, rejecting with `errLeadershipTransferInFlight`) is present in `handleTransferLeadership` and **unconditional** (not keyed on the `gated` marker), so the balancer's case-2 forward path can never silently cancel an operator's `cmd/raftadmin` transfer or another system-initiated transfer — etcd/raft v3.6 *aborts-and-restarts* a second transfer to a different target rather than rejecting it (`raft.go:1637-1645`), so the post-`TransferLeader` `errLeadershipTransferRejected` check (`etcd/engine.go:1765-1768`) does **not** catch this. The decision-time `Status().LeadTransferee != 0` skip is authoritative only in **case 1** (the balancer leads the group); in case 2 it is a no-op (`LeadTransferee` is `0` on a follower, `internal/raftengine/engine.go:78-82`), which is precisely why the execution-time guard is required. Kill-switch toggle race (atomic). Run `go test -race ./kv/... ./internal/raftengine/...` and the matching Jepsen suite (CLAUDE.md lens 2). | +| **Security** | The forward path uses the `RaftAdmin` gRPC service, which is **unauthenticated** today (`--adminTokenFile` gates only `/Admin/`, `adapter/admin_grpc.go:484`,`:498`). Confirm the `RaftAdmin` listener is on a trusted boundary (the inter-node Raft transport boundary); if/when `RaftAdmin` is put behind a token/mTLS (OQ-11), the forward client must present the credential (§3.4). The balancer adds no new authenticated surface in v1. **Mixed-version forward (codex round-5 P2):** verify a forwarded `gated = true` request is never trusted to be gated on the *receiving* side without either the rollout-rule precondition (don't enable `--leaderBalance` until all nodes run the gated binary, §3.4 / §3.6) or the optional `transfer_gate` capability pre-check (OQ-14); an old leader silently drops the unknown fields and executes ungated. | +| **Performance** | Observation is local `Status()`/`Leader()` reads on a 30 s ticker — no Raft round-trips, no per-`Next()` HLC consensus, no Pebble reads (§3.2). One transfer per cycle bounds churn. Metric cardinality is fixed-enum / per-node only (§3.6). No hot-path code path is touched. | +| **Data consistency** | A transfer perturbs the default group's HLC renewal and lease reads transiently but preserves the HLC physical-ceiling invariant (a clean transfer is no different from a natural election; the 3 s window > 1 s renewal guarantees no in-window reissue, §3.5). Lease-read freshness is preserved — the new leader falls back to `LinearizableRead` until its lease re-warms. No route-catalog mutation occurs (the balancer never writes the catalog). | +| **Test coverage** | Pure-policy table tests + `rapid` invariants + 3-node convergence/anti-ping-pong integration + kill-switch + leader-change reset; Jepsen "no new anomalies with balancer on" run recorded. SQS-refusal/pinning exclusion has dedicated tests in PR3. | + +## 7. Future work (out of scope for v1) + +- **Load-weighted balancing.** Replace the count objective with a leader-*load* objective driven by keyviz per-group aggregates (the same signal M3 consumes, `keyviz/sampler.go`), behind the unchanged policy interface. Optionally pull cross-node per-group load via the keyviz cluster fan-out (mirrors M3 OQ-5). +- **Per-node capability publication** so partitioned-FIFO groups can be balanced among htfifo-capable nodes instead of excluded entirely (OQ-2). +- **Shared scheduler runtime** with the M3 auto-split scheduler (one leader-local control loop hosting both) once both ship — reduces duplicated leader-local state-reset and kill-switch plumbing. + +## 8. Open Questions + +1. **OQ-1 — Count vs. load for v1.** v1 balances leader *count*. Is count sufficient as the first deliverable (groups are roughly equal load), or should v1 already weight by keyviz per-group load? (Recommendation: ship count first; it is deterministic and needs no load signal.) (§3.3, §7) +2. **OQ-2 — SQS-refused groups: exclude vs. balance-among-capable.** v1 excludes partitioned-FIFO groups from balancing entirely (the refusal hook already governs them). Should we instead publish per-node `htfifo` capability and balance those groups among the capable subset? (§3.5) +3. **OQ-3 — Default group: balance first, last, or pinned by default?** *Recommendation (§3.5): balance the default group **last*** — admit it to the eligible-source set only when the §3.5 first pass (non-default groups only) found no eligible reducing move this cycle, so the HLC-renewal restart + lease-read blip + balancer-host-churn risk are deferred until the default group is the only remaining source of imbalance. (Round-7 codex P2 replaced the earlier raw-count gate "every other group's source node is at or below `⌈groups / nodes⌉`" with this eligibility-based two-pass rule because the raw count can be permanently above ceil when non-default groups are pinned / refused / mid-conf-change, deadlocking default-group admission — see §3.5.) Operators who want it pinned entirely use `--leaderBalancePinGroups`. This is preferred over "balance first" (perturbs HLC/catalog seat needlessly) and "pin by default" (leaves the HLC/catalog node permanently over-loaded). Confirm before PR2. (§3.5, §3.6) +4. **OQ-4 — Forward path: how to carry the gating intent + `maxLag` to the remote leader (resolved → extend `RaftAdminTransferLeadershipRequest` with `gated` + `max_lag`).** *Decision (§3.4, also resolving codex round-6 P2):* keep reusing the operator-facing `RaftAdmin.TransferLeadership` service and **add a `bool gated = 3` flag plus a `uint64 max_lag = 4` field** (next free numbers; today only `target_id = 1`/`target_address = 2`, `proto/service.proto:248-251`, `syntax = "proto3"`). The server handler (`internal/raftadmin/server.go:155-172`) routes on the **flag**, not on the numeric value: `gated = true` → gated `TransferLeadershipToServerIfEligible(ctx, id, addr, req.MaxLag)` (and `req.MaxLag == 0` is then the *strictest* budget — require `Match >= Commit`); `gated = false` → un-gated `TransferLeadershipToServer` (the existing behavior for every `cmd/raftadmin` call, whose request builder leaves both fields unset, `parseTransferTarget`, `cmd/raftadmin/main.go:370-378`). Routing on a dedicated `gated` bit rather than on `max_lag == 0` is the codex round-6 P2 fix: `max_lag == 0` was a footgun (operator setting `--leaderBalanceMaxTargetLag = 0` would expect the strictest gate but get the gate *disabled*); field **absence** (`gated = false`), not the value `0`, is the only ungated signal. Backward compatible by proto3 construction (both new fields absent ⇒ `gated = false` ⇒ ungated); no `cmd/raftadmin` change; old/new binaries interoperate (an old server drops both unknown fields and runs ungated). Proto regeneration is version-pinned (`proto/Makefile:1-3`). The proto3-explicit-presence alternative (`optional uint64 max_lag = 3` + `Has_max_lag`) is supported by the pinned toolchain (proto3 explicit presence is stable since protobuf 3.15 / protoc-gen-go 1.27) but was **not** chosen because no in-tree `.proto` uses `optional` today and the explicit `gated` bit reads more clearly (§3.4). This was preferred over a separate internal RPC (a second method/service to maintain and authorize) and over unconditionally gating the existing handler (which would silently gate the operator force-transfer path). A future operator-vs-automated **auth/audit** split (e.g. a bearer token on the automated calls) is still open and couples with OQ-11, but it no longer blocks PR2's forwarded path. (§3.4, §3.5) +5. **OQ-5 — Cleanest "conf-change pending" signal per group.** *Recommendation (§3.5):* add `PendingConfChange bool` to the exported `raftengine.Status` from `len(e.pendingConfigs) > 0` (`etcd/engine.go:390`), mirroring how `LeadTransferee` was surfaced — preferred over leaking `rawNode.BasicStatus().Config.PendingConfIndex` or a new `Admin` method. The `Status()` snapshot must read `len(e.pendingConfigs)` **under `e.pending.Lock()`** (it is mutated off the event loop by `cancelPendingConfig`, §3.5). Confirm this is the chosen signal before PR2. +6. **OQ-6 — Split-job interlock vs. cooldown-only.** v1 relies on cooldowns rather than an explicit "pause during active SplitJob" interlock, to keep this scheduler decoupled from M2/M3. Is cooldown-only acceptable, or should PR3 add the interlock behind the shared interface? (§3.5) +7. **OQ-7 — Per-node gauge cardinality.** `leaderbalance_leaders_per_node{node}` is bounded by node count. In very large clusters, prefer the single `leaderbalance_leader_count_spread` scalar instead? (§3.6) +8. **OQ-8 — Default tuning values.** Are `imbalanceThreshold = 2`, per-group cooldown 30 s, global cooldown 10 s, eval interval 30 s the right starting defaults, or should they be derived from the election timeout? (§3.3, §3.6) +9. **OQ-9 — PR0 topology: bootstrap-extension vs. AddVoter-composition.** *Recommendation (§1.1a): option (a) — extend the flag/bootstrap surface.* The transfer-issuing PRs (PR2/PR3) are blocked on a multi-node multi-group topology, which today cannot be declared at startup (`resolveBootstrapServers` rejects multi-group bootstrap, `main.go:742-748`; one address per group, `shard_config.go:14-17`). Lifting the `len(groups)==1` guard and accepting per-group peer lists (in `--raftGroups` or a companion `--raftBootstrapMembers`-style flag) makes the topology first-class at startup and lets every test harness stand up multi-voter groups deterministically without a per-test `AddVoter` dance. The runtime `AddVoter`/`PromoteLearner` composition path (option (b)) **stays valid for live topology expansion** (adding voters after bootstrap) — it just is not the *gating prerequisite* for the scheduler. Confirm before PR0 lands. (§1.1a, §4) +10. **OQ-10 — Ordered target preference on the forward path.** v1 forwards a single preferred target and relies on the remote leader's submit-time rejection for catch-up safety (§3.5). Should PR3 forward an ordered preference list (least→most loaded) so the leader can fall through to the next healthy voter in one round trip — which requires adding a `repeated` preference field to `RaftAdminTransferLeadershipRequest` alongside the OQ-4 `gated = 3` / `max_lag = 4` extension (single `TargetId`/`TargetAddress` today, `internal/raftadmin/server.go:155-172`, `proto/service.proto:248-251`)? (§3.5, §3.4) +11. **OQ-11 — RaftAdmin authentication.** `RaftAdmin` (including `TransferLeadership`/`AddVoter`/`RemoveServer`) is **unauthenticated** today — `--adminTokenFile` gates only `/Admin/` methods, not `/RaftAdmin/` (`adapter/admin_grpc.go:484`, `:498`). The balancer makes programmatic transfers the steady-state use of this surface. v1 requires the `RaftAdmin` listener to stay on a trusted network boundary; should a bearer token / mTLS be added to `RaftAdmin` before the scheduler runs on an untrusted network, and if so threaded into the forward client (§3.4)? (Couples with OQ-4.) +12. **OQ-12 — Conf-change guard placement on the executing leader.** The case-2 conf-change exclusion must be enforced on the node that *executes* the transfer, because `handleTransferLeadership` does not consult `pendingConfigs` today and the balancer's decision-time `Status` does not reflect a remote leader's pending conf-change (§3.5). The recommendation is to add the guard inside `handleTransferLeadership` (`internal/raftengine/etcd/engine.go:1743-1771`), rejecting with a new `errLeadershipTransferConfChangePending`. **Note (claude round-3 #2): the event loop alone does not make `pendingConfigs` consistent for the reader** — `cancelPendingConfig` (`:3158-3165`) deletes entries under `e.pending.Lock()` from a non-event-loop goroutine (the `ctx.Done()` arm of `submitAdminEx`, `:1366-1367`), so the guard must read `len(e.pendingConfigs)` under `e.pending.Lock()` to be race-free under `-race` (same lock for the OQ-5 `Status.PendingConfChange` snapshot). Alternative: check in the `RaftAdmin.TransferLeadership` server wrapper (`internal/raftadmin/server.go:155-172`) via a new `Admin`/`Status` read. Confirm placement before PR2; the engine-loop guard is preferred because it also protects the case-1 *local* call and any other caller (`cmd/raftadmin`), not just the forward path — and it co-locates with the catch-up/liveness guard the same method needs (§3.5). (§3.5) +13. **OQ-13 — Catch-up/liveness guard surface: engine method vs. exported `Progress` (resolved → engine method).** *Recommendation (§3.5, resolving codex P1/P2 + claude #1/#3):* keep the leader-only `Progress` map internal and add a thin engine `Admin` method `TransferLeadershipToServerIfEligible(ctx, id, addr, maxLag)` that runs the underflow-safe **saturating-threshold** lag gate (`threshold = Commit > maxLag ? Commit - maxLag : 0; eligible = Match >= threshold` — NOT either bare subtraction form: `Match >= Commit - maxLag` wraps on a young group, `Commit - Match <= maxLag` wraps when `Match > Commit` in a quorum > 2 cluster, §3.5 P1 fix) + `Progress.RecentActive` check inside `handleTransferLeadership` before `rawNode.TransferLeader` (mirroring `handlePromoteLearner`'s in-engine `Progress.Match` gate, `internal/raftengine/etcd/engine.go:1674-1688`), returning `errLeadershipTransferTargetNotCaughtUp`. Both `TransferLeadershipToServer` and `TransferLeadershipToServerIfEligible` submit the same `adminActionTransferLeadership` request (`:1325-1331`); the `IfEligible` variant threads `maxLag` **plus an "apply gate" marker** onto the `adminRequest`, and the guard runs **whenever that marker is set** (i.e. for every `…IfEligible` call, including `maxLag = 0` which then means the strictest gate `Match >= Commit`) so the bare un-gated method (and the `cmd/raftadmin` operator path it serves) is unaffected — selection is by method/marker, **not** by `maxLag > 0` (codex round-6 P2). Rejected alternative: export `Progress` (or a `PeerLastContact` map) on `raftengine.Status` — a wider public contract that lets callers build stale eligibility decisions. The single method serves case 1 (local engine) and case 2 (remote leader, reached via the forwarded `RaftAdmin.TransferLeadership` call with `gated = true` set, which the server handler routes to the gated form — §3.4 / OQ-4). Confirm before PR2. (§3.5, §3.4, §4) +14. **OQ-14 — Mixed-version forward gating: rollout rule only, or add a capability pre-check? (codex round-5 P2)** An old `RaftAdmin` server silently drops the new `max_lag` field (proto3 unknown field) and executes a forwarded transfer **ungated** (`internal/raftadmin/server.go:155-172`). The design's primary defense is **structural** — the gated handler and the only emitter (the balancer) ship in the same PR2 binary, so a single binary never has skew — plus an **operator rollout rule**: do not enable `--leaderBalance` until every node runs the gated binary (§3.4, §3.6), mirroring the encryption mutator rolling-restart discipline (`main_encryption_admin.go:42-49`). Should PR3 also add a **runtime capability pre-check** before forwarding (advertise a `transfer_gate` bit on a per-node capability report — extend `CapabilityReport` à la the encryption `GetCapability` fan-out, `adapter/encryption_admin.go:304-359`, or add a lightweight `RaftAdmin.GetCapability`, mirroring the SQS `htfifo` poller `adapter/sqs_capability_poller.go:120`), so the balancer **skips** (`leaderbalance_skipped_total{reason=peer_ungated}`) any source-group leader that does not advertise it? The same per-node report could also carry the `htfifo` bit OQ-2 needs, so balancing partitioned-FIFO groups among capable nodes and gating mixed-version forwards would share one mechanism. *Recommendation: ship the structural ordering + rollout rule in PR2 (sufficient when followed); add the capability pre-check in PR3 as defense-in-depth, coupled with OQ-2's per-node capability publication.* (§3.4, §3.6, §3.5; couples with OQ-2) +15. **OQ-15 — In-flight-transfer guard placement and scope (resolved → execution-time guard in `handleTransferLeadership`, unconditional) (codex round-9 P2).** A second `MsgTransferLeader` to a *different* target does not get rejected by etcd/raft v3.6 — it **aborts the in-flight transfer and starts the new one** (`raft.go:1637-1645`), and the in-tree post-call check (`etcd/engine.go:1765-1768`) does **not** catch this (after the restart `BasicStatus().LeadTransferee == target.NodeID`, so the `!=` check is false). The decision-time `Status().LeadTransferee != 0` skip is a no-op in case 2 (the balancer is a follower and reads `0`). *Decision (§3.5):* add an **execution-time guard inside `handleTransferLeadership`** that reads `e.rawNode.BasicStatus().LeadTransferee` before `rawNode.TransferLeader` and rejects a non-zero value with a new `errLeadershipTransferInFlight` (race-free on the event loop, no new lock — the same read is already used at `etcd/engine.go:1567`/`:1593`). **Scope: unconditional** (applies to the gated balancer path *and* the ungated `cmd/raftadmin` operator path), mirroring the conf-change guard's asymmetry vs. the `gated`-keyed catch-up gate — because silently canceling a peer's in-flight transfer is never a legitimate operation (the operator's intentional-override path is the kill switch + manual retry, not abort-by-force), and unconditional also closes the case-1 TOCTOU window. The rejected alternative (gate only the `gated = true` path) would still let `cmd/raftadmin` silently cancel a balancer transfer and leave the balancer's transfers cancelable by a racing operator force-transfer. Alternative placement (check in the `RaftAdmin.TransferLeadership` server wrapper) is rejected for the same reason as OQ-12: the engine-loop guard also protects the case-1 local call and every other caller, and co-locates with the conf-change + catch-up guards the same handler already needs. Confirm before PR2. (§3.5, §4, §6) + +## 9. Lifecycle + +This document begins as `*_proposed_*`. Per CLAUDE.md / `docs/design/README.md`: + +- Rename to `*_partial_*` after the first milestone lands (PR0 topology and/or PR1 observation + metrics), recording which of PR0/PR1 shipped and what remains. +- Rename to `*_implemented_*` after PR3 ships (count-based balancing on a real multi-node topology, with SQS-refusal/pinning awareness complete), with the load-weighted policy tracked as a follow-on. + +Use `git mv` so history follows the rename. The propose date (2026-06-11) and slug stay fixed.