docs(design): propose multi-node multi-group bootstrap#955
Conversation
|
Warning Review limit reached
More reviews will be available in 47 minutes and 37 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@claude review |
|
Claude finished @bootjp's task in 2m 37s —— View job Code Review —
|
| Claim | Verified |
|---|---|
groupSpec is id uint64; address string, shard_config.go:14-17 |
✓ |
resolveBootstrapServers guard at main.go:746-748, error at :736 |
✓ |
buildRuntimeForGroup passes single bootstrapServers to every group, multiraft_runtime.go:234 |
✓ |
Transport built only if len(peers) > 1, factory.go:49-52 |
✓ |
bootstrap = *raftBootstrap || len(bootstrapServers) > 0, main.go:534 |
✓ |
errClusterMismatch at engine.go:116 |
✓ |
persisted peers loaded at factory.go:43-47 |
✓ |
demo.go:200-219 etcd all-same-list bootstrap + comment |
✓ |
parseSQSFifoPartitionMap ;-separator precedent, shard_config.go:174-196 |
✓ |
parseRaftBootstrapMembers duplicate-id check, shard_config.go:373-375 |
✓ |
Open questions: recommendations
| OQ | Recommendation | Rationale |
|---|---|---|
| OQ-1 (companion flag) | Companion --raftGroupPeers ✓ |
Clean separation already well-argued in §3.1 |
OQ-2 (--raftBootstrap implied) |
Implied ✓ | Consistent with main.go:534 — non-empty peers already implies bootstrap there |
| OQ-3 (map vs. func) | map[uint64][]Server |
Simpler; eager validation at startup is better than deferred |
| OQ-4 (heterogeneous) | Hard error in v1 ✓ | Forward-compatible (grammar already expresses it; relax later with a targeted consumer) |
| OQ-5 (learner interop) | Voter-only in v1 ✓ | Learner bootstrap is a separate concern; mixing at bootstrap adds ordering complexity |
| OQ-6 (per-group vs. shared listener) | Per-group in v1 ✓ | Zero transport change; port count only becomes a problem at M >> typical deployments |
Verdict: Approve. The two substantive issues (data-loss-lens gap in §3.2, per-group bootstrap flag scope gap in PR-B) are addressable in implementation PRs and both are within the mandate of "PR carries the five-lens review." The cross-references are exceptionally well-grounded.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 252df56017
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up. | ||
|
|
||
| **Idempotency on restart (decision: persisted-peers + marker dir already give this — per group).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`), and `Open` refuses to start if the configured cluster disagrees with what is persisted (`validateOpenPeers` → `errClusterMismatch`, `engine.go:632`, `:116`). So: |
There was a problem hiding this comment.
Fix restart mismatch detection before relying on it
For the restart scenario where an operator changes --raftGroupPeers after a group has already written etcd-raft-peers.bin, this design relies on errClusterMismatch to catch the divergent flag list, but the current code discards that list before validation: Factory.Create loads persisted peers and overwrites cfg.Peers, and normalizeOpenConfig does the same before validateOpenPeers runs. As a result, validateOpenPeers compares the snapshot to the persisted list, not the new flag-supplied list, so an implementation following this doc would silently ignore divergent --raftGroupPeers rather than fail fast unless the factory/open path is changed to validate the configured list separately.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request proposes a design for a multi-node multi-group bootstrap mechanism in elastickv, introducing a new --raftGroupPeers flag to configure per-group peer lists at startup. The feedback suggests avoiding the use of a function provider for retrieving bootstrap servers, recommending a static map instead, as the group configuration is static at runtime.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
|
||
| ### 3.2 Bootstrap semantics | ||
|
|
||
| The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a `bootstrapServersFor func(groupID uint64) []raftengine.Server` lookup (or a `map[uint64][]raftengine.Server`), built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change. |
There was a problem hiding this comment.
To prevent designing for hypothetical future requirements, we should avoid introducing function providers when the value is static at runtime. Since the set of groups is fixed at startup, a static map[uint64][]raftengine.Server should be used instead of a function provider lookup like func(groupID uint64) []raftengine.Server.
| The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a `bootstrapServersFor func(groupID uint64) []raftengine.Server` lookup (or a `map[uint64][]raftengine.Server`), built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change. | |
| The wiring change is small and local: instead of one process-wide bootstrapServers threaded into every group, resolve a per-group []raftengine.Server and pass each group its own list. Concretely, buildShardGroups / buildRuntimeForGroup change from a single bootstrapServers []raftengine.Server parameter (multiraft_runtime.go:234, main.go:777) to a map[uint64][]raftengine.Server lookup, built once from the parsed --raftGroupPeers. Everything downstream - the factory's len(peers) > 1 transport gate (factory.go:50), Open's peer normalize/validate/persist (engine.go:620-643), the marker dir, the per-group listener - already operates per group and needs no change. |
References
- Do not design for hypothetical future requirements by introducing function providers or atomic variables when the value is static at runtime. Use a static value instead, and upgrade to dynamic/thread-safe types only when a runtime update path is actually introduced.
|
|
||
| 1. **OQ-1 — Flag spelling: `--raftGroupPeers` companion flag vs. extending `--raftGroups` entries.** §3.1 recommends the companion flag (clean separation of "my listener" vs. "the voter set"; mirrors single-group `--raftBootstrapMembers`). Confirm before PR-A, since it fixes the operator-facing surface. | ||
| 2. **OQ-2 — Should `--raftBootstrap` be *required* alongside `--raftGroupPeers`, or implied?** §3.2 recommends implied (non-empty `--raftGroupPeers` ⇒ bootstrap=true per group, matching `bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). Alternative: require `--raftBootstrap` explicitly for symmetry with single-group. Confirm before PR-B. | ||
| 3. **OQ-3 — Per-group bootstrap-server carrier: `func(groupID) []Server` vs. `map[uint64][]Server`.** Implementation detail for threading through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). Map is simpler; func defers construction. Either is fine; pick at PR-B. |
There was a problem hiding this comment.
Following our architectural guidelines, we should not design for hypothetical future requirements by introducing function providers when the value is static at runtime. Since the group membership is static at runtime, we should resolve OQ-3 by choosing the static map[uint64][]Server approach and removing the function provider option.
| 3. **OQ-3 — Per-group bootstrap-server carrier: `func(groupID) []Server` vs. `map[uint64][]Server`.** Implementation detail for threading through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). Map is simpler; func defers construction. Either is fine; pick at PR-B. | |
| 3. OQ-3 - Per-group bootstrap-server carrier: map[uint64][]Server. Implementation detail for threading through buildShardGroups/buildRuntimeForGroup (main.go:777, multiraft_runtime.go:234). A static map is simpler and avoids introducing a function provider for values that are static at runtime. |
References
- Do not design for hypothetical future requirements by introducing function providers or atomic variables when the value is static at runtime. Use a static value instead, and upgrade to dynamic/thread-safe types only when a runtime update path is actually introduced.
Issue 1: correct stale line anchors in kv/sharded_coordinator.go - keyviz sampler observeMutation: 1795-1824 -> 1841-1846 - RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961 Issue 2: soften companion-doc references to in-flight PR #955 form, matching the #945/#951/#953 branch-reference style (3 sites + map line). Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/ resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition. Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader, caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7 for the single-node->multi-node live cutover / rolling-upgrade strategy.
Summary
Doc-only PR. Proposes multi-node multi-group bootstrap — the prerequisite for true distributed multi-Raft scaling. Today a real multi-node deployment can only run single-group:
resolveBootstrapServersrejects--raftBootstrapMemberswhenlen(groups) != 1(main.go:744-748),groupSpeccarries one address per group (shard_config.go:14-17), andbuildRuntimeForGroupthreads the same (nil-in-multi-group)bootstrapServersinto every group (multiraft_runtime.go:234-254), so each group bootstraps single-member and the factory never even builds a transport (factory.go:49-52). This is exactly the PR0 / OQ-9 "option (a)" gap the leader-balance design (#953) identified.New file:
docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md(Status: Proposed / Author: bootjp / Date: 2026-06-12). Every current-state claim is cited at file:line, verified againstmain.Key design points
--raftGroupPeers "1=n1@host1:5051,n2@host2:5051,...;2=..."(per-groupraftID@host:portvoter lists), lifting thelen(groups)==1guard.--raftGroups/--raftBootstrapMembers/--raftBootstrapand thecmd/server/demo.godemo are unchanged; mutually exclusive with--raftBootstrapMembers; full validation rules (local-node-present, addr-match, homogeneity).cmd/server/demo.go:204-219already uses for single-group), NOT a designated proposer. Idempotent restart and divergent-config fail-fast come free from the existing persisted-peers +errClusterMismatch+raft-enginemarker machinery (already per-group).5005{1..6}convention). No transport change — only feeding per-group lists in.Open questions
OQ-1 flag spelling (companion flag vs. extend
--raftGroups); OQ-2--raftBootstraprequired vs. implied; OQ-3 per-group carrier (map vs. func); OQ-4 heterogeneous membership (hard error vs. warning in v1); OQ-5 learner-start interop; OQ-6 single multiplexed raft listener for very high M.Test plan
Doc-only — no code, no tests run. The doc itself proposes the test strategy: unit (flag-parse table tests, per-group resolution, restart idempotency), integration (in-process 3-node × 2-group harness asserting multi-voter
Configuration()+ inter-nodeTransferLeadershipToServer), and a later Jepsen milestone (true multi-node multi-group runner).