Skip to content

docs(design): propose multi-node multi-group bootstrap#955

Open
bootjp wants to merge 1 commit into
mainfrom
design/multinode-multigroup-bootstrap
Open

docs(design): propose multi-node multi-group bootstrap#955
bootjp wants to merge 1 commit into
mainfrom
design/multinode-multigroup-bootstrap

Conversation

@bootjp

@bootjp bootjp commented Jun 12, 2026

Copy link
Copy Markdown
Owner

Summary

Doc-only PR. Proposes multi-node multi-group bootstrap — the prerequisite for true distributed multi-Raft scaling. Today a real multi-node deployment can only run single-group: resolveBootstrapServers rejects --raftBootstrapMembers when len(groups) != 1 (main.go:744-748), groupSpec carries one address per group (shard_config.go:14-17), and buildRuntimeForGroup threads the same (nil-in-multi-group) bootstrapServers into every group (multiraft_runtime.go:234-254), so each group bootstraps single-member and the factory never even builds a transport (factory.go:49-52). This is exactly the PR0 / OQ-9 "option (a)" gap the leader-balance design (#953) identified.

New file: docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md (Status: Proposed / Author: bootjp / Date: 2026-06-12). Every current-state claim is cited at file:line, verified against main.

Key design points

  • Flag surface: new companion --raftGroupPeers "1=n1@host1:5051,n2@host2:5051,...;2=..." (per-group raftID@host:port voter lists), lifting the len(groups)==1 guard. --raftGroups/--raftBootstrapMembers/--raftBootstrap and the cmd/server/demo.go demo are unchanged; mutually exclusive with --raftBootstrapMembers; full validation rules (local-node-present, addr-match, homogeneity).
  • Bootstrap semantics: all-nodes-same-list (the etcd model cmd/server/demo.go:204-219 already uses for single-group), NOT a designated proposer. Idempotent restart and divergent-config fail-fast come free from the existing persisted-peers + errClusterMismatch + raft-engine marker machinery (already per-group).
  • Transport/addressing: transport is already per-group and address-map driven; keep one listener per group (the 5005{1..6} convention). No transport change — only feeding per-group lists in.
  • Alternative considered: AddVoter-composition stays the live-expansion path, not the bootstrap path (per docs(design): propose leader balance scheduler #953 OQ-9).
  • Cross-doc impact: unblocks leader-balance PR2/PR3 (docs(design): propose leader balance scheduler #953) and hotspot-split M2 cross-group migration testing (docs(design): propose hotspot split M2 — migration plane #945).

Open questions

OQ-1 flag spelling (companion flag vs. extend --raftGroups); OQ-2 --raftBootstrap required vs. implied; OQ-3 per-group carrier (map vs. func); OQ-4 heterogeneous membership (hard error vs. warning in v1); OQ-5 learner-start interop; OQ-6 single multiplexed raft listener for very high M.

Test plan

Doc-only — no code, no tests run. The doc itself proposes the test strategy: unit (flag-parse table tests, per-group resolution, restart idempotency), integration (in-process 3-node × 2-group harness asserting multi-voter Configuration() + inter-node TransferLeadershipToServer), and a later Jepsen milestone (true multi-node multi-group runner).

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 47 minutes and 37 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 744bda98-44ae-4f55-bfc7-93bf7024da8a

📥 Commits

Reviewing files that changed from the base of the PR and between 688ba8a and 252df56.

📒 Files selected for processing (1)
  • docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch design/multinode-multigroup-bootstrap

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@bootjp

bootjp commented Jun 12, 2026

Copy link
Copy Markdown
Owner Author

@claude review

@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Claude finished @bootjp's task in 2m 37s —— View job


Code Review — docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md

  • Gather context and understand request
  • Read design document
  • Cross-check source code references
  • Review design for correctness, completeness, and consistency

Overall: This is a high-quality design proposal. The background analysis is meticulous — every cited file:line was verified against main and is accurate. The problem decomposition (gap → primitives → flag surface → bootstrap semantics → transport → test strategy) is logical and complete. The PR breakdown (A/B/C/D) is well-scoped. The comments below are refinements, not blockers; the doc is ready to land as-is and the open questions are correctly identified.


Findings

1. Data-loss lens: Bootstrap=true passed to already-initialized dirs (§3.2)

The doc states "on restart the persisted-peers path takes over regardless, so the bootstrap flag is moot once a dir has state" — but this relies on an implicit invariant in Open() that is never spelled out in the proposal.

Looking at factory.go:43-47, the factory loads persisted peers and substitutes them for cfg.Peers, but still passes the Bootstrap=true flag through to Open() unchanged. The safety claim hinges on Open() detecting an existing WAL and silently ignoring Bootstrap — but that path is not cited. If Open() does not suppress re-bootstrap when a WAL exists and Bootstrap=true is set, a node restart with --raftGroupPeers still present could re-bootstrap over committed data (CLAUDE.md lens 1: data loss).

Recommendation: Add a sentence to §3.2 citing the specific Open() / WAL-detection path that makes Bootstrap=true a no-op on restart (e.g. the HardState/WAL existence check in engine.go). This needs to be verified at PR-B and called out in that PR's five-lens review. Fix this →


2. Implicit invariant: all nodes must have the same --raftGroupPeers (not validated at first bootstrap)

The doc's validation rules (§3.1) catch local-node-absent and addr-mismatch, and errClusterMismatch (§3.2) fires on restart. But on a first bootstrap with a fresh dir, if node B has a divergent list from nodes A and C for the same group, the errClusterMismatch guard never fires — each node bootstraps independently with its (mutually inconsistent) peer list, forming separate disconnected "clusters" that can never elect a shared leader or form quorum.

This is not a showstopper (it produces an obvious symptom — no leader elected, no progress), but the failure mode should be briefly acknowledged in §3.2's partial-bootstrap section alongside the other recovery paths. A sentence like "nodes bootstrapped with mismatched lists will form disjoint single-group clusters with no shared quorum; recovery is to wipe all fresh dirs and restart with consistent configs" rounds out the failure taxonomy. Optionally, §6.2's integration test assertions could add a negative case.


3. buildShardGroups's single bootstrap bool parameter needs to become per-group (PR-B scope gap)

buildShardGroups takes a single process-wide bootstrap bool and threads it to every group (multiraft_runtime.go:234, main.go:776). The proposal calls for per-group peer lists, but §3.2's "implies bootstrap=true for all groups" decision also means the bootstrap flag needs to become per-group — a group not in --raftGroupPeers should get bootstrap=false (falling back to today's path), while a group that is in --raftGroupPeers gets bootstrap=true.

This is a PR-B scope item not called out explicitly. If PR-B keeps the single bootstrap bool, groups not configured via --raftGroupPeers would inherit the process-wide flag, which may silently bootstrap them as single-member even when --raftBootstrap is false. Recommendation: add an explicit bullet to §6.4 PR-B noting "the process-wide bootstrap bool in buildShardGroups/buildRuntimeForGroup becomes a per-group determination, derived from whether that group has a non-nil resolved peer list."


4. Minor: line reference main.go:744-748 is slightly imprecise

The doc cites main.go:744-748 for the len(groups) != 1 guard. The guard itself is at lines 746–748; lines 743–745 are the preceding if bootstrapMembers == "" early return. The cited range includes lines that belong to a different condition. This is cosmetic, but for a doc whose stated goal is that "every current-state claim is cited at file:line, verified against main", tightening to main.go:746-748 matches the claim more precisely.


5. Operator experience: a concrete 3-node × 2-group flag example would strengthen §3.1

The grammar specification in §3.1 is clear, but the first thing an operator will want to do is grep for "what do I actually type on node 2 of a 3-node × 2-group cluster." A compact example block — showing the full --raftGroups, --raftGroupPeers, and --raftRedisMap flags for each of the 3 nodes — would make the flag surface immediately graspable without having to mentally assemble it from the grammar. The 5005{1..6} port convention is already documented but lives implicitly in §3.4; surfacing it in a concrete example in §3.1 would reduce the cognitive load when reviewing PR-A.

This is a polish suggestion, not a correctness issue.


Verified accurate

Claim Verified
groupSpec is id uint64; address string, shard_config.go:14-17
resolveBootstrapServers guard at main.go:746-748, error at :736
buildRuntimeForGroup passes single bootstrapServers to every group, multiraft_runtime.go:234
Transport built only if len(peers) > 1, factory.go:49-52
bootstrap = *raftBootstrap || len(bootstrapServers) > 0, main.go:534
errClusterMismatch at engine.go:116
persisted peers loaded at factory.go:43-47
demo.go:200-219 etcd all-same-list bootstrap + comment
parseSQSFifoPartitionMap ;-separator precedent, shard_config.go:174-196
parseRaftBootstrapMembers duplicate-id check, shard_config.go:373-375

Open questions: recommendations

OQ Recommendation Rationale
OQ-1 (companion flag) Companion --raftGroupPeers Clean separation already well-argued in §3.1
OQ-2 (--raftBootstrap implied) Implied ✓ Consistent with main.go:534 — non-empty peers already implies bootstrap there
OQ-3 (map vs. func) map[uint64][]Server Simpler; eager validation at startup is better than deferred
OQ-4 (heterogeneous) Hard error in v1 ✓ Forward-compatible (grammar already expresses it; relax later with a targeted consumer)
OQ-5 (learner interop) Voter-only in v1 ✓ Learner bootstrap is a separate concern; mixing at bootstrap adds ordering complexity
OQ-6 (per-group vs. shared listener) Per-group in v1 ✓ Zero transport change; port count only becomes a problem at M >> typical deployments

Verdict: Approve. The two substantive issues (data-loss-lens gap in §3.2, per-group bootstrap flag scope gap in PR-B) are addressable in implementation PRs and both are within the mandate of "PR carries the five-lens review." The cross-references are exceptionally well-grounded.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 252df56017

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


We do **not** invent a "lexicographically-smallest peer proposes, others wait-and-join" protocol. That single-proposer pattern is the *AddVoter-composition* path (§5), not the bootstrap path — and adopting it for bootstrap would mean the non-proposer nodes start with an empty conf and must be added one-by-one, which is fragile (ordering, the proposer must be up first and must be leader) and is exactly the "manual AddVoter dance in every test harness" #953 OQ-9 rejected. The all-nodes-same-list model has no designated-proposer ordering requirement: nodes can start in any order, and raft elects a leader once a quorum is up.

**Idempotency on restart (decision: persisted-peers + marker dir already give this — per group).** On first open of a group dir, `Open` writes the normalized peer set to the persisted-peers file (`savePersistedPeers`, `engine.go:643`; format in `peer_metadata.go:205`). On restart, the factory **loads the persisted peers and uses them in preference to the flag-supplied list** (`factory.go:43-47`), and `Open` refuses to start if the configured cluster disagrees with what is persisted (`validateOpenPeers` → `errClusterMismatch`, `engine.go:632`, `:116`). So:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Fix restart mismatch detection before relying on it

For the restart scenario where an operator changes --raftGroupPeers after a group has already written etcd-raft-peers.bin, this design relies on errClusterMismatch to catch the divergent flag list, but the current code discards that list before validation: Factory.Create loads persisted peers and overwrites cfg.Peers, and normalizeOpenConfig does the same before validateOpenPeers runs. As a result, validateOpenPeers compares the snapshot to the persisted list, not the new flag-supplied list, so an implementation following this doc would silently ignore divergent --raftGroupPeers rather than fail fast unless the factory/open path is changed to validate the configured list separately.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes a design for a multi-node multi-group bootstrap mechanism in elastickv, introducing a new --raftGroupPeers flag to configure per-group peer lists at startup. The feedback suggests avoiding the use of a function provider for retrieving bootstrap servers, recommending a static map instead, as the group configuration is static at runtime.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.


### 3.2 Bootstrap semantics

The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a `bootstrapServersFor func(groupID uint64) []raftengine.Server` lookup (or a `map[uint64][]raftengine.Server`), built once from the parsed `--raftGroupPeers`. Everything downstream — the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener — already operates per group and needs no change.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent designing for hypothetical future requirements, we should avoid introducing function providers when the value is static at runtime. Since the set of groups is fixed at startup, a static map[uint64][]raftengine.Server should be used instead of a function provider lookup like func(groupID uint64) []raftengine.Server.

Suggested change
The wiring change is small and local: instead of one process-wide `bootstrapServers` threaded into every group, **resolve a per-group `[]raftengine.Server` and pass each group its own list**. Concretely, `buildShardGroups` / `buildRuntimeForGroup` change from a single `bootstrapServers []raftengine.Server` parameter (`multiraft_runtime.go:234`, `main.go:777`) to a `bootstrapServersFor func(groupID uint64) []raftengine.Server` lookup (or a `map[uint64][]raftengine.Server`), built once from the parsed `--raftGroupPeers`. Everything downstream the factory's `len(peers) > 1` transport gate (`factory.go:50`), `Open`'s peer normalize/validate/persist (`engine.go:620-643`), the marker dir, the per-group listener already operates per group and needs no change.
The wiring change is small and local: instead of one process-wide bootstrapServers threaded into every group, resolve a per-group []raftengine.Server and pass each group its own list. Concretely, buildShardGroups / buildRuntimeForGroup change from a single bootstrapServers []raftengine.Server parameter (multiraft_runtime.go:234, main.go:777) to a map[uint64][]raftengine.Server lookup, built once from the parsed --raftGroupPeers. Everything downstream - the factory's len(peers) > 1 transport gate (factory.go:50), Open's peer normalize/validate/persist (engine.go:620-643), the marker dir, the per-group listener - already operates per group and needs no change.
References
  1. Do not design for hypothetical future requirements by introducing function providers or atomic variables when the value is static at runtime. Use a static value instead, and upgrade to dynamic/thread-safe types only when a runtime update path is actually introduced.


1. **OQ-1 — Flag spelling: `--raftGroupPeers` companion flag vs. extending `--raftGroups` entries.** §3.1 recommends the companion flag (clean separation of "my listener" vs. "the voter set"; mirrors single-group `--raftBootstrapMembers`). Confirm before PR-A, since it fixes the operator-facing surface.
2. **OQ-2 — Should `--raftBootstrap` be *required* alongside `--raftGroupPeers`, or implied?** §3.2 recommends implied (non-empty `--raftGroupPeers` ⇒ bootstrap=true per group, matching `bootstrap = *raftBootstrap || len(bootstrapServers) > 0`, `main.go:534`). Alternative: require `--raftBootstrap` explicitly for symmetry with single-group. Confirm before PR-B.
3. **OQ-3 — Per-group bootstrap-server carrier: `func(groupID) []Server` vs. `map[uint64][]Server`.** Implementation detail for threading through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). Map is simpler; func defers construction. Either is fine; pick at PR-B.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Following our architectural guidelines, we should not design for hypothetical future requirements by introducing function providers when the value is static at runtime. Since the group membership is static at runtime, we should resolve OQ-3 by choosing the static map[uint64][]Server approach and removing the function provider option.

Suggested change
3. **OQ-3 Per-group bootstrap-server carrier: `func(groupID) []Server` vs. `map[uint64][]Server`.** Implementation detail for threading through `buildShardGroups`/`buildRuntimeForGroup` (`main.go:777`, `multiraft_runtime.go:234`). Map is simpler; func defers construction. Either is fine; pick at PR-B.
3. OQ-3 - Per-group bootstrap-server carrier: map[uint64][]Server. Implementation detail for threading through buildShardGroups/buildRuntimeForGroup (main.go:777, multiraft_runtime.go:234). A static map is simpler and avoids introducing a function provider for values that are static at runtime.
References
  1. Do not design for hypothetical future requirements by introducing function providers or atomic variables when the value is static at runtime. Use a static value instead, and upgrade to dynamic/thread-safe types only when a runtime update path is actually introduced.

bootjp added a commit that referenced this pull request Jun 12, 2026
Issue 1: correct stale line anchors in kv/sharded_coordinator.go
  - keyviz sampler observeMutation: 1795-1824 -> 1841-1846
  - RunHLCLeaseRenewal: 1914-1953 -> 1960-1985; defaultGroup access :1915 -> :1961
Issue 2: soften companion-doc references to in-flight PR #955 form,
  matching the #945/#951/#953 branch-reference style (3 sites + map line).
Issue 3: ground OQ-1 in the actual commitTS logic (nextStartTS/
  resolveTxnCommitTS/nextTxnTSAfter all from one c.clock) and annotate
  §4 step 11 as deferred-pending-OQ-1 with an explicit trigger condition.
Inline: fix learner LinearizableRead behaviour (engine returns ErrNotLeader,
  caller forwards; engine.go:1583); note shared-startTS invariant; add OQ-7
  for the single-node->multi-node live cutover / rolling-upgrade strategy.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant