diff --git a/docs/design/2026_06_12_proposed_scaling_roadmap.md b/docs/design/2026_06_12_proposed_scaling_roadmap.md
new file mode 100644
index 00000000..00541bd3
--- /dev/null
+++ b/docs/design/2026_06_12_proposed_scaling_roadmap.md
@@ -0,0 +1,513 @@
+# Scaling Roadmap — what exists, what is proposed, what is missing
+
+Status: Proposed
+Author: bootjp
+Date: 2026-06-12
+
+This is a roadmap / implementation-plan document (the kind `docs/design/README.md`
+explicitly admits under "Concrete implementation plans"). It surveys the
+current scaling envelope of elastickv, maps every scaling dimension to the
+design that already addresses it, and identifies the gaps where no design
+exists yet. Every claim about current behaviour below is anchored to a file
+and line on `main` at the propose date; every referenced design doc is one
+that actually exists in `docs/design/` (or, for in-flight work, an open PR
+branch named explicitly).
+
+Scope: this doc proposes **no implementation**. It proposes a sequence and a
+short list of new design docs to write. It does not duplicate the detail of
+the docs it references; read those for mechanism.
+
+---
+
+## 1. Current scaling envelope
+
+What bounds a single elastickv deployment today:
+
+1. **Single-process multi-group only.** Multiple Raft groups run in one
+   process (`--raftGroups id=addr,…`, `shard_config.go:61-99`); each group
+   gets its own engine and its own gRPC listener at `rt.spec.address`
+   (`main.go:1606-1620`, listener at `:1613`). But a *group whose voter set
+   spans more than one node is not deployable from the startup flags*:
+   `resolveBootstrapServers` rejects `--raftBootstrapMembers` whenever
+   `len(groups) != 1` (`main.go:746`, `ErrBootstrapMembersRequireSingleGroup`
+   at `:736`), and `buildRuntimeForGroup` passes the same `bootstrapServers`
+   (which is `nil` in any multi-group config) to every group with
+   `LocalAddress: group.address` (`multiraft_runtime.go:246-254`), so each
+   group in a multi-group process bootstraps as a **single-member** cluster.
+   The multi-group story today is "one process hosts every group, each group
+   has one voter" (the demo / Jepsen M5 model). True multi-node multi-group
+   is not wired (see §2(b), §2(e)).
+
+2. **Per-group memory.** Each Raft group owns a private Pebble store, and
+   `NewPebbleStore` → `defaultPebbleOptionsWithCache` allocates a *fresh*
+   block cache per store (`store/lsm_store.go:273-279`, `pebble.NewCache(pebbleCacheBytes)`
+   at `:274`), default 256 MiB (`defaultPebbleCacheBytes` at `:66`). N groups
+   on a node means N × 256 MiB of block cache alone, plus N memtables, N WALs,
+   N compaction budgets. The TODO at `store/lsm_store.go:117-120` records the
+   intended fix (a process-wide shared cache plumbed through `NewPebbleStore`)
+   but it is a comment, not a design.
+
+3. **Leader concentration.** Leadership of each group is elected
+   independently by etcd/raft; nothing spreads leaderships across nodes. One
+   node can end up leading every group while peers idle, carrying all
+   leader-only work (write proposals, HLC ceiling renewal, lease reads, OCC
+   timestamp issuance, route-catalog proposes). This is the explicit
+   motivation for PR #953 (leader balance scheduler).
+
+4. **Range granularity is fixed after a manual split.** M1 of the hotspot
+   shard split shipped a durable, versioned route catalog and a same-group
+   `SplitRange` RPC (`adapter/distribution_server.go`, `distribution/catalog.go`,
+   `distribution/engine.go`, `distribution/watcher.go`). There is no data
+   movement (cross-group migration is M2, in flight), no auto-detection (M3,
+   in flight), and **no merge** (no design exists). The detection counters in
+   `distribution/engine.go` (`RecordAccess`) are dead code — not called from
+   any request path (confirmed in the M3 doc §1.3).
+
+5. **Cross-group timestamps via per-group HLC, and not even per-group yet.**
+   `ShardedCoordinator.RunHLCLeaseRenewal` proposes the physical-ceiling
+   renewal **only to the default group** (`kv/sharded_coordinator.go:1960-1985`,
+   `group, ok := c.groups[c.defaultGroup]` at `:1961`). A node that leads a
+   non-default group but is not a member of the default group never advances
+   its ceiling from that group — exactly the cross-group monotonicity gap the
+   centralized-TSO doc §1.1 describes. The doc's "near-term fix" (M1: iterate
+   *all* led groups in parallel) is **not yet implemented** on `main`. Today
+   this gap is masked because every group is single-node (§1.1): all FSMs in
+   the process share one `*HLC` (CLAUDE.md "Timestamp Oracle"), so the default
+   group's ceiling covers the whole process. The moment groups span nodes, the
+   gap becomes real.
+
+6. **Large values traverse the Raft log.** Every byte of an S3 object travels
+   through the Raft log as `s3keys.BlobKey` entries (the S3 raft-blob-offload
+   doc §1). WAL and snapshot size scale with object bytes, and follower
+   catch-up re-applies every byte on the single-threaded apply loop.
+
+7. **FSM apply is serial per group.** `kvFSM.Apply` runs on the group's
+   single apply goroutine — "Raft applies are serial" (`kv/fsm.go:123-124`,
+   `Apply` at `:297`). Apply throughput per group is bounded by that one
+   goroutine; the only way to add apply parallelism today is to add groups.
+
+---
+
+## 2. Dimension-by-dimension analysis
+
+Each dimension below lists what bounds it and the design (existing, in-flight,
+or missing) that addresses it.
+
+### (a) Data volume per node
+
+The amount of data one node can hold is bounded by Pebble capacity and by the
+memory each group's private cache/memtable pins.
+
+- **Range split — distribute a range across groups.** Same-group split
+  shipped in M1 (`distribution/`). Cross-group migration (the part that
+  actually relocates data and reduces per-node volume) is **PR #945**
+  (`docs/design/2026_06_11_proposed_hotspot_split_milestone2_migration.md`,
+  branch `docs/hotspot-split-m2-proposal`): a resumable `SplitJob` with
+  `PLANNED → BACKFILL → FENCE → DELTA_COPY → CUTOVER → CLEANUP → DONE` phases
+  driven by a migrator on the default-group leader. Auto-detection /
+  scheduling is **PR #951**
+  (`docs/design/2026_06_11_proposed_hotspot_split_milestone3_automation.md`,
+  branch `design/hotspot-split-m3-automation`), which drives detection off the
+  already-wired keyviz sampler rather than the dead `RecordAccess` counters.
+- **Range MERGE — missing, no design exists.** The inverse of split. Both the
+  parent hotspot doc (§2.2.2) and M2 (§2.2) list merge as an explicit
+  non-goal, and no `*_merge_*.md` exists in `docs/design/` (verified). Why it
+  is needed: without merge, a workload whose hot range cools, or one that
+  over-splits during a transient spike, accumulates permanent route fragments.
+  Each fragment costs a route entry, sampler cardinality, and — once a child
+  lives in its own group — a full group's per-node memory overhead (§1.2). A
+  long-running cluster's route count is monotonically non-decreasing, which is
+  not sustainable. Hard parts to sketch: choosing merge candidates (adjacent
+  routes, both cold, same or adjacent groups); fencing both sides atomically;
+  reconciling the two MVCC histories and the per-key internal keyspace
+  (`!txn|…`, list/hash/set/zset meta) without losing a committed write or an
+  unresolved prepare; cutover that bumps the catalog version exactly like
+  split; and the Composed-1 cross-group commit guard interaction (the same
+  apply-time hazard M2 §7.3 closes for split). Merge is strictly harder than
+  split because it must *unify* two independent commit-timestamp streams, not
+  bisect one.
+- **Blob offload — proposed.**
+  `docs/design/2026_04_25_proposed_s3_raft_blob_offload.md` keeps large object
+  payloads out of the Raft log (chunkref through Raft, chunkblob via a
+  peer-to-peer side channel with semi-synchronous quorum replication). This
+  bounds WAL/snapshot growth to O(manifest), which is the data-volume lever
+  for the S3 surface specifically. Still proposed; the legacy `BlobKey`-on-Raft
+  path is what runs today.
+- **Shared Pebble cache / resource pools — TODO only, promoted to a design
+  item here.** The `store/lsm_store.go:117-120` TODO is the single biggest
+  per-node memory tax as group count grows: N independent 256 MiB caches with
+  no shared eviction pool. This deserves a real design (see §3, Gap 2), not a
+  comment. It blocks high group counts on a single node, which in turn blocks
+  the "many small ranges" model that split (a) produces.
+
+### (b) Write throughput
+
+- **Multi-group** is the primary write-scaling lever: independent Raft
+  pipelines and independent serial apply loops (§1.7). Already in-tree
+  structurally.
+- **Leader balance — PR #953**
+  (`docs/design/2026_06_11_proposed_leader_balance_scheduler.md`, branch
+  `design/leader-balance-scheduler`) spreads group leaderships across nodes so
+  write-proposal load is not pinned to one node. Count-based v1 (TiKV
+  balance-leader equivalent), embedded in the default-group leader, default
+  OFF behind `--leaderBalance`.
+- **Multi-node multi-group bootstrap — GAP.** Leader balance is *blocked on a
+  topology that does not exist*: PR #953 §1.1a ("PR0") documents that no
+  startup wiring produces a group whose voters span more than one node — the
+  `len(groups) != 1` guard at `main.go:746` and the single-member bootstrap in
+  `multiraft_runtime.go:246-254` mean every group in a multi-group process is
+  single-voter, so there is nothing to transfer a leader *to*. Closing this is
+  a prerequisite for write-throughput scaling beyond one node, and a companion
+  proposal is in review as **PR #955**
+  (`docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md`, branch
+  `design/multinode-multigroup-bootstrap`): extend `--raftGroups` / a per-group
+  members flag to accept a multi-node voter set per group, lifting the
+  `len(groups)==1` guard. Once PR #955 lands it is the authoritative spec for
+  this gap; this roadmap treats it as the unblocking dependency for (b), (e),
+  and the region-balance gap in §3.
+
+### (c) Read throughput
+
+- **Lease reads — done.** Leader-served linearizable reads without a Raft
+  round-trip per read (`kv/lease_state.go`,
+  `docs/design/2026_04_20_implemented_lease_read.md`; the lease-read observer
+  is wired in at `main.go:426`). This scales leader read throughput but keeps
+  all reads on the leader.
+- **Learner reads → follower/learner reads.** The learner *primitive* is
+  effectively implemented despite the doc filename still reading
+  `2026_04_26_proposed_raft_learner.md`: `AddLearner` / `PromoteLearner` are
+  on the engine `Admin` interface (`internal/raftengine/engine.go:233`, `:245`)
+  and implemented in the etcd backend (`internal/raftengine/etcd/engine.go:1268`,
+  `:1289`, `ConfChangeAddLearnerNode` handling at `:2550`), the raftadmin CLI
+  has `add_learner` / `promote_learner` (`cmd/raftadmin/main.go:204-206`), and
+  `--raftJoinAsLearner` exists (`main.go:104`). The learner doc itself scopes
+  follower-served reads as an explicit non-goal (§2 "Non-goals", §8 OQ-5):
+  a `LinearizableRead` against a learner is **not** forwarded by the engine —
+  `handleRead` returns `ErrNotLeader` for any non-leader state
+  (`internal/raftengine/etcd/engine.go:1583`), and the *caller* must forward to
+  the leader (`docs/raft_learner_operations.md`, "Serve linearizable reads from
+  the learner"). So the read-replica attach primitive exists, but the design
+  that consumes it for off-leader reads **does not** — **follower-read design
+  is missing**, and it must supply the forwarding/proxy or replica-read API the
+  current primitive deliberately omits. Requirements to
+  sketch (Gap 3): a leader-issued read-timestamp pipeline so a follower/learner
+  serves a snapshot read at a ts the leader has vouched for (CLAUDE.md HLC
+  rule: never use the local wall clock or a follower-issued ts for MVCC
+  visibility / OCC / lease decisions); a staleness bound and a way for the
+  reader to know it has applied past that ts; lease invalidation interaction;
+  and adapter routing so reads can be steered to a replica. The centralized
+  TSO doc §9 Q4 and the learner doc §8 OQ-5 both flag this as the open
+  follow-on.
+
+### (d) Cross-group transactions at scale
+
+- **Per-group HLC vs centralized TSO.** Today the ceiling is renewed only on
+  the default group (§1.5); the TSO doc
+  (`docs/design/2026_04_16_proposed_centralized_tso.md`) proposes first the
+  near-term fix (renew on all led groups, in parallel — its M1) and then a
+  dedicated TSO Raft group with a batch allocator.
+  **Assessment of whether TSO is still needed once groups multiply:** the
+  near-term per-group fix (TSO doc §6) is *necessary regardless* — it is the
+  minimum correctness fix the moment a node leads a non-default group it is
+  member of, and it should land before multi-node multi-group bootstrap (b)
+  makes that topology reachable. The *full* dedicated-TSO component (TSO doc
+  M6–M7) is a larger question: with the per-group fix in place, every node's
+  timestamps are self-monotonic, but **global** monotonicity across nodes
+  leading different groups is still not guaranteed without a shared oracle
+  (TSO doc §6 "Guarantee"). Cross-group transactions (`kv/transaction.go`,
+  `kv/txn_codec.go`) that span groups led by different nodes need a single
+  ordering source for OCC commit-ts comparability. (The shared-snapshot
+  invariant — every operation in one txn reading at the *same* `startTS` — is
+  already upheld: `nextStartTS` allocates one `startTS` for the whole txn and
+  propagates it via `reqs.StartTS` to every participating group; the gap is the
+  cross-*node* comparability of the per-txn `commitTS`, not the per-txn
+  `startTS`. See OQ-1.) So: per-group renewal fix
+  is in-scope-soon and load-bearing; the dedicated TSO group is only justified
+  once (i) multi-node multi-group is real and (ii) cross-group transactions
+  are common enough that the batch-allocator amortization pays for the extra
+  Raft group. Until both hold, the per-group fix plus the shared-`*HLC`
+  property is adequate. The recommendation in §4 sequences the per-group fix
+  early and defers the dedicated TSO group behind real cross-group-txn load.
+
+### (e) Cluster size / membership
+
+- **Learner / conf-change — primitive present.** `AddVoter` / `AddLearner` /
+  `PromoteLearner` / `RemoveServer` exist on the engine and the raftadmin CLI
+  (see (c) / §1). Single-step V1 conf changes only; joint consensus is
+  rejected (`validateConfState`).
+- **Multi-node bootstrap — GAP** (same gap as (b)): the membership primitives
+  exist for *live* expansion (`AddVoter` after bootstrap), but there is no
+  startup wiring to declare a multi-node voter set per group, and no
+  integration harness for multi-voter groups across processes (PR #953 §1.1a;
+  Jepsen M5 runner records "True distributed multi-group is M6+ work",
+  `scripts/run-jepsen-m5-local.sh`).
+- **Auto group creation — explicit non-goal today, long-term.** Both the
+  hotspot parent doc (§2.2.1) and M2/M3 (§2.2) state automatic Raft group
+  creation / membership orchestration is out of scope. The scheduler only
+  targets groups already in `--raftGroups`. Noting it as a long-term item:
+  elastic scale-out (add a node → automatically create/rebalance groups onto
+  it) needs an auto group lifecycle, which depends on multi-node bootstrap (b),
+  region balance (§3 Gap 4), and merge (a) all being in place first.
+
+### (f) Operational scaling
+
+- **keyviz** — the per-route load sampler is wired and allocation-free on the
+  hot path (`observeMutation`, `kv/sharded_coordinator.go:1841-1846`), with proposed extensions
+  for cluster fan-out (`2026_04_27_proposed_keyviz_cluster_fanout.md`),
+  subrange sampling (`2026_05_25_proposed_keyviz_subrange_sampling.md`),
+  hot-key top-K (`2026_05_28_proposed_keyviz_hot_key_topk.md`), and per-cell
+  conflict (implemented). It is the detection signal M3 reuses.
+- **admin** — admin dashboard / data browser / purge-queue are implemented
+  (`2026_04_24_implemented_admin_dashboard.md`,
+  `2026_05_22_implemented_admin_data_browser.md`,
+  `2026_05_16_implemented_admin_purge_queue.md`).
+- **metrics cardinality** — recurring constraint across proposals (S3
+  admission `protocol`/`stage` labels, SQS per-queue counters bounded by a
+  top-N sketch, keyviz `MaxTrackedRoutes` coarsening). Any new per-route /
+  per-queue / per-peer metric must carry a cardinality bound; this is a
+  cross-cutting operational-scaling rule, not a single design.
+- **workload isolation** — `2026_04_24_proposed_workload_isolation.md` (heavy
+  command worker pool, optional Raft-thread pinning, per-client admission,
+  XREAD O(N)→O(new)), S3 PUT admission control
+  (`2026_04_25_proposed_s3_admission_control.md`), SQS per-queue throttling
+  (`2026_04_26_proposed_sqs_per_queue_throttling.md`). These bound *one
+  workload's* impact so it cannot starve the shared runtime / Raft control
+  plane; they scale a deployment by making it predictable under adversarial or
+  unbalanced load rather than by raising raw capacity.
+
+---
+
+## 3. Gap list → recommended new designs (ranked by leverage)
+
+Each gap is a design doc that should be written (`*_proposed_*.md`). Ranked by
+how much further scaling it unlocks. "Depends-on" lists hard prerequisites.
+
+### Gap 1 — Multi-node multi-group bootstrap (highest leverage)
+**Problem.** No startup wiring produces a Raft group whose voters span more
+than one node (`main.go:746` `len(groups) != 1` guard;
+`multiraft_runtime.go:246-254` single-member bootstrap). This blocks
+write-throughput scaling beyond one node, leader balance (nothing to transfer
+to), follower reads (no remote replica to read from), and every cluster-size
+dimension. **Rough milestones:** (M1) extend `--raftGroups` / add a per-group
+members flag to accept a multi-node voter set; lift the guard. (M2)
+integration harness that stands up multi-voter groups across processes. (M3)
+Jepsen multi-group multi-node workload. **Depends-on:** nothing (it is the
+root unblocker). *A companion proposal is in review as **PR #955**
+(`docs/design/2026_06_12_proposed_multinode_multigroup_bootstrap.md`, branch
+`design/multinode-multigroup-bootstrap`); once it lands it is the authoritative
+spec for this gap and this roadmap defers to it.*
+
+### Gap 2 — Shared Pebble cache / resource pools
+**Problem.** Each group's store allocates a private 256 MiB block cache
+(`store/lsm_store.go:273-279`); N groups = N × 256 MiB plus N memtables/WALs,
+with no shared eviction. This caps how many groups (hence how many ranges) one
+node can hold, throttling the "many small ranges" model that split produces.
+**Rough milestones:** (M1) process-wide shared `pebble.Cache` plumbed through
+`NewPebbleStore` (the existing TODO), with per-node sizing. (M2) shared
+memtable / compaction concurrency budget across stores. (M3) per-group
+fairness so one hot group cannot evict everyone else's working set.
+**Depends-on:** none functionally; pairs naturally with Gap 1 (high group
+counts only matter once multi-node groups exist).
+
+### Gap 3 — Follower / learner reads
+**Problem.** Reads are leader-only; the learner primitive exists but its
+consumer (off-leader reads) has no design. Read throughput cannot scale past
+one node's leader capacity. **Rough milestones:** (M1) leader-issued read-ts
+pipeline + a replica apply-watermark so a follower/learner can serve a
+snapshot read at a leader-vouched timestamp (CLAUDE.md HLC rule). (M2) lease
+invalidation interaction + staleness bound + adapter read routing to replicas.
+(M3) Jepsen stale-read bound workload. **Depends-on:** Gap 1 (multi-node
+groups to have a remote replica); the learner primitive (already in-tree).
+
+### Gap 4 — Region (range) balance scheduler
+**Problem.** Leader balance (PR #953) spreads *leaderships*; nothing spreads
+*data* (which node holds which range's replicas). After enough splits, ranges
+pile up unevenly across nodes. This is TiKV's separate balance-region
+scheduler; PR #953 §2.2(3) explicitly excludes replica/data movement.
+**Rough milestones:** (M1) data-balance policy (move a range's replica set from
+an over-full node to an under-full one) reusing the M2 migration plane for the
+actual move. (M2) compose with leader balance so the two schedulers don't
+fight. **Depends-on:** M2 migration plane (PR #945) for the data-movement
+mechanism, **and** Gap 1 (multi-node bootstrap) for somewhere to move replicas
+to. Edge: region balance ⟂ depends on M2 + multi-node bootstrap.
+
+### Gap 5 — Range merge
+**Problem.** No way to recombine ranges; route count is monotonically
+non-decreasing (§2(a)). Over-split or cooled workloads leak route fragments
+and (once children are in their own groups) per-node memory. **Rough
+milestones:** (M1) same-group merge of two adjacent cold routes (mirror M1
+split: catalog CAS + version bump). (M2) cross-group merge (relocate one
+child's data into the other's group, reusing the M2 migration plane in
+reverse). (M3) auto-merge in the M3 detector (cold-route hysteresis).
+**Depends-on:** M2 migration plane for cross-group merge; the Composed-1 guard
+for cutover coherence.
+
+### Gap 6 — Connection / transport scaling (streaming transport revival)
+**Problem.** Raft inter-node messages use unary gRPC per message
+(`docs/design/2026_04_18_proposed_raft_grpc_streaming_transport.md` §1): each
+send pays a full RTT, capping per-peer throughput at ~RTT⁻¹ regardless of
+bandwidth. This bites exactly when multi-node multi-group (Gap 1) puts real
+inter-node Raft traffic on the wire. **Rough milestones:** revive the existing
+streaming-transport proposal (client-streaming `SendStream` per peer, biased
+heartbeat select, backward-compat fallback). The blob-fetch RPC in the S3
+offload doc (§3.6) already wants to reuse the same chunked-streaming
+abstraction, so landing both behind one transport layer is the leverage.
+**Depends-on:** nothing hard; value scales with Gap 1.
+
+### Gap 7 — Auto group lifecycle (longest-term)
+**Problem.** Groups are static (`--raftGroups`). Elastic scale-out (add node →
+auto-create/rebalance groups) needs automatic group creation + membership
+orchestration, an explicit non-goal everywhere today (§2(e)). **Rough
+milestones:** out of near-term scope; sketch only. **Depends-on:** Gaps 1, 4,
+5 all in place (you cannot auto-create groups before you can stand up
+multi-node groups, move ranges between them, and merge fragments).
+
+---
+
+## 4. Sequencing (dependency-ordered rollout)
+
+The ordering is driven by unblock-edges, not by perceived value in isolation.
+
+1. **HLC per-group ceiling renewal fix** (TSO doc §6 / M1). Smallest correct
+   change; closes the cross-group monotonicity gap (§1.5) *before* the
+   topology that exposes it exists. Land first so multi-node groups are safe
+   on arrival.
+2. **Multi-node multi-group bootstrap** (Gap 1 / **PR #955**,
+   `2026_06_12_proposed_multinode_multigroup_bootstrap.md`). The root
+   unblocker for (b), (c), (e), Gap 3, Gap 4. Nothing else multi-node-shaped
+   can land until a group can have voters on more than one node.
+3. **Leader balance scheduler** (PR #953). Its PR0 is exactly Gap 1; PR1
+   (observe-only) can land against today's single-voter topology, but the
+   transfer-issuing PR2–PR3 are blocked on step 2. So: PR #953 PR1 in
+   parallel with step 2; PR2–PR3 after.
+4. **Hotspot split M2 migration plane** (PR #945). The data-movement
+   mechanism every later data-balance/merge step reuses. Independent of the
+   multi-node work for its own correctness (it moves ranges between groups
+   that already exist), so it can proceed in parallel with steps 2–3, but its
+   *value* compounds once groups span nodes.
+5. **Hotspot split M3 automation** (PR #951). Drives detection off keyviz;
+   delivers same-group auto-split standalone (does not require M2), and picks
+   a least-loaded target once M2 lands. After step 4 for the cross-group case.
+6. **Shared Pebble cache** (Gap 2). Needed once split + multi-node lets a node
+   hold many groups; land before pushing high group counts in production.
+7. **Follower / learner reads** (Gap 3). After step 2 (remote replicas exist)
+   and the learner primitive (already in-tree).
+8. **Region balance scheduler** (Gap 4). After step 4 (migration plane) and
+   step 2 (multi-node). Complement to step 3's leader balance.
+9. **Range merge** (Gap 5). After step 4 for cross-group merge.
+10. **Streaming transport** (Gap 6). Any time after step 2 makes inter-node
+    Raft traffic significant; pairs with the S3 blob-fetch RPC.
+11. **Dedicated TSO group** (TSO doc M6–M7) — placed last *on the assumption
+    that the per-group renewal fix (step 1) is sufficient until cross-group
+    transactions across node-spanning groups are common enough to amortize the
+    extra Raft group (§2(d))*. **This placement must be revisited the moment
+    step 2 lands** — see OQ-1. The decision is not an amortization tradeoff but
+    a correctness one: the trigger to pull this forward (potentially to
+    step 3) is **the first cross-group transaction whose participating groups
+    are led by different nodes**. The per-group fix gives per-node monotonicity
+    only (TSO doc §6 "Guarantee"); commit-ts comparability across coordinators
+    on different nodes is not covered by it (OQ-1). Treat step 11 as
+    "deferred-pending-OQ-1-resolution", not "settled-last".
+12. **Auto group lifecycle** (Gap 7) — long-term, after 2/4/8/9.
+
+In-flight PRs map cleanly: **#955** is step 2 (Gap 1 bootstrap proposal),
+**#953** is step 3 (and its PR0 = step 2's intent), **#945** is step 4,
+**#951** is step 5.
+
+---
+
+## 5. Open Questions
+
+1. **Per-group HLC fix vs full TSO ordering.** Is the per-group renewal fix
+   (step 1) sufficient for cross-group OCC correctness in the multi-node
+   topology, or does the first node-spanning cross-group transaction force the
+   dedicated TSO group earlier than step 11?
+
+   **Where the timestamps come from (grounded in code).** A cross-group txn's
+   coordinator issues *both* of its timestamps from one `*HLC` — `c.clock` on
+   the coordinator's node:
+   - `startTS` via `nextStartTS` (`kv/sharded_coordinator.go:1429`):
+     `Observe(maxLatestCommitTS(keys))` then `c.clock.NextFenced()`. The
+     `maxLatestCommitTS` floor is a per-key read against the store — it pins
+     `startTS` above the latest commit on *the keys this txn touches*, nothing
+     more.
+   - `commitTS` via `resolveTxnCommitTS` → `nextTxnTSAfter`
+     (`kv/sharded_coordinator.go:1102`, `:1376`): `c.clock.NextFenced()`,
+     re-allocated after `Observe(startTS)` if it did not strictly exceed
+     `startTS`. A caller-supplied `commitTS` is fed through `c.clock.Observe`
+     to keep the clock monotonic.
+
+   Both calls go through the same `c.clock`. The apply-time OCC/ownership check
+   then compares these timestamps against stored `CommitTS` values (the
+   Composed-1 guard, `docs/design/2026_05_29_partial_composed1_cross_group_commit_guard.md`
+   §4.2(a)/§4.4; FSM `latest > startTS` write-conflict check). OCC
+   serializability depends on those `commitTS` values being **mutually
+   comparable** across all participating groups.
+
+   **Why the per-group fix is necessary but not sufficient.** Today this is
+   safe only because every group shares the *same* process-wide `*HLC`
+   (single-node groups, §1.1/§1.5): one clock issues every timestamp, so all
+   `commitTS` are trivially comparable. The per-group renewal fix (step 1)
+   extends correctness to *one node leading several groups* — but its guarantee
+   is explicitly scoped: "all timestamps issued by a single node are strictly
+   monotonic … monotonicity across nodes that lead *different* groups is not
+   fully guaranteed without a shared TSO" (TSO doc §6 "Guarantee", §1.1). The
+   moment step 2 (multi-node bootstrap) makes it possible for the two groups of
+   a cross-group txn to be coordinated on *different nodes*, two concurrent
+   cross-group txns can draw `commitTS` from two different `*HLC` instances
+   whose ceilings were advanced independently. The `maxLatestCommitTS`/`Observe`
+   floor does not close this: it orders writes only on the specific keys read,
+   not the global commit order two unrelated cross-group txns need to be
+   serializable against each other.
+
+   **Conclusion / trigger.** OQ-1 therefore is *not* a "focused correctness
+   review deferrable until just before step 2 lands" — the per-group fix cannot
+   answer it in the affirmative for the cross-node case. The trigger to pull the
+   dedicated TSO group forward from step 11 is concrete: **the first cross-group
+   transaction whose participating groups are led by different nodes** (i.e. the
+   first real exercise of step 2). Until then the per-group fix + shared-`*HLC`
+   property holds. Open sub-question: whether an interim measure short of the
+   full TSO group — e.g. pinning every cross-group txn's timestamp allocation to
+   a single designated group's leader (the default group's `c.clock`), so all
+   cross-group `commitTS` still come from one clock — can bridge the gap between
+   step 2 and step 11 without the full batch-allocator TSO. That interim option
+   should be evaluated as part of the step-2 design (PR #955) rather than left
+   to step 11.
+2. **Shared cache (Gap 2) vs per-group isolation (workload isolation doc).** A
+   single shared block cache trades isolation for density: one hot group can
+   evict a latency-sensitive group's working set. How does Gap 2's per-group
+   fairness reconcile with the workload-isolation proposal's CPU-side
+   reservations? They are the memory- and CPU-axis siblings and should share a
+   resource-accounting vocabulary.
+3. **Merge (Gap 5) and unresolved prepares.** Merge must unify two MVCC
+   histories *and* two `!txn|…` keyspaces. What is the fence/drain protocol
+   that guarantees no in-flight prepare on either side is lost across the
+   merge cutover? (M2's split drain is the starting point but split bisects one
+   history; merge unifies two.)
+4. **Region balance (Gap 4) signal.** Count-based (ranges per node) like
+   leader balance v1, or size/load-weighted from keyviz from day one? Leader
+   balance chose count-first; data balance may need size from the start
+   because a 1 GiB range and a 1 KiB range are not interchangeable.
+5. **Follower-read staleness contract (Gap 3).** What bound does the adapter
+   surface advertise — bounded-staleness with an explicit lag, or
+   read-your-writes only? This determines whether the leader-issued read-ts
+   pipeline needs a per-client session token.
+6. **Auto group lifecycle (Gap 7) trigger.** What signal creates a new group —
+   node join, aggregate range count crossing a threshold, or operator action?
+   Premature auto-creation interacts badly with merge (create/merge thrash).
+7. **Live cutover / rolling-upgrade strategy for the single-node→multi-node
+   transition (Gap 1).** Moving a deployment from "one process hosts every
+   group, each single-voter" to genuine multi-node multi-group is a topology
+   change, not just a flag flip: a group that bootstrapped single-member must
+   gain voters on other nodes via live `AddVoter` conf-changes (the primitive
+   exists, §2(e)), and the cluster may run mixed binary versions mid-upgrade.
+   What is the supported path — operator-driven `AddVoter` expansion of an
+   existing single-voter group, blue/green with a dual-write proxy
+   (`proxy/`, the existing Redis-migration pattern), or a fresh cluster +
+   data migration? And what version-skew / capability-gating discipline keeps
+   a rolling upgrade safe while only some nodes understand the new per-group
+   members wiring? This is properly the companion bootstrap proposal's (PR #955)
+   to answer; the roadmap flags it here so it is not lost. (Note: the TSO doc
+   §7 already specifies a phased dual-write/shadow-read/feature-flag cutover for
+   the *timestamp* migration; the bootstrap cutover should mirror that
+   structure.)