docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator by bootjp · Pull Request #958 · bootjp/elastickv

bootjp · 2026-06-12T06:46:29Z

Summary

Proposed roadmap doc covering 4 scaling subsystems elastickv will need within one growth cycle. Each subsystem ships its own *_proposed_*.md milestone doc when its work queues up — this file is the shared north star + sequencing constraint.

Subsystems:

§3 Routing (1M routes target): delta-watcher streaming, B-tree RouteEngine, batched splits
§4 Multi-region (2-3 region active-active): WAN raft tuning, per-region HLC ceiling with monotone merge, region-local catalog mirrors
§5 Storage (5-10 TB/shard): SST-ingest snapshot transfer, shared block cache + per-shard tuning, per-shard compactor, S3 snapshot offload
§6 Coordinator (5-10x QPS): per-group HLC ceiling (kill default-group SPOF), follower reads, cross-shard 2PC, partitioned lock resolver, leader-proxy circuit breaker

Cross-cutting: every monotone-merge across new boundaries reuses M2 hotspot-split's SetPhysicalCeiling + Observe primitive. Capability bit rollout follows the existing cap_migration_v2 contract.

Sequencing (§8): §6 M1 (per-group HLC) → §3 M1 (delta watcher) → §5 M1/M2 (SST ingest + shared cache) → §3 M2/M3 → §4 (regions) → §6 M2-M5 → §5 M4. §6 M1 first because it removes the default-group HLC SPOF that every later milestone otherwise inherits.

10 open questions in §10 — the apply-pipeline cost of per-group HLC, partitioned-region migration policy, and cross-shard txn lock ordering need design-level decisions before the first milestone PR opens.

Self-review

Data loss — N/A doc-only; the doc itself preserves M2 hotspot-split's monotone-merge contract for every new boundary
Concurrency / distributed failures — surfaced as OQ (per-group HLC propose cost, partitioned-region policy, cross-shard txn deadlock prevention)
Performance — SLO targets quantified per subsystem (§3.1, §4.1, §5.1, §6.1)
Data consistency — explicitly: every new HLC boundary uses the same SetPhysicalCeiling primitive; capability bits gate every wire change
Test coverage — §9 table maps each subsystem to unit/property tests + Jepsen workloads (existing route-shuffle extended to 100k routes; new multi-region partition workload; existing workloads with 10x data)

Test plan

doc reads coherent end-to-end
sequencing graph in §8 has no cycles
cross-refs to existing M2 hotspot-split design verified
reviewer feedback on §10 open questions (these are the design-level decisions blocking first milestone PR)

…e tier, coordinator elastickv today is sized for single DC, low-thousands of shards, single-digit-TB per shard, and leader-bound serve paths. Each of those four limits is reachable from the existing production envelope within one growth cycle — and the four are coupled. This roadmap captures the SLO targets, current-state breakage points, per-subsystem milestone designs (high-level), cross-cutting concerns (HLC ceiling composability, capability bits, observability), and the sequencing constraint between them. Each milestone below ships its own *_proposed_*.md doc when its work is queued — this file is the shared north star. Subsystems covered: 1. Routing / shard scale-out (target 1M routes) - M1 delta-watcher + streaming WatchCatalog (replace 100ms poll + full snapshot rebuild) - M2 B-tree RouteEngine + per-group index + COW history overlay - M3 BatchSplitRange for high-rate hotspot-split automation 2. Multi-region / cross-DC (target 2-3 region active-active) - M1 WAN raft tuning + region-aware membership - M2 per-region HLC ceiling with monotone merge on migration (reuses M2 hotspot-split SetPhysicalCeiling + Observe primitive) - M3 per-region catalog mirror via raft learners - M4 cross-region disaster recovery 3. Storage tier (target 5-10 TB per shard) - M1 SST-ingest snapshot transfer (kill the full-iter stream) - M2 shared block cache + per-shard pebble tuning + L0 backpressure → ErrShardWriteBackpressure - M3 per-shard compactor jittered + 100k version cap + hot-key priority - M4 S3 snapshot offload for DR 4. Coordinator / API gateway (target 5-10x QPS) - M1 per-group HLC ceiling (kill default-group single point of failure); cap_per_group_hlc_v1 - M2 follower-read path with max_staleness_ms contract - M3 cross-shard 2PC (lifts ErrCrossShardTransactionNotSupported) - M4 partitioned lock resolver (followers do the scan, leader applies) - M5 leader-proxy circuit breaker (kill 200-retry storm) The sequencing (§8) puts §6 M1 first because it removes the default-group HLC single point of failure that every later milestone otherwise inherits. §3 M1 second because route-count growth feeds both §4 (regions) and §6 M3 (cross-shard txn rebalancing freedom). Cross-cutting invariant: every monotone-merge across the new boundaries (per-shard, per-region) reuses the SetPhysicalCeiling + Observe primitive that M2 hotspot-split (PR #945) specifies. The capability bit rollout pattern follows the same M2 cap_migration_v2 contract. 10 open questions in §10 are reviewer-targets — particularly the per-group HLC apply-pipeline cost (§10.1), the partitioned-region migration policy (§10.2), and the cross-shard txn lock ordering (§10.4) need design-level decisions before the first milestone PR opens.

coderabbitai · 2026-06-12T06:46:38Z

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 33 minutes and 15 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e07a4c4-3e95-4f52-bafd-ddad9a7b66bd

📥 Commits

Reviewing files that changed from the base of the PR and between 688ba8a and 2dda7d0.

📒 Files selected for processing (1)

docs/design/2026_06_12_proposed_scaling_roadmap.md

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch design/scaling-roadmap-2026-06

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2dda7d0b36

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T06:48:16Z

+- Followers serve reads at a bounded staleness, with the staleness
+  guaranteed by the existing lease infrastructure. A follower-read


Don't base follower reads on leader-only leases

When implementing §6 M2, followers cannot use the existing lease path: Coordinate.LeaseRead in kv/coordinator.go only returns the fast path when the engine state is StateLeader, and its secondary lease check also requires leader state. In this scenario the proposed follower path either remains leader-bound and misses the 5x read-QPS target, or bypasses the existing checks and serves reads without follower-visible quorum/closed-timestamp evidence. Please make this milestone depend on a follower-safe timestamp/read-index mechanism instead of the current leader-local lease infrastructure.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request proposes a comprehensive scaling roadmap for elastickv, detailing milestones across routing scale-out, multi-region active-active support, storage tier optimizations, and coordinator throughput improvements. The reviewer provided valuable feedback on several architectural details: warning that spawning a goroutine per Raft group for HLC lease renewal will cause severe CPU thrashing at scale; highlighting a potential state machine stall if transaction status lookups are performed synchronously during lock-resolver delegation; pointing out a conflict between the proposed 100k version cap and the 30-minute retention window for high-throughput keys; and recommending atomic copy-on-write catalog updates to prevent readers from observing transiently inconsistent route states during delta application.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T06:48:29Z

+- `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop
+  (one goroutine per group, each ticking 1 s independently).


Running a dedicated goroutine per Raft group for HLC lease renewal (ShardedCoordinator.RunHLCLeaseRenewal) will not scale well when the cluster grows to the targeted 100k–1M routes/groups. Spawning and waking up hundreds of thousands of goroutines every second will cause severe CPU thrashing and scheduler overhead.

Instead of a per-group loop with its own goroutine, consider managing the renewals via a centralized timer wheel or a single background worker goroutine that iterates over or batches the active groups led by the node.

Suggested change

- `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop

(one goroutine per group, each ticking 1 s independently).

- ShardedCoordinator.RunHLCLeaseRenewal managed via a centralized timer wheel or a single background worker (rather than a dedicated goroutine per group) to avoid scheduler overhead at scale.

gemini-code-assist · 2026-06-12T06:48:29Z

+- Net: the leader is no longer the single scan point — followers
+  do the scan, leaders just apply the resolved batches.


When delegating the lock-resolver scan to a follower, the design needs to clarify where the primary transaction status lookup (which requires querying the primary shard, potentially on a remote node/region) is performed:

If the follower performs the lookup before proposing the batch, the follower must execute remote RPCs.

If the leader performs the lookup after receiving the proposed batch of expired locks, the leader must execute those RPCs.

In either case, performing these lookups synchronously within the Raft apply loop is a major hazard that can stall the state machine. The design should explicitly state that transaction status lookups must be performed asynchronously outside the Raft execution path, with only the final, fully-resolved commit/abort decisions proposed/applied to the log.

Suggested change

- Net: the leader is no longer the single scan point — followers

do the scan, leaders just apply the resolved batches.

- Net: the leader is no longer the single scan point — followers do the scan and asynchronously resolve transaction statuses against primary shards before proposing the final resolved batches.

gemini-code-assist · 2026-06-12T06:48:29Z

+- Per-key version cap: from 1 M (`maxSnapshotVersionCount`)
+  down to 100 k AND a compactor cycle that prioritises hot keys
+  (closes the 555 writes/s 30-min hot-key cap regression).


Lowering the per-key version cap from 1M down to 100k (line 378) while maintaining a 30-minute retention window (line 374) introduces a conflict for high-throughput keys.

At 100k versions, any key written at more than 100,000 / 1800 ≈ 55 writes/second will exceed the cap within the 30-minute window. If the compactor enforces the 100k cap by discarding older versions, it will violate the 30-minute retention guarantee for those hot keys. The design should clarify how the system resolves the conflict between the hard version cap and the time-based retention window for high-write-rate keys.

Suggested change

- Per-key version cap: from 1 M (`maxSnapshotVersionCount`)

down to 100 k AND a compactor cycle that prioritises hot keys

(closes the 555 writes/s 30-min hot-key cap regression).

- Per-key version cap: dynamically scale or bound the cap based on the write rate to ensure the 30-min retention window is not violated for hot keys, while prioritizing hot-key compaction.

gemini-code-assist · 2026-06-12T06:48:29Z

+- Watcher pulls deltas, applies under fine-grained per-route lock
+  (per-route version vector instead of a global slice rebuild).


Applying deltas under fine-grained per-route locks (lines 201-202) can lead to transiently inconsistent catalog states where some route updates are visible to readers while others are not.

Since the Composed-1 cross-version-read gate and the history ring (M2) rely on consistent point-in-time RouteHistorySnapshot views, any non-atomic application of deltas could cause routing anomalies or spurious validation failures. The delta application should be batched and committed atomically to the RouteEngine (e.g., via a copy-on-write version transition) so that readers always observe a fully consistent catalog version.

Suggested change

- Watcher pulls deltas, applies under fine-grained per-route lock

(per-route version vector instead of a global slice rebuild).

- Watcher pulls deltas, applies them to a copy-on-write version overlay, and publishes the new catalog version atomically to prevent readers from observing transiently inconsistent route states.

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator#958

docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator#958
bootjp wants to merge 1 commit into
mainfrom
design/scaling-roadmap-2026-06

bootjp commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026

Review limit reached

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		- Followers serve reads at a bounded staleness, with the staleness
		guaranteed by the existing lease infrastructure. A follower-read

		- `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop
		(one goroutine per group, each ticking 1 s independently).

	- `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop
	(one goroutine per group, each ticking 1 s independently).
	- ShardedCoordinator.RunHLCLeaseRenewal managed via a centralized timer wheel or a single background worker (rather than a dedicated goroutine per group) to avoid scheduler overhead at scale.

		- Net: the leader is no longer the single scan point — followers
		do the scan, leaders just apply the resolved batches.

		- Watcher pulls deltas, applies under fine-grained per-route lock
		(per-route version vector instead of a global slice rebuild).

	- Watcher pulls deltas, applies under fine-grained per-route lock
	(per-route version vector instead of a global slice rebuild).
	- Watcher pulls deltas, applies them to a copy-on-write version overlay, and publishes the new catalog version atomically to prevent readers from observing transiently inconsistent route states.

Conversation

bootjp commented Jun 12, 2026

Summary

Self-review

Test plan

Uh oh!

coderabbitai Bot commented Jun 12, 2026

Review limit reached

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant