docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator#958
docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator#958bootjp wants to merge 1 commit into
Conversation
…e tier, coordinator
elastickv today is sized for single DC, low-thousands of shards,
single-digit-TB per shard, and leader-bound serve paths. Each of
those four limits is reachable from the existing production envelope
within one growth cycle — and the four are coupled.
This roadmap captures the SLO targets, current-state breakage points,
per-subsystem milestone designs (high-level), cross-cutting concerns
(HLC ceiling composability, capability bits, observability), and the
sequencing constraint between them. Each milestone below ships its
own *_proposed_*.md doc when its work is queued — this file is the
shared north star.
Subsystems covered:
1. Routing / shard scale-out (target 1M routes)
- M1 delta-watcher + streaming WatchCatalog (replace 100ms poll +
full snapshot rebuild)
- M2 B-tree RouteEngine + per-group index + COW history overlay
- M3 BatchSplitRange for high-rate hotspot-split automation
2. Multi-region / cross-DC (target 2-3 region active-active)
- M1 WAN raft tuning + region-aware membership
- M2 per-region HLC ceiling with monotone merge on migration
(reuses M2 hotspot-split SetPhysicalCeiling + Observe primitive)
- M3 per-region catalog mirror via raft learners
- M4 cross-region disaster recovery
3. Storage tier (target 5-10 TB per shard)
- M1 SST-ingest snapshot transfer (kill the full-iter stream)
- M2 shared block cache + per-shard pebble tuning + L0
backpressure → ErrShardWriteBackpressure
- M3 per-shard compactor jittered + 100k version cap + hot-key
priority
- M4 S3 snapshot offload for DR
4. Coordinator / API gateway (target 5-10x QPS)
- M1 per-group HLC ceiling (kill default-group single point of
failure); cap_per_group_hlc_v1
- M2 follower-read path with max_staleness_ms contract
- M3 cross-shard 2PC (lifts ErrCrossShardTransactionNotSupported)
- M4 partitioned lock resolver (followers do the scan, leader
applies)
- M5 leader-proxy circuit breaker (kill 200-retry storm)
The sequencing (§8) puts §6 M1 first because it removes the
default-group HLC single point of failure that every later milestone
otherwise inherits. §3 M1 second because route-count growth feeds
both §4 (regions) and §6 M3 (cross-shard txn rebalancing freedom).
Cross-cutting invariant: every monotone-merge across the new
boundaries (per-shard, per-region) reuses the SetPhysicalCeiling +
Observe primitive that M2 hotspot-split (PR #945) specifies. The
capability bit rollout pattern follows the same M2 cap_migration_v2
contract.
10 open questions in §10 are reviewer-targets — particularly the
per-group HLC apply-pipeline cost (§10.1), the partitioned-region
migration policy (§10.2), and the cross-shard txn lock ordering
(§10.4) need design-level decisions before the first milestone PR
opens.
|
Warning Review limit reached
More reviews will be available in 33 minutes and 15 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more credits in the billing tab to continue. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2dda7d0b36
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - Followers serve reads at a bounded staleness, with the staleness | ||
| guaranteed by the existing lease infrastructure. A follower-read |
There was a problem hiding this comment.
Don't base follower reads on leader-only leases
When implementing §6 M2, followers cannot use the existing lease path: Coordinate.LeaseRead in kv/coordinator.go only returns the fast path when the engine state is StateLeader, and its secondary lease check also requires leader state. In this scenario the proposed follower path either remains leader-bound and misses the 5x read-QPS target, or bypasses the existing checks and serves reads without follower-visible quorum/closed-timestamp evidence. Please make this milestone depend on a follower-safe timestamp/read-index mechanism instead of the current leader-local lease infrastructure.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request proposes a comprehensive scaling roadmap for elastickv, detailing milestones across routing scale-out, multi-region active-active support, storage tier optimizations, and coordinator throughput improvements. The reviewer provided valuable feedback on several architectural details: warning that spawning a goroutine per Raft group for HLC lease renewal will cause severe CPU thrashing at scale; highlighting a potential state machine stall if transaction status lookups are performed synchronously during lock-resolver delegation; pointing out a conflict between the proposed 100k version cap and the 30-minute retention window for high-throughput keys; and recommending atomic copy-on-write catalog updates to prevent readers from observing transiently inconsistent route states during delta application.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| - `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop | ||
| (one goroutine per group, each ticking 1 s independently). |
There was a problem hiding this comment.
Running a dedicated goroutine per Raft group for HLC lease renewal (ShardedCoordinator.RunHLCLeaseRenewal) will not scale well when the cluster grows to the targeted 100k–1M routes/groups. Spawning and waking up hundreds of thousands of goroutines every second will cause severe CPU thrashing and scheduler overhead.
Instead of a per-group loop with its own goroutine, consider managing the renewals via a centralized timer wheel or a single background worker goroutine that iterates over or batches the active groups led by the node.
| - `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop | |
| (one goroutine per group, each ticking 1 s independently). | |
| - ShardedCoordinator.RunHLCLeaseRenewal managed via a centralized timer wheel or a single background worker (rather than a dedicated goroutine per group) to avoid scheduler overhead at scale. |
| - Net: the leader is no longer the single scan point — followers | ||
| do the scan, leaders just apply the resolved batches. |
There was a problem hiding this comment.
When delegating the lock-resolver scan to a follower, the design needs to clarify where the primary transaction status lookup (which requires querying the primary shard, potentially on a remote node/region) is performed:
- If the follower performs the lookup before proposing the batch, the follower must execute remote RPCs.
- If the leader performs the lookup after receiving the proposed batch of expired locks, the leader must execute those RPCs.
In either case, performing these lookups synchronously within the Raft apply loop is a major hazard that can stall the state machine. The design should explicitly state that transaction status lookups must be performed asynchronously outside the Raft execution path, with only the final, fully-resolved commit/abort decisions proposed/applied to the log.
| - Net: the leader is no longer the single scan point — followers | |
| do the scan, leaders just apply the resolved batches. | |
| - Net: the leader is no longer the single scan point — followers do the scan and asynchronously resolve transaction statuses against primary shards before proposing the final resolved batches. |
| - Per-key version cap: from 1 M (`maxSnapshotVersionCount`) | ||
| down to 100 k AND a compactor cycle that prioritises hot keys | ||
| (closes the 555 writes/s 30-min hot-key cap regression). |
There was a problem hiding this comment.
Lowering the per-key version cap from 1M down to 100k (line 378) while maintaining a 30-minute retention window (line 374) introduces a conflict for high-throughput keys.
At 100k versions, any key written at more than 100,000 / 1800 ≈ 55 writes/second will exceed the cap within the 30-minute window. If the compactor enforces the 100k cap by discarding older versions, it will violate the 30-minute retention guarantee for those hot keys. The design should clarify how the system resolves the conflict between the hard version cap and the time-based retention window for high-write-rate keys.
| - Per-key version cap: from 1 M (`maxSnapshotVersionCount`) | |
| down to 100 k AND a compactor cycle that prioritises hot keys | |
| (closes the 555 writes/s 30-min hot-key cap regression). | |
| - Per-key version cap: dynamically scale or bound the cap based on the write rate to ensure the 30-min retention window is not violated for hot keys, while prioritizing hot-key compaction. |
| - Watcher pulls deltas, applies under fine-grained per-route lock | ||
| (per-route version vector instead of a global slice rebuild). |
There was a problem hiding this comment.
Applying deltas under fine-grained per-route locks (lines 201-202) can lead to transiently inconsistent catalog states where some route updates are visible to readers while others are not.
Since the Composed-1 cross-version-read gate and the history ring (M2) rely on consistent point-in-time RouteHistorySnapshot views, any non-atomic application of deltas could cause routing anomalies or spurious validation failures. The delta application should be batched and committed atomically to the RouteEngine (e.g., via a copy-on-write version transition) so that readers always observe a fully consistent catalog version.
| - Watcher pulls deltas, applies under fine-grained per-route lock | |
| (per-route version vector instead of a global slice rebuild). | |
| - Watcher pulls deltas, applies them to a copy-on-write version overlay, and publishes the new catalog version atomically to prevent readers from observing transiently inconsistent route states. |
Summary
Proposed roadmap doc covering 4 scaling subsystems elastickv will need within one growth cycle. Each subsystem ships its own
*_proposed_*.mdmilestone doc when its work queues up — this file is the shared north star + sequencing constraint.Subsystems:
Cross-cutting: every monotone-merge across new boundaries reuses M2 hotspot-split's
SetPhysicalCeiling + Observeprimitive. Capability bit rollout follows the existingcap_migration_v2contract.Sequencing (§8): §6 M1 (per-group HLC) → §3 M1 (delta watcher) → §5 M1/M2 (SST ingest + shared cache) → §3 M2/M3 → §4 (regions) → §6 M2-M5 → §5 M4. §6 M1 first because it removes the default-group HLC SPOF that every later milestone otherwise inherits.
10 open questions in §10 — the apply-pipeline cost of per-group HLC, partitioned-region migration policy, and cross-shard txn lock ordering need design-level decisions before the first milestone PR opens.
Self-review
Test plan