Skip to content

docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator#958

Open
bootjp wants to merge 1 commit into
mainfrom
design/scaling-roadmap-2026-06
Open

docs(design): scaling roadmap — multi-region, route scale-out, storage tier, coordinator#958
bootjp wants to merge 1 commit into
mainfrom
design/scaling-roadmap-2026-06

Conversation

@bootjp

@bootjp bootjp commented Jun 12, 2026

Copy link
Copy Markdown
Owner

Summary

Proposed roadmap doc covering 4 scaling subsystems elastickv will need within one growth cycle. Each subsystem ships its own *_proposed_*.md milestone doc when its work queues up — this file is the shared north star + sequencing constraint.

Subsystems:

  • §3 Routing (1M routes target): delta-watcher streaming, B-tree RouteEngine, batched splits
  • §4 Multi-region (2-3 region active-active): WAN raft tuning, per-region HLC ceiling with monotone merge, region-local catalog mirrors
  • §5 Storage (5-10 TB/shard): SST-ingest snapshot transfer, shared block cache + per-shard tuning, per-shard compactor, S3 snapshot offload
  • §6 Coordinator (5-10x QPS): per-group HLC ceiling (kill default-group SPOF), follower reads, cross-shard 2PC, partitioned lock resolver, leader-proxy circuit breaker

Cross-cutting: every monotone-merge across new boundaries reuses M2 hotspot-split's SetPhysicalCeiling + Observe primitive. Capability bit rollout follows the existing cap_migration_v2 contract.

Sequencing (§8): §6 M1 (per-group HLC) → §3 M1 (delta watcher) → §5 M1/M2 (SST ingest + shared cache) → §3 M2/M3 → §4 (regions) → §6 M2-M5 → §5 M4. §6 M1 first because it removes the default-group HLC SPOF that every later milestone otherwise inherits.

10 open questions in §10 — the apply-pipeline cost of per-group HLC, partitioned-region migration policy, and cross-shard txn lock ordering need design-level decisions before the first milestone PR opens.

Self-review

  1. Data loss — N/A doc-only; the doc itself preserves M2 hotspot-split's monotone-merge contract for every new boundary
  2. Concurrency / distributed failures — surfaced as OQ (per-group HLC propose cost, partitioned-region policy, cross-shard txn deadlock prevention)
  3. Performance — SLO targets quantified per subsystem (§3.1, §4.1, §5.1, §6.1)
  4. Data consistency — explicitly: every new HLC boundary uses the same SetPhysicalCeiling primitive; capability bits gate every wire change
  5. Test coverage — §9 table maps each subsystem to unit/property tests + Jepsen workloads (existing route-shuffle extended to 100k routes; new multi-region partition workload; existing workloads with 10x data)

Test plan

  • doc reads coherent end-to-end
  • sequencing graph in §8 has no cycles
  • cross-refs to existing M2 hotspot-split design verified
  • reviewer feedback on §10 open questions (these are the design-level decisions blocking first milestone PR)

…e tier, coordinator

elastickv today is sized for single DC, low-thousands of shards,
single-digit-TB per shard, and leader-bound serve paths. Each of
those four limits is reachable from the existing production envelope
within one growth cycle — and the four are coupled.

This roadmap captures the SLO targets, current-state breakage points,
per-subsystem milestone designs (high-level), cross-cutting concerns
(HLC ceiling composability, capability bits, observability), and the
sequencing constraint between them. Each milestone below ships its
own *_proposed_*.md doc when its work is queued — this file is the
shared north star.

Subsystems covered:

1. Routing / shard scale-out (target 1M routes)
   - M1 delta-watcher + streaming WatchCatalog (replace 100ms poll +
     full snapshot rebuild)
   - M2 B-tree RouteEngine + per-group index + COW history overlay
   - M3 BatchSplitRange for high-rate hotspot-split automation

2. Multi-region / cross-DC (target 2-3 region active-active)
   - M1 WAN raft tuning + region-aware membership
   - M2 per-region HLC ceiling with monotone merge on migration
     (reuses M2 hotspot-split SetPhysicalCeiling + Observe primitive)
   - M3 per-region catalog mirror via raft learners
   - M4 cross-region disaster recovery

3. Storage tier (target 5-10 TB per shard)
   - M1 SST-ingest snapshot transfer (kill the full-iter stream)
   - M2 shared block cache + per-shard pebble tuning + L0
     backpressure → ErrShardWriteBackpressure
   - M3 per-shard compactor jittered + 100k version cap + hot-key
     priority
   - M4 S3 snapshot offload for DR

4. Coordinator / API gateway (target 5-10x QPS)
   - M1 per-group HLC ceiling (kill default-group single point of
     failure); cap_per_group_hlc_v1
   - M2 follower-read path with max_staleness_ms contract
   - M3 cross-shard 2PC (lifts ErrCrossShardTransactionNotSupported)
   - M4 partitioned lock resolver (followers do the scan, leader
     applies)
   - M5 leader-proxy circuit breaker (kill 200-retry storm)

The sequencing (§8) puts §6 M1 first because it removes the
default-group HLC single point of failure that every later milestone
otherwise inherits. §3 M1 second because route-count growth feeds
both §4 (regions) and §6 M3 (cross-shard txn rebalancing freedom).

Cross-cutting invariant: every monotone-merge across the new
boundaries (per-shard, per-region) reuses the SetPhysicalCeiling +
Observe primitive that M2 hotspot-split (PR #945) specifies. The
capability bit rollout pattern follows the same M2 cap_migration_v2
contract.

10 open questions in §10 are reviewer-targets — particularly the
per-group HLC apply-pipeline cost (§10.1), the partitioned-region
migration policy (§10.2), and the cross-shard txn lock ordering
(§10.4) need design-level decisions before the first milestone PR
opens.
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@bootjp, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 33 minutes and 15 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more credits in the billing tab to continue.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3e07a4c4-3e95-4f52-bafd-ddad9a7b66bd

📥 Commits

Reviewing files that changed from the base of the PR and between 688ba8a and 2dda7d0.

📒 Files selected for processing (1)
  • docs/design/2026_06_12_proposed_scaling_roadmap.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch design/scaling-roadmap-2026-06

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2dda7d0b36

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +437 to +438
- Followers serve reads at a bounded staleness, with the staleness
guaranteed by the existing lease infrastructure. A follower-read

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Don't base follower reads on leader-only leases

When implementing §6 M2, followers cannot use the existing lease path: Coordinate.LeaseRead in kv/coordinator.go only returns the fast path when the engine state is StateLeader, and its secondary lease check also requires leader state. In this scenario the proposed follower path either remains leader-bound and misses the 5x read-QPS target, or bypasses the existing checks and serves reads without follower-visible quorum/closed-timestamp evidence. Please make this milestone depend on a follower-safe timestamp/read-index mechanism instead of the current leader-local lease infrastructure.

Useful? React with 👍 / 👎.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request proposes a comprehensive scaling roadmap for elastickv, detailing milestones across routing scale-out, multi-region active-active support, storage tier optimizations, and coordinator throughput improvements. The reviewer provided valuable feedback on several architectural details: warning that spawning a goroutine per Raft group for HLC lease renewal will cause severe CPU thrashing at scale; highlighting a potential state machine stall if transaction status lookups are performed synchronously during lock-resolver delegation; pointing out a conflict between the proposed 100k version cap and the 30-minute retention window for high-throughput keys; and recommending atomic copy-on-write catalog updates to prevent readers from observing transiently inconsistent route states during delta application.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +431 to +432
- `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop
(one goroutine per group, each ticking 1 s independently).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Running a dedicated goroutine per Raft group for HLC lease renewal (ShardedCoordinator.RunHLCLeaseRenewal) will not scale well when the cluster grows to the targeted 100k–1M routes/groups. Spawning and waking up hundreds of thousands of goroutines every second will cause severe CPU thrashing and scheduler overhead.

Instead of a per-group loop with its own goroutine, consider managing the renewals via a centralized timer wheel or a single background worker goroutine that iterates over or batches the active groups led by the node.

Suggested change
- `ShardedCoordinator.RunHLCLeaseRenewal` becomes a per-group loop
(one goroutine per group, each ticking 1 s independently).
- ShardedCoordinator.RunHLCLeaseRenewal managed via a centralized timer wheel or a single background worker (rather than a dedicated goroutine per group) to avoid scheduler overhead at scale.

Comment on lines +468 to +469
- Net: the leader is no longer the single scan point — followers
do the scan, leaders just apply the resolved batches.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When delegating the lock-resolver scan to a follower, the design needs to clarify where the primary transaction status lookup (which requires querying the primary shard, potentially on a remote node/region) is performed:

  1. If the follower performs the lookup before proposing the batch, the follower must execute remote RPCs.
  2. If the leader performs the lookup after receiving the proposed batch of expired locks, the leader must execute those RPCs.

In either case, performing these lookups synchronously within the Raft apply loop is a major hazard that can stall the state machine. The design should explicitly state that transaction status lookups must be performed asynchronously outside the Raft execution path, with only the final, fully-resolved commit/abort decisions proposed/applied to the log.

Suggested change
- Net: the leader is no longer the single scan point — followers
do the scan, leaders just apply the resolved batches.
- Net: the leader is no longer the single scan point — followers do the scan and asynchronously resolve transaction statuses against primary shards before proposing the final resolved batches.

Comment on lines +378 to +380
- Per-key version cap: from 1 M (`maxSnapshotVersionCount`)
down to 100 k AND a compactor cycle that prioritises hot keys
(closes the 555 writes/s 30-min hot-key cap regression).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Lowering the per-key version cap from 1M down to 100k (line 378) while maintaining a 30-minute retention window (line 374) introduces a conflict for high-throughput keys.

At 100k versions, any key written at more than 100,000 / 1800 ≈ 55 writes/second will exceed the cap within the 30-minute window. If the compactor enforces the 100k cap by discarding older versions, it will violate the 30-minute retention guarantee for those hot keys. The design should clarify how the system resolves the conflict between the hard version cap and the time-based retention window for high-write-rate keys.

Suggested change
- Per-key version cap: from 1 M (`maxSnapshotVersionCount`)
down to 100 k AND a compactor cycle that prioritises hot keys
(closes the 555 writes/s 30-min hot-key cap regression).
- Per-key version cap: dynamically scale or bound the cap based on the write rate to ensure the 30-min retention window is not violated for hot keys, while prioritizing hot-key compaction.

Comment on lines +201 to +202
- Watcher pulls deltas, applies under fine-grained per-route lock
(per-route version vector instead of a global slice rebuild).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Applying deltas under fine-grained per-route locks (lines 201-202) can lead to transiently inconsistent catalog states where some route updates are visible to readers while others are not.

Since the Composed-1 cross-version-read gate and the history ring (M2) rely on consistent point-in-time RouteHistorySnapshot views, any non-atomic application of deltas could cause routing anomalies or spurious validation failures. The delta application should be batched and committed atomically to the RouteEngine (e.g., via a copy-on-write version transition) so that readers always observe a fully consistent catalog version.

Suggested change
- Watcher pulls deltas, applies under fine-grained per-route lock
(per-route version vector instead of a global slice rebuild).
- Watcher pulls deltas, applies them to a copy-on-write version overlay, and publishes the new catalog version atomically to prevent readers from observing transiently inconsistent route states.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant